SightHouse: Automated function identification

In this blog post we present SightHouse, an open-source tool designed to assist reverse engineers by retrieving information and metadata from programs and identifying similar functions already known from other libraries, binaries, or any other source code that can be found online.

Introduction

Whether you are new to reverse engineering or have years of experience, you have likely encountered a common challenge: distinguishing relevant software components from third-party libraries within firmware or programs. This task is challenging and time-consuming, and effort is often wasted reversing code that turns out to be irrelevant.

Software evolves rapidly, compelling reverse engineers to continuously adapt. Modern programs are complex, requiring analysis of thousands of functions and layers of abstraction introduced by SDKs and newer programming languages like Rust or Golang. Additionally, while LLM-generated code accelerates development, it tends to produce repetitive, often vulnerable patterns across models [1], leaving reverse engineers to sift through yet another source of redundant code.

To address this challenge, numerous approaches have emerged over the years, ranging from IDA FLIRT [2], released in 1996, to the latest innovations of the current Large Language Model (LLM) era. Most of these static analysis approaches aim to solve the Binary Similarity problem: identifying similar functions based on a given representation, such as raw bytes, assembly code, Intermediate Representation (IR), or source code. However, choosing the right tool is not straightforward, as each solution has its own strengths and limitations. Once you have selected an algorithm for your needs, you often need to compute a large database of known function signatures to make the tool effective.
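To make the Binary Similarity problem concrete, here is a minimal sketch of one very simple representation-based approach: comparing functions by the Jaccard similarity of their instruction-mnemonic bigrams. This is not the algorithm used by SightHouse or by any of the tools cited in this post. The function names (`ngrams`, `jaccard`, `best_match`), the `KNOWN` table, the threshold, and the mnemonic sequences are all hypothetical and chosen only for illustration.

```python
from typing import Dict, List, Optional, Set, Tuple

Ngram = Tuple[str, ...]


def ngrams(mnemonics: List[str], n: int = 2) -> Set[Ngram]:
    """Return the set of n-grams of consecutive instruction mnemonics."""
    return {tuple(mnemonics[i:i + n]) for i in range(len(mnemonics) - n + 1)}


def jaccard(a: Set[Ngram], b: Set[Ngram]) -> float:
    """Jaccard similarity: size of the intersection over size of the union."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def best_match(
    query: List[str],
    database: Dict[str, List[str]],
    n: int = 2,
    threshold: float = 0.5,
) -> Tuple[Optional[str], float]:
    """Return (name, score) of the most similar known function.

    The name is None when the best score is below the threshold.
    """
    query_grams = ngrams(query, n)
    best_name: Optional[str] = None
    best_score = 0.0
    for name, mnemonics in database.items():
        score = jaccard(query_grams, ngrams(mnemonics, n))
        if score > best_score:
            best_name, best_score = name, score
    if best_score >= threshold:
        return best_name, best_score
    return None, best_score


# Hypothetical signature database: function name -> mnemonic sequence.
# These sequences are made up and do not come from real binaries.
KNOWN = {
    "memcpy": ["push", "mov", "test", "jz", "mov", "mov",
               "inc", "dec", "jnz", "pop", "ret"],
    "strlen": ["push", "mov", "xor", "cmp", "jz", "inc",
               "jmp", "pop", "ret"],
}

# An unnamed function from a stripped binary (also made up): the same
# sequence as "memcpy" above, minus the final "pop".
unknown = ["push", "mov", "test", "jz", "mov", "mov",
           "inc", "dec", "jnz", "ret"]

name, score = best_match(unknown, KNOWN)
print(name, round(score, 2))
```

With this toy data the query shares 8 of 11 distinct bigrams with the hypothetical `memcpy` entry, so the script prints `memcpy 0.73`. A mnemonic-based representation like this one cannot match across architectures, and the quality of the result depends entirely on how complete the `KNOWN` table is, which in practice would be a large database of signatures.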
The creation and maintenance of these signature databases can be particularly challenging for researchers, as they need to continuously identify, compile, and extract new signatures from programs. Moreover, the reverse engineering ecosystem is fragmented, which limits collaboration and contribution among reverse engineers. Many available solutions are tightly coupled with specific Software Reverse Engineering (SRE) tools like IDA Pro, Binary Ninja, or Ghidra, which hinders their adoption and integration across different workflows.

To address these challenges, we present SightHouse, a new function identification tool designed to automate the creation of signature databases and integrate seamlessly with your preferred SRE environment.

Choosing the right tool

We stand on the shoulders of giants. As mentioned earlier, many tools have emerged over the years, and we aimed to identify the best fit for our specific use cases. First and foremost, the algorithm needed to be free and open-source, with a permissive license allowing integration into our project. This constraint ruled out commercial solutions like IDA Pro or Binary Ninja. We sought a solution that could handle multiple architectures while ultimately providing a cross-architecture capability (for example, comparing an x86 build of memcpy with an ARM32 build). Additionally, the algorithm needed to be scalable, capable of supporting server-based queries from multiple clients, and able to deliver strong performance even when processing millions of functions.

To evaluate potential solutions, we benchmarked approaches that represent the state of the art in academia, such as jTrans [3] or GMN [4], as well as more "industrial" ones like FunctionSimSearch [5], FunctionID [6], and BSIM [7]. For our experiments, we created a new dataset using projects from PlatformIO [8], a software aggregator for embedded projects, to include architectures like ARM, RISC-V, and XTensa.
We also added well-known projects such as glibc, sqlite, openssl, curl, and zlib, all compiled for x86. This resulted in 9,775 programs, 379,822 functions, and 782 MB of storage. We duplicated the dataset, stripped the symbols from the copy, and then applied each algorithm to reassign function names.

Some might argue that using the same dataset for both signature extraction and comparison is problematic (a known issue in traditional machine learning). However, we did not use this dataset for training any models. Instead, the results of each algorithm were contextually independent, relying solely on mathematical computations. Furthermore, some algorithms are designed to recognize specific byte sequences, which means they would fail if those sequences did not appear in the final database.

Here are the results of our experiments. For those unfamiliar with the chosen metrics, here is a short explanation:

Precision: Measures the ability to retrieve accurate matches.
Recall: Indicates how effectively the algorithm identifies all instances of the same function.
F1-Score: Represents the harmonic mean of Precision and Recall, providing a balanced measure of both accuracy and...
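The three metrics above can be illustrated with a short sketch applied to the name-reassignment setup described earlier. The post does not spell out how matches were counted, so the counting rule here is an assumption: a prediction is correct when the reassigned name equals the original symbol, precision is measured over the functions that received a name, and recall over all functions. The `evaluate` helper, the addresses, and the sample names are hypothetical.

```python
from typing import Dict, Optional, Tuple


def evaluate(
    truth: Dict[int, str],
    predicted: Dict[int, Optional[str]],
) -> Tuple[float, float, float]:
    """Compute (precision, recall, f1) for reassigned function names.

    truth maps a function address to its original symbol name.
    predicted maps an address to the name proposed by the algorithm,
    or None when the algorithm proposed nothing.
    Assumed counting rule (not taken from the post): a prediction is
    correct only when it equals the original name exactly.
    """
    named = [addr for addr in truth if predicted.get(addr) is not None]
    correct = sum(1 for addr in named if predicted[addr] == truth[addr])

    precision = correct / len(named) if named else 0.0
    recall = correct / len(truth) if truth else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Illustrative data only: four functions from a stripped binary.
truth = {
    0x1000: "memcpy",
    0x1040: "strlen",
    0x1080: "inflate",
    0x10C0: "sqlite3_open",
}
predicted = {
    0x1000: "memcpy",   # correct
    0x1040: "strcpy",   # wrong name
    0x1080: "inflate",  # correct
    0x10C0: None,       # no match proposed
}

precision, recall, f1 = evaluate(truth, predicted)
print(f"precision={precision:.3f} recall={recall:.3f} f1={f1:.3f}")
```

On this toy input, two of the three proposed names are correct (precision 0.667), two of the four functions are recovered (recall 0.500), and the harmonic mean of the two is 4/7 (F1 0.571).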
