BSIM explained once and for all!

Executive Summary

This article provides a technical deep dive into Ghidra's Behavioral Similarity (BSIM) feature, released in December 2023. Authored by researchers from SightHouse, the post elucidates the previously undocumented internals of BSIM's C++ implementation and theoretical framework. BSIM enables the identification of semantically equivalent binary functions across different compilers and architectures by lifting code to P-code and generating feature vectors via locality-sensitive hashing. The analysis covers the decompiler process, SLEIGH definitions, and the normalization of instructions into High P-code. While valuable for reverse engineers and malware analysts seeking to understand binary similarity engines, the article does not report on specific cyber threats, threat actors, or active malware campaigns. Consequently, there are no immediate security impacts or mitigations regarding adversarial activity. The content serves purely as educational documentation for enhancing binary analysis capabilities within the Ghidra ecosystem.

Summary

Since its initial release in December 2023, many people have used and built tools around the BSIM feature of Ghidra, but to date its internals have been unknown. This post sheds some light on how BSIM works, both theoretically and in its C++ implementation.

Published Analysis

Introduction

During our work on SightHouse, we evaluated several binary similarity engines to find one that met our needs. After thorough evaluation, we chose Ghidra's Behavioral Similarity (BSIM) feature. One key difference between BSIM and other approaches is that, despite being open-source, its algorithm is sparsely documented. Existing documentation [1] indicates that BSIM uses locality-sensitive hashing and cosine similarity, but the description is brief and incomplete. So here it is, once and for all, BSIM finally explained!

All information in this post regarding Ghidra refers to the code in the Ghidra_12.0_build tag on GitHub.
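
As a reminder of what the cosine similarity named above measures, here is a minimal sketch in Python. It treats each function as a bag of integer hash values and computes a plain, unweighted cosine similarity over hash counts. The function name `cosine_similarity` and the sample hash values are hypothetical and chosen only for illustration; this is not BSIM's actual scoring code, which may weight or normalize features differently.

```python
import math
from collections import Counter


def cosine_similarity(features_a, features_b):
    """Unweighted cosine similarity between two bags of integer feature hashes.

    Illustrative only: this is an assumption-laden sketch, not Ghidra's code.
    """
    counts_a = Counter(features_a)
    counts_b = Counter(features_b)

    # Dot product over the hashes that appear in both bags.
    shared = counts_a.keys() & counts_b.keys()
    dot = sum(counts_a[h] * counts_b[h] for h in shared)

    # Euclidean norms of the two count vectors.
    norm_a = math.sqrt(sum(c * c for c in counts_a.values()))
    norm_b = math.sqrt(sum(c * c for c in counts_b.values()))

    # An empty bag has no direction; report zero similarity.
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


# Hypothetical feature hashes for two functions sharing three of four features.
func_a = [0x1A2B3C4D, 0x5E6F7081, 0x92A3B4C5, 0xD6E7F809]
func_b = [0x1A2B3C4D, 0x5E6F7081, 0x92A3B4C5, 0x0BADF00D]

score = cosine_similarity(func_a, func_b)
print(f"similarity = {score:.2f}")
```

With non-negative counts the score always falls between 0 and 1: identical bags score 1, and bags with no hash in common score 0.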

BSIM Overview

BSIM is designed to identify whether two binary functions implement the same semantics, regardless of compiler, optimization level, or target architecture. It works by first lifting each function through Ghidra's decompiler to obtain P-code instructions, Ghidra's architecture-independent intermediate representation of the decompiled code [2]. These instructions are considered "raw" or "Low P-code"; the decompiler then normalizes away compiler noise by stripping dead flag computations, abstracting stack mechanics, and producing a clean SSA (Static Single Assignment) dataflow graph. This refined form of P-code is called "High P-code". It shares the same grammar as raw P-code but is rewritten into a cleaner, normalized form, with a few notable differences: for instance, the MULTIEQUAL operation (a phi-node) appears only in High P-code.

Once High P-code has been generated, BSIM iterates over these refined instructions and incrementally hashes them into a "feature vector" (a vector of integer hash values). These feature hashes form a function fingerprint, which is stored in a database (local, PostgreSQL, or Elasticsearch). When querying for similar functions, BSIM retrieves candidates from the database by comparing feature vector similarity scores. The result is a similarity score between 0 and 1 that reliably identifies semantically equivalent functions.

The figure below presents the different steps of the BSIM pipeline:

[Figure: the BSIM pipeline]

The next parts of this post break down these two steps: feature generation and how the resulting vectors are compared.

Ghidra Architecture

To understand how BSIM works, we first need to explain how Ghidra operates. Ghidra is mainly written in Java, except for a few components including the decompiler, which is written in C++. The decompiler sources are located under Ghidra/Features/Decompiler/src/decompile/cpp, referred to later in this post as DECOMP_DIR.

The two environments communicate through a small custom serial protocol: the decompiler process reads its input on stdin and returns results on stdout. The implementation is available at DECOMP_DIR/ghidra_process.cc. Whenever Ghidra needs to decompile a function, it spawns (or reuses) one of the decompiler processes. It sends all necessary information (raw bytes, processor definitions, address spaces, etc.) to that process and then displays the decompilation results in the UI.

The decompiler loads a SLEIGH [3] definition corresponding to the processor identifier (for example, x86:LE:64:default). SLEIGH is a processor description language originally based on SLED [4] but refined for Ghidra's needs. SLEIGH has two main goals: enabling disassembly and decompilation. For decompilation, SLEIGH specifies the translation from machine instructions into P-code.

P-code is a register-transfer language (RTL) designed to capture the semantics of machine instructions in a uniform, processor-independent form. Code for different processors can be translated straightforwardly into P-code, allowing a single suite of analysis tools to perform...