How to write your first obfuscator of Java Bytecode

In this article I describe Java bytecode obfuscation, using one of the challenges I completed in 2023 as part of my interviews with Quarkslab for the position of Java compiler engineer on the QShield team.

Introduction

In the middle of my PhD, back in 2023, I was writing a static analysis tool for Android's Dalvik EXecutable format. Someone from the LLVM community recommended that I present the topic at EuroLLVM, the conference dedicated to the LLVM compilation framework, because my analysis tool used part of this framework as its Intermediate Representation. After the conference, someone recommended that I apply for a Java Compiler Engineer position in Quarkslab's QShield team, to develop obfuscations for Java, and in Java...

Transforming Java Bytecode

As part of the interview process, one of the tasks was to write a simple obfuscation for Java bytecode. Having previously written a disassembler for Android Dalvik bytecode, I thought I could quickly learn about Java bytecode and how to manipulate it for obfuscation purposes. One of the requirements was to use the ASM library, which is specifically designed for Java bytecode analysis and manipulation. The obfuscation technique to implement was opaque predicates.

Before diving into the opaque predicate implementation, this section covers the necessary background on three topics: Java bytecode, comparisons between bytecode and assembly, and the ASM library.

Java bytecode

Java is an object-oriented language built around the idea of write once, run anywhere, in contrast to languages like C or C++, which are compiled for a specific architecture (the code is translated to a binary representation for one processor architecture). In Java, the code is compiled to an intermediate representation (also known as bytecode) that can be run on any device hosting a Java Virtual Machine. At run time, the Java Virtual Machine translates the bytecode into instructions for the host architecture, either by interpreting it or by compiling it just in time.
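To make the idea of a platform-independent compiled form concrete, here is a minimal sketch that reads the first eight bytes of a class file shipped with the JDK. It relies only on the standard library. The class name ClassFileHeader and the helper read are hypothetical names chosen for this illustration, and java/lang/Object is simply a class that is assumed to be reachable through the system class loader.

```java
import java.io.DataInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;

// Illustrative helper (hypothetical name): inspects the header of a class file.
final class ClassFileHeader {

    /** Returns {magic, minorVersion, majorVersion} for a class given its internal name. */
    static int[] read(String internalName) {
        String resource = internalName + ".class";
        try (InputStream raw = ClassLoader.getSystemResourceAsStream(resource)) {
            if (raw == null) {
                throw new IllegalArgumentException("class file not found: " + resource);
            }
            // Class files are big-endian, which matches DataInputStream.
            DataInputStream in = new DataInputStream(raw);
            int magic = in.readInt();            // always 0xCAFEBABE
            int minor = in.readUnsignedShort();  // minor version
            int major = in.readUnsignedShort();  // major version, depends on the JDK
            return new int[] { magic, minor, major };
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        int[] header = read("java/lang/Object");
        System.out.printf("magic=0x%08X minor=%d major=%d%n",
                header[0], header[1], header[2]);
    }
}
```

The magic number is the same on every platform, while the major version printed depends on the JDK that produced the class file.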
The bytecode itself is a sequence of instructions designed for a stack-based virtual machine. Each instruction consists of a one-byte opcode followed by zero or more operands, which keeps the Java bytecode instruction set very small compared with some computer architectures like x86. The JVM specification defines around 200 opcodes, so not all of the 256 possible byte values are currently used. These instructions operate on a few key data areas: the operand stack (where most operations happen), the local variables, and the constant pool.

Let's look at a simple example. Consider this Java method:

    public int add(int a, int b) {
        return a + b;
    }

It compiles to the following bytecode:

    0: iload_1
    1: iload_2
    2: iadd
    3: ireturn

The iload_1 and iload_2 instructions push the first and second integer parameters onto the operand stack. The parameters are read from the local variables mentioned above; in an instance method, local 0 holds the this reference, so the parameters start at local 1. The iadd instruction pops these two values, adds them, and pushes the result back. Finally, ireturn returns the integer value from the top of the stack.

Now consider a method with a branch:

    public int max(int a, int b) {
        if (a > b) {
            return a;
        } else {
            return b;
        }
    }

It compiles to:

    0: iload_1
    1: iload_2
    2: if_icmple 7
    5: iload_1
    6: ireturn
    7: iload_2
    8: ireturn

The control flow is handled by if_icmple (if integer compare less than or equal), which pops the two values from the stack, compares them, and jumps to bytecode offset 7 if the first is less than or equal to the second. Notice how the conditional jump targets are explicit bytecode offsets, making the control flow graph straightforward to reconstruct. The stack operations (iload_1, iload_2) and typed comparisons (if_icmple is specific to integers) make it clear what is being compared and how.

Java bytecode has explicit control-flow instructions and uses a stack-based architecture for parameter passing. Additionally, Java bytecode includes type information that is preserved during compilation.
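To see how the operand stack and local variables interact, the add and max listings above can be replayed with a toy interpreter. This is only a sketch under simplifying assumptions, not how a real JVM is implemented: the names ToyStackMachine, Insn, and run are hypothetical, only five instructions are supported, and jump targets are indices into the instruction array rather than byte offsets.

```java
import java.util.ArrayDeque;
import java.util.Deque;

// Toy interpreter for a handful of JVM-like instructions (illustration only).
final class ToyStackMachine {

    // Hypothetical instruction encoding: a mnemonic plus an optional jump target.
    static final class Insn {
        final String op;
        final int target; // index into the program array, -1 when unused

        Insn(String op, int target) {
            this.op = op;
            this.target = target;
        }
    }

    static Insn insn(String op) {
        return new Insn(op, -1);
    }

    /** Runs the program; locals[0] stands in for the slot normally holding 'this'. */
    static int run(Insn[] program, int... locals) {
        Deque<Integer> stack = new ArrayDeque<>();
        int pc = 0;
        while (true) {
            Insn current = program[pc];
            switch (current.op) {
                case "iload_1":
                    stack.push(locals[1]);
                    pc++;
                    break;
                case "iload_2":
                    stack.push(locals[2]);
                    pc++;
                    break;
                case "iadd": {
                    int right = stack.pop();
                    int left = stack.pop();
                    stack.push(left + right);
                    pc++;
                    break;
                }
                case "if_icmple": {
                    int value2 = stack.pop(); // pushed last
                    int value1 = stack.pop(); // pushed first
                    pc = (value1 <= value2) ? current.target : pc + 1;
                    break;
                }
                case "ireturn":
                    return stack.pop();
                default:
                    throw new IllegalArgumentException("unknown opcode: " + current.op);
            }
        }
    }

    static int add(int a, int b) {
        Insn[] program = {
            insn("iload_1"), insn("iload_2"), insn("iadd"), insn("ireturn")
        };
        return run(program, 0, a, b);
    }

    static int max(int a, int b) {
        Insn[] program = {
            insn("iload_1"), insn("iload_2"),
            new Insn("if_icmple", 5), // index 5 plays the role of byte offset 7
            insn("iload_1"), insn("ireturn"),
            insn("iload_2"), insn("ireturn")
        };
        return run(program, 0, a, b);
    }

    public static void main(String[] args) {
        System.out.println(add(2, 3)); // prints 5
        System.out.println(max(2, 3)); // prints 3
        System.out.println(max(7, 4)); // prints 7
    }
}
```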
Explicit control flow, a stack-based model, and preserved type information make bytecode significantly easier to disassemble and decompile than native machine code: decompilers can leverage the type metadata and structured control flow to reconstruct high-level Java source code with good accuracy. For this reason, obfuscating Java bytecode before releasing a Java-based product is important if we want to prevent our code from being easily analyzed.

Bytecode vs Assembly: Key Differences

Execution model: Assembly uses the processor's registers directly (like eax and ebx in x86), while Java bytecode operates on a virtual stack provided by the Java Virtual Machine. In assembly, we explicitly move data between registers; in bytecode, we push to and pop from the virtual stack.

Portability: Assembly is architecture-specific (e.g. ARM code won't run on x86). Bytecode runs anywhere a JVM is available.

Memory access: Assembly provides direct memory access with pointers and addresses. Bytecode abstracts this away: we work with object references and array indices, with bounds checking built in, etc.

Verification: Bytecode is...

