February 12, 2023 - Arnau Gàmez i Montolio
CC BY 4.0
[ This article is also available as an interactive jupyter notebook ]
This article offers a quick introduction to SMT-based program analysis. The goal is not to build or dive deep into the internal workings of an SMT solver. Rather, we want to explore how an SMT solver can be leveraged to reason about code properties in practice. Thus, for the most part, an SMT solver will be considered as a magic black-box that can answer a very simple question, as we will see shortly.
Before starting with the actual content, just a quick notation reminder.
Operation | Math | Code |
---|---|---|
AND | ∧ | & |
OR | ∨ | | |
NOT | ¬ | ~ |
XOR | ⊕ | ^ |
To give an appropriate context, let's start by talking briefly about the SAT problem (short for boolean satisfiability problem).
Consider some boolean variables: A,B,C
Consider a bunch of boolean expressions, which we will call constraints: φ1=(A∨¬B)∧Cφ2=(¬A∨B)∧(A∨C)∧(B∨¬C)
These expressions come in conjunctive normal form (CNF), which just means that are written as ANDs of ORs of (possibly negated) variables. We will not discuss why CNF is used, but let's just say it is a standard representation with nice properties. Also, any boolean formula can be translated into an equivalent CNF one, so we can use it without loss of generality.
Taking this into account, the only thing that a SAT solver does is answering a very simple question: is the set of constraints {φ1,φ2} satisfiable?
In other words: is there any (boolean) variable assignment (for A,B,C) such that all constraints (φ1,φ2) hold true (i.e. φ1∧φ2=1)?
There are three possible outcomes to this question:
Now that we have a basic understanding of what does thinking about satisfiability look like, let's introduce SMT and SMT solvers.
First, SMT stands for satisfiability modulo theories. Modulo is one of these buzzwords that mathematicians and people working on formal methods like to use a lot (guilty here). In practice, you can vaguely just read modulo as "taking into account" whatever comes later. In this particular case, we have satisfiability (which we already have an intuition for) but taking into account (other; more) theories.
Thus, SMT allows to reason about satisfiability in a broader sense than SAT. That is, satisfiability with respect to (taking into account) expressions that live in different realms (theories) other than just plain boolean expressions. And what are these theories? Essentially, frameworks that define the variables and operations (and the rules to combine them) at our disposal. Some examples of theories are bit vectors, integer numbers, real numbers or floating point.
Note: The theory of integer numbers is a model for the abstract math integers, not the programming integers of a fixed size of 32 or 64 bits. Similarly, the theory of real numbers deals with the abstract real numbers, not any kind of arbitrary-precision floating point.
For the most part, program analysis relies on the theory of bit vectors. Here, the variables are fixed-size bit vectors, i.e. variables thought as arrays (vectors) of n bits; n-bit values. The theory also includes the usual arithmetic and logic operations, as well as the operations of concatenation (e.g. join two 32-bit values into a 64-bit value) and extraction (e.g. retrieve the lowest 8 bits from a 32-bit variable).
Note: In the theory of bit vectors, the notion of (un)signedness is not inherent to the variable itself (which is simply a raw array of bits), but rather a property of the operations (e.g. signed or unsigned comparisons).
The best way to understand how such technology can be leveraged for program analysis is by going over some concrete examples using an actual SMT solver.
For this purpose, we will use the Z3 Theorem Prover, an open source SMT solver by Microsoft. We mainly choose Z3 because it provides a great Python API which is easy to install, use and read, even having no prior experience with SMT solvers or SMT-LIB standard language (syntax).
So, let's get to it!
The first thing we need to do is importing everything from Z3 within our python shell/script.
from z3 import *
Use the SMT solver to generate a set of inputs that satisfies some property or produces a desired output.
Problem 1
Given the following C function:
bool check(uint32_t x, uint32_t y)
{
return x*x*y + (3+x)*(7+y) == 1337;
}
Find input values x, y
such that the function check
returns true
.
Solution 1
# Define x, y as bit vectors of size 32 bits
x, y = BitVecs('x y', 32)
# Initialize the solver
solver = Solver()
# Add constraint
solver.add(x*x*y + (3+x)*(7+y) == 1337)
# Check if the set of constraints is satisfiable
if solver.check() == sat:
# Get model and print it nicely
m = solver.model()
print(f"x = {m[x]}\ny = {m[y]}")
x = 204374014 y = 334463242
Note: The solution (model) does not have to be unique. Actually, if you run this code several times, it is very probable that you get different solutions. Remember, the only thing that a SAT outcome guarantees is that some solution (model) exists for the set of constraints, and it will find a concrete one (among possibly a large amount of them).
Problem 2
Consider the following C function:
bool verify_license(char *code)
{
size_t sz = strlen(code);
if (sz != 6) return false;
char flag = 'A';
for (int i = 0; i < sz; i++)
{
if ((code[i] >= '0' && code[i] <= '9') ||
(code[i] >= 'A' && code[i] <= 'Z'))
{
flag = flag ^ code[i];
}
else
{
return false;
}
}
return flag == 'Z';
}
Find valid input codes, i.e. build a keygen.
Solution 2
# Define code
# List (array) of 6 bit vectors of size 8 bits (char)
code = [BitVec(f"c{i}", 8) for i in range(6)]
# Define flag
# Bit vector of size 8 bits with a concrete value 'A'
flag = BitVecVal(ord('A'), 8)
# Initialize the solver
solver = Solver()
# Iterate over all chars of code
for c in code:
# Add constraint
# The character has to lay within the valid range
solver.add(Or(
And(c >= ord('0'), c <= ord('9')),
And(c >= ord('A'), c <= ord('Z'))
))
# Update the flag bit vector
flag = flag ^ c
# Add constraint
# The end value of flag has to be 'Z'
solver.add(flag == ord('Z'))
# Check if the set of constraints is satisfiable
if solver.check() == sat:
# Get model and print it nicely
m = solver.model()
s = ''.join(chr(m[c].as_long()) for c in code)
print(f"code = {s}")
code = E2JJ4X
Note: If you have ever done any reverse engineering challenge, you might be familiar with reversing key verification algorithms in order to build a key generation algorithm. The nice thing about this SMT solver approach is that you do not need to reverse anything. Rather, just encode the very same verification algorithm as a set of constraints to be satisfied and let the SMT solver generate valid keys for you.
Use the SMT solver to count how many solutions exist for a given set of constraints.
The idea is to run the satisfiability check in a loop, saving (counting) the current solution and adding at each iteration a constraint to exclude it from subsequent iterations.
Problem 3
Consider the previous verify_license
function.
How many valid codes exist that start with ABCD
? Print them.
Solution 3
# ... run previous code ...
# Add the extra constraints
solver.add(code[0] == ord('A'))
solver.add(code[1] == ord('B'))
solver.add(code[2] == ord('C'))
solver.add(code[3] == ord('D'))
# Define a list to hold the valid solutions
s = []
# While the set of constraints is satisfiable
while solver.check() == sat:
# Get model
m = solver.model()
# Save solution
s.append(''.join(
chr(m[c].as_long()) for c in code)
)
# Add constraint
# Exclude current solution
solver.add(
Not(And([c == m[c] for c in code]))
)
# Print the outcome
print(f"{len(s)} valid codes start with ABCD:\n---")
print("\n".join(s))
22 valid codes start with ABCD: --- ABCDGX ABCDFY ABCDRM ABCDVI ABCDWH ABCDOP ABCDNQ ABCDJU ABCDZE ABCDXG ABCDHW ABCDIV ABCDKT ABCDLS ABCDTK ABCDEZ ABCDMR ABCDYF ABCDQN ABCDUJ ABCDPO ABCDSL
Use the SMT solver to check if two expressions are semantically equivalent, i.e. whether for every possible (same) input they produce the same output.
The naive approach would be to simply add to the solver a constraint where the two expressions are equated.
solver.add(exp1 == exp2)
if (solver.check() == sat): # INCORRECT
But beware! This is not correct. Here we are asking the SMT solver whether it exists some variable assignment (inputs) to the expressions exp1
and exp2
such that their output matches. But that is not what we are looking for. We want to check whether the expressions are semantically equivalent, i.e. that their outcome will match for all possible inputs.
Instead, we need to use as a constraint that the two expressions are different (i.e. we will ask whether it exists a variable assignment (inputs) such that the two expressions produce a different outcome). Then, we will verify that this constraint is unsatisfiable. This means that there is no way of giving the same input to the expressions and have them produce a different output. In other words, for every possible (same) input they produce the same output, as we want.
solver.add(exp1 != exp2)
if (solver.check() == unsat): # CORRECT
Problem 4
Consider the following C functions, which define obfuscated operations:
uint32_t obf_op1(uint32_t x, uint32_t y)
{
return (x ^ y) + 2*(x & y);
}
uint32_t obf_op2(uint32_t x, uint32_t y)
{
return (x | y) - y + (~x & y);
}
Find out the underlying semantics of obf_op1
and obf_op2
.
Solution 4
# Return true if the expressions are equivalent
def check_semantic_equivalence(exp1, exp2):
solver = Solver()
solver.add(exp1 != exp2)
return solver.check() == unsat
# Define x, y as bit vectors of size 32 bits
x, y = BitVecs('x y', 32)
# Define a list of operations to check against
basic_operations = [
x + y,
x - y,
x * y,
x & y,
x | y,
x ^ y
]
# Define the two obfuscated operations
obf_op1 = (x ^ y) + 2*(x & y)
obf_op2 = (x | y) - y + (~x & y)
# Check if the obfuscated expressions match the semantics
# of some of the basic operations
for op in basic_operations:
if check_semantic_equivalence(op, obf_op1):
print(f"{op} == {obf_op1}")
if check_semantic_equivalence(op, obf_op2):
print(f"{op} == {obf_op2}")
x + y == (x ^ y) + 2*(x & y) x ^ y == (x | y) - y + (~x & y)
Use the SMT solver to detect false branching (and unreachable code) through the use of obfuscation based on opaque predicates.
In a nutshell, an opaque predicate is a (non-trivial) conditional statement for which the truth value is known a priori by the author, but concealed to the analyst.
Problem 5
Consider the following C function:
uint8_t opaques(uint8_t x, uint8_t y)
{
uint8_t z;
if ((uint8_t)(151 * (39 * ((x ^ y) + 2 * (x & y)) + 23) + 111) >
(uint8_t)((x ^ y) + 2 * (x & y)))
{
z = x & y;
}
else if ((uint8_t)(x-y + 2*(~x&y) - (x^y)) == 0x17)
{
z = x | y;
}
else if (
(uint8_t)(195 +
97*x +
159*y +
194*~(x | ~y) +
159*(x ^ y) +
(163 + x + 255*y + 2*~(x | ~y) + 255*(x ^ y))*
(232 + 248*x + 8*y + 240*~(x | ~y) + 8*(x ^ y)) - 57) < 100
)
{
z = x ^ y;
}
else {
z = 0;
}
return z;
}
Provide an equivalent function cleaning all false branching.
Solution 5
# Detect conditional statements that are always false
def check_false_predicate(p):
solver = Solver()
solver.add(p)
return solver.check() == unsat
# Detect conditional statements that are always true
def check_true_predicate(p):
solver = Solver()
solver.add(Not(p)) # <-- Not(p)
return solver.check() == unsat
# Define x, y as bit vectors of size 8 bits
x, y = BitVecs('x y', 8)
# Define the conditional statements to check
predicates = [
# if ...
151*(39*((x ^ y) + 2*(x & y)) + 23) + 111 >
(x ^ y) + 2*(x & y),
# else if ...
x-y + 2*(~x&y) - (x^y) == 0x17,
# else if ...
195 +
97*x +
159*y +
194*~(x | ~y) +
159*(x ^ y) +
(163 + x + 255*y + 2*~(x | ~y) + 255*(x ^ y))*
(232 + 248*x + 8*y + 240*~(x | ~y) + 8*(x ^ y)) - 57 < 100
]
# Iterate over all conditional statements
# Check whether they are opaque predicates
for p in predicates:
if check_false_predicate(p):
print(f"ALWAYS FALSE\n---\n{p}\n")
elif check_true_predicate(p):
print(f"ALWAYS TRUE\n---\n{p}\n")
ALWAYS FALSE --- 151*(39*((x ^ y) + 2*(x & y)) + 23) + 111 > (x ^ y) + 2*(x & y) ALWAYS FALSE --- x - y + 2*(~x & y) - (x ^ y) == 23 ALWAYS TRUE --- 195 + 97*x + 159*y + 194*~(x | ~y) + 159*(x ^ y) + (163 + x + 255*y + 2*~(x | ~y) + 255*(x ^ y))* (232 + 248*x + 8*y + 240*~(x | ~y) + 8*(x ^ y)) - 57 < 100
Thus, the first two branches where z = x & y
and z = x | y
are unreachable. Also, the third branch where z = x ^ y
will always be reached, meaning that fourth branch where z = 0
is unreachable too. In the end, we can remove all branching and end up with an equivalent function which simply computes and returns the XOR of two 8-bit input values.
uint8_t deobf_opaques(uint8_t x, uint8_t y)
{
uint8_t z = x ^ y;
return z
}
Use the SMT solver to directly encode control-flow constructs. Within Z3's Python API, this can be achieved through the following syntax:
If([condition], [then], [else])
.
Problem 6
Define a function that encodes the previous opaques
function using control-flow constructs in Z3. Then, check that it is semantically equivalent to the XOR expression x ^ y
.
Solution 6
# Define the opaques function
# Nested Z3's control-flow constructs
def opaques(p1, p2, p3):
return If(p1, x & y, If(p2, x | y, If(p3, x ^ y, 0)))
# Define x, y as bit vectors of size 8 bits
x, y = BitVecs('x y', 8)
# Unpack the three conditional statements
p1, p2, p3 = [p for p in predicates]
# Verify that the function is equivalent to an XOR
check_semantic_equivalence(
opaques(p1, p2, p3),
x ^ y
)
True
SMT solvers are way more powerful and expressive than we have shown here. For instance, they usually accept encoding first-order predicates. In practice, that means that we can use quantifiers within our constraints (e.g. for all possible values of x, exists a value of y such that ...).
The code examples are in C
deliberately, for the sake of accessibility. However, the concepts discussed easily apply to any other language. In particular, our motivation to study and use SMT solvers comes mostly from a security research and reverse engineering point-of-view, when dealing with obfuscated code in assembly.
You might be wondering how can we extract such expressions when dealing with binaries and assembly code. Essentially, there are two main approaches:
In particular, symbolic execution is a technique that can be applied to any language (high-level programming language, assembly of some architecture, or even intermediate representations). It has its (practical) limitations, but in general allows to semi-automatically extract formulas (constraints) from the control-flow and data-flow of a program (function, basic block...) of a given language. Also, symbolic execution engines provide a seamless interface into one (or more) SMT solvers. Thus, the whole program analysis process of extracting constraints and performing queries to the SMT solvers can be done in a sane way (i.e. not having to copy-paste around), especially when we are interested in mid-large scale automation.
Maybe (hopefully) we can talk about symbolic execution another time.
Last but not least, a shameless self-plug to our software protection and deobfuscation trainings, where we leverage SMT solvers, among many other tools and techniques, within the context of binary analysis.