{
 "cells": [
  {
   "cell_type": "markdown",
   "id": "3227e661-958f-4b07-b3f7-910381eb728b",
   "metadata": {},
   "source": [
    "# A gentle introduction to SMT-based program analysis\n",
    "February 12, 2023&nbsp; - &nbsp;Arnau Gàmez i Montolio<br>\n",
    "[CC BY 4.0](https://creativecommons.org/licenses/by/4.0/)\n",
    "<hr>"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "1113a542-1098-4ddf-aff9-efb76aa475f4",
   "metadata": {
    "tags": []
   },
   "source": [
    "This article offers a quick introduction to SMT-based program analysis. The goal is not to build or dive deep into the internal workings of an SMT solver. Rather, we want to explore how an SMT solver can be leveraged to reason about code properties in practice. Thus, for the most part, an SMT solver will be considered as a *magic black-box* that can answer a very simple question, as we will see shortly.\n",
    "\n",
    "Before starting with the actual content, just a quick notation reminder.\n",
    "\n",
    "| Operation   | Math | Code |\n",
    "|------|------|------|\n",
    "| AND  | $\\land$  | \\& |\n",
    "| OR   | $\\lor$   | \\| |\n",
    "| NOT  | $\\neg$   | \\~  |\n",
    "| XOR  | $\\oplus$ | \\^  |"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "b2903916-6966-4f76-89d5-97b9f8c2f81e",
   "metadata": {
    "tags": []
   },
   "source": [
    "## SAT\n",
    "To give an appropriate context, let's start by talking briefly about the SAT problem (short for boolean **sat**isfiability problem).\n",
    "\n",
    "Consider some boolean variables:\n",
    "$$A, B, C$$\n",
    "\n",
    "Consider a bunch of boolean expressions, which we will call **constraints**:\n",
    "\\begin{align*}\n",
    "    \\varphi_1 &= (A \\lor \\neg B) \\land C\\\\\n",
    "    \\varphi_2 &= (\\neg A \\lor B) \\land (A \\lor C) \\land (B \\lor \\neg C)\n",
    "\\end{align*}\n",
    "\n",
    "These expressions come in [conjunctive normal form (CNF)](https://en.wikipedia.org/wiki/Conjunctive_normal_form), which just means that are written as *ANDs* of *ORs* of (possibly negated) variables. We will not discuss why CNF is used, but let's just say it is a standard representation with *nice* properties. Also, any boolean formula can be translated into an equivalent CNF one, so we can use it without loss of generality.\n",
    "\n",
    "\n",
    "Taking this into account, the *only* thing that a SAT solver does is answering a very simple **question**: *is the set of constraints $\\{\\varphi_1, \\varphi_2\\}$ satisfiable?*\n",
    "\n",
    "In other words: *is there any (boolean) variable assignment (for $A, B, C$) such that all constraints ($\\varphi_1, \\varphi_2$) hold true (i.e. $\\varphi_1 \\land \\varphi_2 = 1$)?*\n",
    "\n",
    "There are three possible **outcomes** to this question:\n",
    "\n",
    "- SAT: Yes, it is satisfiable, i.e. there is a variable assignment that makes all the constraints to hold true.\n",
    "    - If SAT, the solver will actually find a *model*, which is a particular solution (a concrete variable assignment).\n",
    "$$[A = 1, B = 1, C = 1]$$\n",
    "    \n",
    "- UNSAT: No, it is not satisfiable, i.e. there is NO variable assignment that makes all the constraints to hold true.\n",
    "- UNKNOWN: Unable to answer the question (usually due to a time-out)."
   ]
  },
  {
   "cell_type": "markdown",
   "id": "8101faad-2282-4ef1-9231-9c45fd2a8e3a",
   "metadata": {},
   "source": [
    "## SMT\n",
    "\n",
    "Now that we have a basic understanding of what does thinking about satisfiability look like, let's introduce SMT and SMT solvers.\n",
    "\n",
    "First, SMT stands for **s**atisfiability **m**odulo **t**heories. *Modulo* is one of these buzzwords that mathematicians and people working on formal methods like to use a lot (guilty here). In practice, you can vaguely just read *modulo* as \"taking into account\" whatever comes later. In this particular case, we have satisfiability (which we already have an intuition for) but *taking into account* (other; more) theories.\n",
    "\n",
    "Thus, SMT allows to reason about satisfiability in a broader sense than SAT. That is, satisfiability with respect to (taking into account) expressions that *live* in different realms (theories) other than just plain boolean expressions. And what are these theories? Essentially, frameworks that define the variables and operations (and the rules to combine them) at our disposal. Some examples of theories are *bit vectors*, *integer numbers*, *real numbers* or *floating point*.\n",
    "\n",
    "> **Note**: The theory of integer numbers is a model for the [abstract math integers](https://en.wikipedia.org/wiki/Integer), not the [programming integers](https://en.wikipedia.org/wiki/Integer_(computer_science)) of a fixed size of 32 or 64 bits. Similarly, the theory of real numbers deals with the [abstract real numbers](https://en.wikipedia.org/wiki/Real_number), not any kind of arbitrary-precision floating point.\n",
    "\n",
    "For the most part, program analysis relies on the **theory of bit vectors**. Here, the variables are fixed-size bit vectors, i.e. variables thought as arrays (vectors) of *n* bits; *n*-bit values. The theory also includes the usual arithmetic and logic operations, as well as the operations of concatenation (e.g. join two 32-bit values into a 64-bit value) and extraction (e.g. retrieve the lowest 8 bits from a 32-bit variable).\n",
    "\n",
    ">**Note**: In the theory of bit vectors, the notion of (un)signedness is not inherent to the variable itself (which is simply a raw array of bits), but rather a property of the operations (e.g. signed or unsigned comparisons)."
   ]
  },
  {
   "cell_type": "markdown",
   "id": "9867f6e8-eaea-4183-9379-4846a796a8db",
   "metadata": {
    "tags": []
   },
   "source": [
    "### Program analysis with the help of an SMT solver\n",
    "\n",
    "The best way to understand how such technology can be leveraged for program analysis is by going over some concrete examples using an actual SMT solver.\n",
    "\n",
    "For this purpose, we will use the [Z3 Theorem Prover](https://en.wikipedia.org/wiki/Z3_Theorem_Prover), an [open source](https://github.com/z3prover/z3) SMT solver by Microsoft. We mainly choose Z3 because it provides a great Python API which is easy to [install](https://pypi.org/project/z3-solver/), use and read, even having no prior experience with SMT solvers or [SMT-LIB standard language (syntax)](https://smtlib.cs.uiowa.edu/language.shtml).\n",
    "\n",
    "So, let's get to it!\n",
    "\n",
    "The first thing we need to do is importing everything from Z3 within our python shell/script."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "id": "b13f8d5c-3d58-4aad-9042-090c03a5e524",
   "metadata": {},
   "outputs": [],
   "source": [
    "from z3 import *"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "32a9834b-72f2-4f7c-adfa-1f209ef01d19",
   "metadata": {},
   "source": [
    "#### Input crafting\n",
    "\n",
    "Use the SMT solver to generate a set of inputs that satisfies some property or produces a desired output.\n",
    "\n",
    "**Problem 1**\n",
    "\n",
    "Given the following C function:\n",
    "```C\n",
    "bool check(uint32_t x, uint32_t y)\n",
    "{\n",
    "    return x*x*y + (3+x)*(7+y) == 1337;\n",
    "}\n",
    "```\n",
    "\n",
    "Find input values `x, y` such that the function `check` returns `true`."
   ]
  },
  {
   "cell_type": "markdown",
   "id": "b916e4f2-6739-4e7c-9a87-1cd7bb5b877f",
   "metadata": {},
   "source": [
    "**Solution 1**"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "id": "94a92ff8-f3fd-4a34-8e65-6915f9cc6450",
   "metadata": {
    "tags": []
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "x = 204374014\n",
      "y = 334463242\n"
     ]
    }
   ],
   "source": [
    "# Define x, y as bit vectors of size 32 bits\n",
    "x, y = BitVecs('x y', 32)\n",
    "\n",
    "# Initialize the solver\n",
    "solver = Solver()\n",
    "\n",
    "# Add constraint\n",
    "solver.add(x*x*y + (3+x)*(7+y) == 1337)\n",
    "\n",
    "# Check if the set of constraints is satisfiable\n",
    "if solver.check() == sat:\n",
    "    # Get model and print it nicely\n",
    "    m = solver.model()\n",
    "    print(f\"x = {m[x]}\\ny = {m[y]}\")"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "6594daeb-2c88-4745-8f09-c62ce6fcac29",
   "metadata": {},
   "source": [
    "> **Note**: The solution (model) does not have to be unique. Actually, if you run this code several times, it is very probable that you get different solutions. Remember, the only thing that a SAT outcome guarantees is that some solution (model) exists for the set of constraints, and it will find a concrete one (among possibly a large amount of them)."
   ]
  },
  {
   "cell_type": "markdown",
   "id": "a6f1bc56-d2ed-48e0-b805-116aac729ff9",
   "metadata": {},
   "source": [
    "**Problem 2**\n",
    "\n",
    "Consider the following C function:\n",
    "```C\n",
    "bool verify_license(char *code)\n",
    "{\n",
    "    size_t sz = strlen(code);\n",
    "    if (sz != 6) return false;\n",
    "    \n",
    "    char flag = 'A';\n",
    "    for (int i = 0; i < sz; i++)\n",
    "    {\n",
    "        if ((code[i] >= '0' && code[i] <= '9') ||\n",
    "            (code[i] >= 'A' && code[i] <= 'Z'))\n",
    "        {\n",
    "            flag = flag ^ code[i];\n",
    "        }\n",
    "        else\n",
    "        {\n",
    "            return false;\n",
    "        }\n",
    "    }\n",
    "    \n",
    "    return flag == 'Z';\n",
    "}\n",
    "```\n",
    "\n",
    "Find valid input *codes*, i.e. build a [keygen](https://en.wikipedia.org/wiki/Keygen)."
   ]
  },
  {
   "cell_type": "markdown",
   "id": "b3d3b788-a813-47bb-a6b3-87b5007edbf3",
   "metadata": {},
   "source": [
    "**Solution 2**"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "id": "c5301c8d-aabe-45b3-9d2d-895db205dd48",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "code = E2JJ4X\n"
     ]
    }
   ],
   "source": [
    "# Define code\n",
    "# List (array) of 6 bit vectors of size 8 bits (char) \n",
    "code = [BitVec(f\"c{i}\", 8) for i in range(6)]\n",
    "\n",
    "# Define flag\n",
    "# Bit vector of size 8 bits with a concrete value 'A'\n",
    "flag = BitVecVal(ord('A'), 8)\n",
    "\n",
    "# Initialize the solver\n",
    "solver = Solver()\n",
    "\n",
    "# Iterate over all chars of code\n",
    "for c in code:\n",
    "    # Add constraint\n",
    "    # The character has to lay within the valid range\n",
    "    solver.add(Or(\n",
    "        And(c >= ord('0'), c <= ord('9')),\n",
    "        And(c >= ord('A'), c <= ord('Z'))\n",
    "    ))\n",
    "    \n",
    "    # Update the flag bit vector\n",
    "    flag = flag ^ c\n",
    "\n",
    "# Add constraint\n",
    "# The end value of flag has to be 'Z'\n",
    "solver.add(flag == ord('Z'))\n",
    "\n",
    "# Check if the set of constraints is satisfiable\n",
    "if solver.check() == sat:\n",
    "    # Get model and print it nicely\n",
    "    m = solver.model()\n",
    "    s = ''.join(chr(m[c].as_long()) for c in code)\n",
    "    print(f\"code = {s}\")"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "186dbb5c-1b87-4f3a-8f28-28afd0ebe012",
   "metadata": {},
   "source": [
    "> **Note**: If you have ever done any reverse engineering challenge, you might be familiar with reversing key verification algorithms in order to build a key generation algorithm. The nice thing about this SMT solver approach is that you do not need to reverse anything. Rather, just encode the very same verification algorithm as a set of constraints to be satisfied and let the SMT solver generate valid keys for you."
   ]
  },
  {
   "cell_type": "markdown",
   "id": "27e8b928-0d99-4320-9ff9-74e1721e5456",
   "metadata": {},
   "source": [
    "#### Model counting\n",
    "\n",
    "Use the SMT solver to count how many solutions exist for a given set of constraints.\n",
    "\n",
    "The idea is to run the satisfiability check in a loop, saving (counting) the current solution and adding at each iteration a constraint to exclude it from subsequent iterations.\n",
    "\n",
    "**Problem 3**\n",
    "\n",
    "Consider the previous `verify_license` function.\n",
    "\n",
    "How many valid *codes* exist that start with `ABCD`? Print them."
   ]
  },
  {
   "cell_type": "markdown",
   "id": "abc50e4f-fdc0-4d3d-936b-7951b4314236",
   "metadata": {},
   "source": [
    "**Solution 3**"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "id": "012b70d0-abfa-452d-9050-8fba61e3ffcc",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "22 valid codes start with ABCD:\n",
      "---\n",
      "ABCDGX\n",
      "ABCDFY\n",
      "ABCDRM\n",
      "ABCDVI\n",
      "ABCDWH\n",
      "ABCDOP\n",
      "ABCDNQ\n",
      "ABCDJU\n",
      "ABCDZE\n",
      "ABCDXG\n",
      "ABCDHW\n",
      "ABCDIV\n",
      "ABCDKT\n",
      "ABCDLS\n",
      "ABCDTK\n",
      "ABCDEZ\n",
      "ABCDMR\n",
      "ABCDYF\n",
      "ABCDQN\n",
      "ABCDUJ\n",
      "ABCDPO\n",
      "ABCDSL\n"
     ]
    }
   ],
   "source": [
    "# ... run previous code ...\n",
    "# Add the extra constraints\n",
    "solver.add(code[0] == ord('A'))\n",
    "solver.add(code[1] == ord('B'))\n",
    "solver.add(code[2] == ord('C'))\n",
    "solver.add(code[3] == ord('D'))\n",
    "\n",
    "# Define a list to hold the valid solutions\n",
    "s = []\n",
    "\n",
    "# While the set of constraints is satisfiable\n",
    "while solver.check() == sat:    \n",
    "    # Get model\n",
    "    m = solver.model()\n",
    "    \n",
    "    # Save solution\n",
    "    s.append(''.join(\n",
    "        chr(m[c].as_long()) for c in code)\n",
    "    )\n",
    "    \n",
    "    # Add constraint\n",
    "    # Exclude current solution\n",
    "    solver.add(\n",
    "        Not(And([c == m[c] for c in code]))\n",
    "    )\n",
    "\n",
    "# Print the outcome\n",
    "print(f\"{len(s)} valid codes start with ABCD:\\n---\")\n",
    "print(\"\\n\".join(s))"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "4f374299-3226-444e-a2e5-c88250cfb107",
   "metadata": {},
   "source": [
    "#### Semantic equivalence\n",
    "\n",
    "Use the SMT solver to check if two expressions are semantically equivalent, i.e. whether for every possible (same) input they produce the same output.\n",
    "\n",
    "The naive approach would be to simply add to the solver a constraint where the two expressions are equated.\n",
    "```python\n",
    "solver.add(exp1 == exp2)\n",
    "if (solver.check() == sat): # INCORRECT\n",
    "```\n",
    "\n",
    "But beware! This is not correct. Here we are asking the SMT solver whether it exists **some** variable assignment (inputs) to the expressions `exp1` and `exp2` such that their output matches. But that is not what we are looking for. We want to check whether the expressions are semantically equivalent, i.e. that their outcome will match for **all** possible inputs.\n",
    "\n",
    "Instead, we need to use as a constraint that the two expressions are different (i.e. we will ask whether it exists a variable assignment (inputs) such that the two expressions produce a different outcome). Then, we will verify that this constraint is unsatisfiable. This means that there is no way of giving the same input to the expressions and have them produce a different output. In other words, for every possible (same) input they produce the same output, as we want.\n",
    "\n",
    "```python\n",
    "solver.add(exp1 != exp2)\n",
    "if (solver.check() == unsat): # CORRECT\n",
    "```"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "7f1efcac-e88c-40e5-8663-ac3d378cee8a",
   "metadata": {},
   "source": [
    "**Problem 4**\n",
    "\n",
    "Consider the following C functions, which define obfuscated operations:\n",
    "\n",
    "```C\n",
    "uint32_t obf_op1(uint32_t x, uint32_t y)\n",
    "{\n",
    "    return (x ^ y) + 2*(x & y);\n",
    "}\n",
    "\n",
    "uint32_t obf_op2(uint32_t x, uint32_t y)\n",
    "{\n",
    "    return (x | y) - y + (~x & y);\n",
    "}\n",
    "```\n",
    "\n",
    "Find out the underlying semantics of `obf_op1` and `obf_op2`."
   ]
  },
  {
   "cell_type": "markdown",
   "id": "b8273ce2-0eac-4b32-8e48-6c22bca677d1",
   "metadata": {},
   "source": [
    "**Solution 4**"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "id": "a6c3dab1-2e10-4c88-839b-f78b14d2d52c",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "x + y == (x ^ y) + 2*(x & y)\n",
      "x ^ y == (x | y) - y + (~x & y)\n"
     ]
    }
   ],
   "source": [
    "# Return true if the expressions are equivalent\n",
    "def check_semantic_equivalence(exp1, exp2):\n",
    "    solver = Solver()\n",
    "    solver.add(exp1 != exp2)\n",
    "    return solver.check() == unsat\n",
    "\n",
    "# Define x, y as bit vectors of size 32 bits\n",
    "x, y = BitVecs('x y', 32)\n",
    "\n",
    "# Define a list of operations to check against\n",
    "basic_operations = [\n",
    "    x + y,\n",
    "    x - y,\n",
    "    x * y,\n",
    "    x & y,\n",
    "    x | y,\n",
    "    x ^ y\n",
    "]\n",
    "\n",
    "# Define the two obfuscated operations\n",
    "obf_op1 = (x ^ y) + 2*(x & y)\n",
    "obf_op2 = (x | y) - y + (~x & y)\n",
    "\n",
    "# Check if the obfuscated expressions match the semantics\n",
    "# of some of the basic operations\n",
    "for op in basic_operations:\n",
    "    if check_semantic_equivalence(op, obf_op1):\n",
    "        print(f\"{op} == {obf_op1}\")\n",
    "        \n",
    "    if check_semantic_equivalence(op, obf_op2):\n",
    "        print(f\"{op} == {obf_op2}\")"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "66112683-b3f0-48d9-bb5f-c5578b9db08c",
   "metadata": {},
   "source": [
    "#### Detect opaque predicates\n",
    "\n",
    "Use the SMT solver to detect false branching (and unreachable code) through the use of obfuscation based on [opaque predicates](https://en.wikipedia.org/wiki/Opaque_predicate).\n",
    "\n",
    "In a nutshell, an opaque predicate is a (non-trivial) conditional statement for which the truth value is known a priori by the author, but concealed to the analyst."
   ]
  },
  {
   "cell_type": "markdown",
   "id": "35f1cf19-12e8-4ae4-beed-c7816078c418",
   "metadata": {},
   "source": [
    "**Problem 5**\n",
    "\n",
    "Consider the following C function:\n",
    "\n",
    "```C\n",
    "uint8_t opaques(uint8_t x, uint8_t y)\n",
    "{\n",
    "    uint8_t z;\n",
    "\n",
    "    if ((uint8_t)(151 * (39 * ((x ^ y) + 2 * (x & y)) + 23) + 111) >\n",
    "        (uint8_t)((x ^ y) + 2 * (x & y)))\n",
    "    {\n",
    "        z = x & y;\n",
    "    }\n",
    "\n",
    "    else if ((uint8_t)(x-y + 2*(~x&y) - (x^y)) == 0x17)\n",
    "    {\n",
    "        z = x | y;\n",
    "    }\n",
    "\n",
    "    else if (\n",
    "        (uint8_t)(195 +\n",
    "                  97*x +\n",
    "                  159*y +\n",
    "                  194*~(x | ~y) +\n",
    "                  159*(x ^ y) +\n",
    "                  (163 + x + 255*y + 2*~(x | ~y) + 255*(x ^ y))*\n",
    "                  (232 + 248*x + 8*y + 240*~(x | ~y) + 8*(x ^ y)) - 57) < 100\n",
    "    )\n",
    "    {\n",
    "        z = x ^ y;\n",
    "    }\n",
    "\n",
    "    else {\n",
    "        z = 0;\n",
    "    }\n",
    "\n",
    "    return z;\n",
    "}\n",
    "```\n",
    "\n",
    "Provide an equivalent function cleaning all false branching."
   ]
  },
  {
   "cell_type": "markdown",
   "id": "95143f79-b268-4493-8a73-168547771807",
   "metadata": {},
   "source": [
    "**Solution 5**"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "id": "f220d555-e601-4525-8290-5c418b2ef42a",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "ALWAYS FALSE\n",
      "---\n",
      "151*(39*((x ^ y) + 2*(x & y)) + 23) + 111 >\n",
      "(x ^ y) + 2*(x & y)\n",
      "\n",
      "ALWAYS FALSE\n",
      "---\n",
      "x - y + 2*(~x & y) - (x ^ y) == 23\n",
      "\n",
      "ALWAYS TRUE\n",
      "---\n",
      "195 +\n",
      "97*x +\n",
      "159*y +\n",
      "194*~(x | ~y) +\n",
      "159*(x ^ y) +\n",
      "(163 + x + 255*y + 2*~(x | ~y) + 255*(x ^ y))*\n",
      "(232 + 248*x + 8*y + 240*~(x | ~y) + 8*(x ^ y)) -\n",
      "57 <\n",
      "100\n",
      "\n"
     ]
    }
   ],
   "source": [
    "# Detect conditional statements that are always false\n",
    "def check_false_predicate(p):\n",
    "    solver = Solver()\n",
    "    solver.add(p)\n",
    "    return solver.check() == unsat\n",
    "\n",
    "# Detect conditional statements that are always true\n",
    "def check_true_predicate(p):\n",
    "    solver = Solver()\n",
    "    solver.add(Not(p)) # <-- Not(p)\n",
    "    return solver.check() == unsat\n",
    "        \n",
    "# Define x, y as bit vectors of size 8 bits\n",
    "x, y = BitVecs('x y', 8)\n",
    "\n",
    "# Define the conditional statements to check\n",
    "predicates = [\n",
    "    # if ...\n",
    "    151*(39*((x ^ y) + 2*(x & y)) + 23) + 111 >\n",
    "    (x ^ y) + 2*(x & y),\n",
    "    \n",
    "    # else if ...\n",
    "    x-y + 2*(~x&y) - (x^y) == 0x17,\n",
    "\n",
    "    # else if ...\n",
    "    195 +\n",
    "    97*x +\n",
    "    159*y +\n",
    "    194*~(x | ~y) +\n",
    "    159*(x ^ y) +\n",
    "    (163 + x + 255*y + 2*~(x | ~y) + 255*(x ^ y))*\n",
    "    (232 + 248*x + 8*y + 240*~(x | ~y) + 8*(x ^ y)) - 57 < 100\n",
    "]\n",
    "\n",
    "# Iterate over all conditional statements\n",
    "# Check whether they are opaque predicates\n",
    "for p in predicates:\n",
    "    if check_false_predicate(p):\n",
    "        print(f\"ALWAYS FALSE\\n---\\n{p}\\n\")\n",
    "        \n",
    "    elif check_true_predicate(p):\n",
    "        print(f\"ALWAYS TRUE\\n---\\n{p}\\n\")"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "f2ca7af8-e79e-4379-ac78-18d7df923603",
   "metadata": {},
   "source": [
    "Thus, the first two branches where `z = x & y` and `z = x | y` are unreachable. Also, the third branch where `z = x ^ y` will always be reached, meaning that fourth branch where `z = 0` is unreachable too. In the end, we can remove all branching and end up with an equivalent function which simply computes and returns the XOR of two 8-bit input values.\n",
    "\n",
    "```C\n",
    "uint8_t deobf_opaques(uint8_t x, uint8_t y)\n",
    "{\n",
    "    uint8_t z = x ^ y;\n",
    "    return z\n",
    "}\n",
    "```"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "7ab459fb-df0c-46f9-9dbf-35c5002f8844",
   "metadata": {},
   "source": [
    "#### Encode control-flow\n",
    "Use the SMT solver to directly encode control-flow constructs. Within Z3's Python API, this can be achieved through the following syntax:\n",
    "\n",
    "`If([condition], [then], [else])`.\n",
    "\n",
    "**Problem 6**\n",
    "\n",
    "Define a function that encodes the previous `opaques` function using control-flow constructs in Z3. Then, check that it is semantically equivalent to the XOR expression `x ^ y`."
   ]
  },
  {
   "cell_type": "markdown",
   "id": "c5fed855-1e9b-4b14-b8ba-8ee6d37b933c",
   "metadata": {},
   "source": [
    "**Solution 6**"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 7,
   "id": "91d5d692-b860-46ed-b257-e227ae4d94cc",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "True"
      ]
     },
     "execution_count": 7,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# Define the opaques function\n",
    "# Nested Z3's control-flow constructs\n",
    "def opaques(p1, p2, p3):\n",
    "    return If(p1, x & y, If(p2, x | y, If(p3, x ^ y, 0)))\n",
    "\n",
    "# Define x, y as bit vectors of size 8 bits\n",
    "x, y = BitVecs('x y', 8)\n",
    "\n",
    "# Unpack the three conditional statements\n",
    "p1, p2, p3 = [p for p in predicates]\n",
    "\n",
    "# Verify that the function is equivalent to an XOR\n",
    "check_semantic_equivalence(\n",
    "    opaques(p1, p2, p3),\n",
    "    x ^ y\n",
    ")"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "814bd7ec-48f0-4547-b49c-265a61b48058",
   "metadata": {},
   "source": [
    "### Final notes\n",
    "\n",
    "SMT solvers are way more powerful and expressive than we have shown here. For instance, they usually accept encoding [first-order](https://en.wikipedia.org/wiki/First_order_logic) predicates. In practice, that means that we can use [quantifiers](https://en.wikipedia.org/wiki/Quantifier_(logic)) within our constraints (e.g. _**for all** possible values of x, **exists** a value of y such that ..._).\n",
    "\n",
    "The code examples are in `C` deliberately, for the sake of accessibility. However, the concepts discussed easily apply to any other language. In particular, our motivation to study and use SMT solvers comes mostly from a security research and reverse engineering point-of-view, when dealing with obfuscated code in assembly.\n",
    "\n",
    "You might be wondering how can we extract such expressions when dealing with binaries and assembly code. Essentially, there are two main approaches:\n",
    "\n",
    "- Dirty way: from pseudocode retrieved from a [decompiler](https://en.wikipedia.org/wiki/Decompiler).\n",
    "- Formal(-ish) way: leveraging [symbolic execution](https://en.wikipedia.org/wiki/Symbolic_execution).\n",
    "\n",
    "In particular, symbolic execution is a technique that can be applied to *any* language (high-level programming language, assembly of some architecture, or even [intermediate representations](https://en.wikipedia.org/wiki/Intermediate_representation)). It has its (practical) limitations, but in general allows to semi-automatically extract formulas (constraints) from the control-flow and data-flow of a program (function, basic block...) of a given language. Also, symbolic execution engines provide a seamless interface into one (or more) SMT solvers. Thus, the whole program analysis process of extracting constraints and performing queries to the SMT solvers can be done in a sane way (i.e. not having to copy-paste around), especially when we are interested in mid-large scale automation.\n",
    "\n",
    "Maybe (hopefully) we can talk about symbolic execution another time."
   ]
  },
  {
   "cell_type": "markdown",
   "id": "ee319b04-64b6-432e-b12f-b52fc229bc6c",
   "metadata": {},
   "source": [
    "#### References\n",
    "\n",
    "- Online articles on the Z3 API in Python. [Basics](https://ericpony.github.io/z3py-tutorial/guide-examples.htm) and [advanced topics](https://ericpony.github.io/z3py-tutorial/advanced-examples.htm).\n",
    "- The book [SAT/SMT by Example](https://sat-smt.codes/), a really fun compendium of problems and puzzles, and how to translate them into constraints for a SAT/SMT solver."
   ]
  },
  {
   "cell_type": "markdown",
   "id": "7e98ea22-0ff8-4289-a429-57974d3153da",
   "metadata": {},
   "source": [
    "#### Training\n",
    "\n",
    "Last but not least, a shameless self-plug to [our software protection and deobfuscation trainings](https://furalabs.com/trainings), where we leverage SMT solvers, among many other tools and techniques, within the context of binary analysis."
   ]
  }
 ],
 "metadata": {
  "authors": [
   {
    "name": "Arnau Gàmez i Montolio"
   }
  ],
  "kernelspec": {
   "display_name": "Python 3 (ipykernel)",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.11.1"
  },
  "nbformat": 4,
  "nbformat_minor": 5,
  "title": "A gentle introduction to SMT-based program analysis",
  "toc-autonumbering": false,
  "toc-showcode": false,
  "toc-showmarkdowntxt": false,
  "toc-showtags": false,
  "vscode": {
   "interpreter": {
    "hash": "e7370f93d1d0cde622a1f8e1c04877d8463912d04d973331ad4851f04de6915a"
   }
  }
 },
 "nbformat": 4,
 "nbformat_minor": 5
}