LS8 - Software Testing

Software Testing (? days)

  • Software Testing
  • Dynamic Testing
  • Coverage Collection (up until coverage granularity spectrum for quiz)
  • Directed Greybox Fuzzing
  • Fuzz Testing - Sanitizers
  • Symbolic Execution
    • BG: SAT and SMT solvers
  • Misc. Testing Topics
  • Mock Testing
  • Future Directions and Research

Slides

Summary

Fuzzer Summary:
  • Fuzzing is an effective testing mechanism
  • Unlike other testing approaches, it can find new inputs that cause program misbehavior
  • Mutation and coverage are important for generating effective fuzzers
  • Directed greybox fuzzers (DGF) can target specific parts of the codebase
  • Sanitizers provide additional information needed to find silent UB bugs
Symbolic Execution
  • Symbolic execution explores paths systematically using symbolic inputs + SMT solvers
  • Path constraints define equivalence classes - one test input per class suffices
  • KLEE: practical symbolic execution on LLVM bitcode
  • Challenges: path explosion, loops, environment modeling, solver limitations

Software Testing

Process of finding bugs and vulnerabilities in software

“Program testing can be used to show the presence of bugs, but never to show their absence”

  • E. W. Dijkstra

Types:

  • Testing based on dynamic analysis
    • Software executed, monitor for assertion failures and other bugs during execution
    • Unit testing, integration testing, fuzz testing
  • Testing based on static analysis
    • Source code analyzed without execution
    • Analysis identifies known bug patterns: uninitialized variables, unsafe data flow, null pointer dereferences, etc.
    • Data flow analysis, control flow analysis, abstract interpretation
  • Hybrid: Symbolic Execution

Dynamic Testing

For example with JUnit:

  • Tests individual functions
  • Adds assertions to check if conditions hold
  • Corner-case tests are manually created
    • Difficult to cover all code with unit testing (LOTS and LOTS of code, Linux Kernel ~30mil lines)
  • Automated testing is a sub-field of automated software engineering
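
The same idea can be sketched with Python's built-in unittest module (the `clamp` function here is a hypothetical example, not from the course):

```python
import unittest

def clamp(value, low, high):
    """Keep value within [low, high] (hypothetical function under test)."""
    return max(low, min(value, high))

class ClampTest(unittest.TestCase):
    def test_within_range(self):
        self.assertEqual(clamp(5, 0, 10), 5)

    def test_corner_cases(self):
        # Manually written corner cases: values outside the boundaries
        self.assertEqual(clamp(-1, 0, 10), 0)
        self.assertEqual(clamp(11, 0, 10), 10)
```

Run with `python -m unittest <file>`. The corner cases must be enumerated by hand, which is exactly why this does not scale to millions of lines of code.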

Silent Failures:

  • As opposed to ‘loud’ bugs that cause crashes/assertion failures, some bugs are silent

Detecting silent (UB) bugs requires both exercising the buggy code with the right inputs AND an oracle to identify the buggy execution

Memory safety:

  • Languages like C lack memory-safety
    • Buffer-overflow
    • Use-after-free
    • Double-free
  • None of the above bugs necessarily crash the program

Fuzz Testing

Goal: Find app inputs that reveal bugs

  • Test the entire app instead of individual functions
  • Generate inputs randomly until an error is uncovered

Bugs revealed:

  • Incorrect argument validation, incorrect type casting
  • Mem bugs (overflows, leaks, …)
  • Logic errors through assertion failures
  • Applications written in any language can be fuzzed
Blackbox fuzzing
  • Simple loop invoking app with new inputs
  • No internal visibility into the app
    Pros:
  • Simple; generating random inputs couldn't be easier
    Cons:
  • Random inputs are not good inputs
    • e.g. imagine trying to create a SQL command with random inputs; it would just be rejected without even executing
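
A minimal blackbox loop might look like the sketch below. `parse` is a hypothetical target with a planted bug, and, as the cons above predict, purely random bytes essentially never get past its magic-value check:

```python
import random

def blackbox_fuzz(target, max_len=64, trials=1000, seed=0):
    """Blackbox fuzzing: feed random byte strings to `target`
    and collect inputs that raise an unexpected exception."""
    rng = random.Random(seed)
    crashes = []
    for _ in range(trials):
        data = bytes(rng.randrange(256) for _ in range(rng.randrange(max_len)))
        try:
            target(data)
        except ValueError:
            pass                    # expected rejection of invalid input
        except Exception:
            crashes.append(data)    # anything else is a potential bug

    return crashes

# Hypothetical target: parses a tiny header with a planted off-by-one bug.
def parse(data):
    if not data.startswith(b"HDR"):
        raise ValueError("bad magic")
    return data[3] * 2   # IndexError if the header has no length byte
```

`blackbox_fuzz(parse)` almost certainly returns an empty list: random bytes rarely begin with `b"HDR"`, so the buggy line is never reached.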
Mutation fuzzer

Seeds
Start with valid inputs as ‘seeds’
Mutate them based on some ‘strategy’

  • Shuffle bytes, erase bytes, insert byte, change bytes
    Pros:
  • Simple to implement
  • Good for binary inputs and streams (JPEG file encoder/decoder)
    Cons:
  • Generates many invalid inputs for applications that expect structured inputs
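
The four mutation strategies listed above can be sketched in a few lines of Python (the seed value is an illustrative assumption):

```python
import random

def mutate(seed: bytes, rng: random.Random) -> bytes:
    """Apply one random mutation: shuffle, erase, insert, or change a byte."""
    data = bytearray(seed)
    op = rng.choice(["shuffle", "erase", "insert", "change"])
    if op == "shuffle":
        rng.shuffle(data)
    elif op == "erase" and data:
        del data[rng.randrange(len(data))]
    elif op == "insert":
        data.insert(rng.randrange(len(data) + 1), rng.randrange(256))
    elif op == "change" and data:
        data[rng.randrange(len(data))] = rng.randrange(256)
    return bytes(data)

rng = random.Random(1)
seed = b"HDR\x05hello"   # a valid input used as the seed
inputs = [mutate(seed, rng) for _ in range(5)]
```

Because each mutant differs from a valid seed by only one operation, mutants are far more likely to reach deep code than purely random bytes.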

Dictionaries:

  • Provide a dict of ‘keywords’ (like SQL: CREATE/DELETE… or HTML tags)
  • Mutate input by replacing/adding a chosen keyword from this dictionary
    Pros:
  • Still simple
  • Works reasonably well for structured input
    Cons:
  • Still generates many invalid inputs

Grammar-aware fuzzing

  • Define the input format as a CFG (BNF rules)
  • Generate inputs by randomly expanding production rules; this only produces syntactically valid inputs
    • (can imagine it working for SQL/HTML/etc.)
      Pros:
  • Always syntactically valid, tests deeper code
  • Can weight rules to target certain features
    Cons:
  • Must write/maintain a grammar for each input format
  • Only generates valid inputs, doesn't test invalid inputs
  • Generating valid inputs is expensive
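
A grammar-based generator can be sketched as recursive random expansion of BNF-style rules. The SQL-like grammar below is an illustrative toy subset, not real SQL:

```python
import random

# A tiny BNF-style grammar: nonterminals map to lists of productions.
GRAMMAR = {
    "<stmt>":  [["SELECT ", "<cols>", " FROM ", "<table>"]],
    "<cols>":  [["*"], ["<col>"], ["<col>", ", ", "<col>"]],
    "<col>":   [["id"], ["name"], ["age"]],
    "<table>": [["users"], ["orders"]],
}

def expand(symbol, rng):
    """Recursively expand a symbol by picking random productions;
    anything not in the grammar is a terminal (literal text)."""
    if symbol not in GRAMMAR:
        return symbol
    production = rng.choice(GRAMMAR[symbol])
    return "".join(expand(s, rng) for s in production)

rng = random.Random(0)
queries = [expand("<stmt>", rng) for _ in range(3)]
# Every generated string is syntactically valid under this grammar.
```

Weighting the `rng.choice` toward particular productions is how rules can be biased to target certain features.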
Fuzzer Success Factors
  • Fuzzer throughput

  • Percentage of valid inputs generated by the mutator

    • Obvious tradeoff between the two (generating valid input is costly, but more often effective)
  • Also, test ‘depth’ and code coverage

  • ‘Most promising’ fuzz inputs - simple heuristic is code coverage

    • Amount of code/execution paths explored during testing
  • Input seeds that increase coverage are ‘more promising’

    • High coverage = high bug-finding potential
  • Almost an RL / evolutionary algorithm?
Coverage guided Fuzzing

Record coverage
Prioritize input seeds which give higher coverage
Gradually uncover all program paths
Also known as ‘greybox fuzzing’
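
The greybox loop above can be sketched in Python. As a stand-in for AFL's compile-time edge instrumentation, this sketch records line coverage via `sys.settrace`; the mutation is a single byte flip:

```python
import random
import sys

def run_with_coverage(func, data):
    """Execute func(data) and record the set of (function, line) pairs hit."""
    covered = set()
    def tracer(frame, event, arg):
        if event == "line":
            covered.add((frame.f_code.co_name, frame.f_lineno))
        return tracer
    sys.settrace(tracer)
    try:
        func(data)
    except Exception:
        pass                  # crashes would be recorded separately
    finally:
        sys.settrace(None)
    return covered

def coverage_guided_fuzz(target, seed, rounds=200):
    """Greybox loop: keep mutants that reach new coverage as future seeds."""
    rng = random.Random(0)
    corpus = [seed]
    seen = run_with_coverage(target, seed)
    for _ in range(rounds):
        parent = bytearray(rng.choice(corpus))
        if parent:
            parent[rng.randrange(len(parent))] = rng.randrange(256)
        mutant = bytes(parent)
        cov = run_with_coverage(target, mutant)
        if not cov <= seen:       # new lines reached: "interesting" input
            seen |= cov
            corpus.append(mutant)
    return corpus
```

Inputs that only retread known coverage are discarded; inputs that expand coverage join the corpus and seed future mutations, gradually uncovering more paths.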

AFL

American Fuzzy lop (AFL)
Fuzzer for C/C++ applications

Coverage Collection

Control flow graph (CFG)

  • One possible representation of program execution

CFG consists of basic blocks (BBs):

  • e.g. entry (of a function), if statements, variable updates, exit, etc.
    During fuzzing, if a new path is discovered that expands the CFG, the input is considered “interesting”
  • The fuzzer records coverage at the BB level
Compilers

Many automated software engineering techniques, including fuzzing, leverage the compiler

afl-gcc is a wrapper around gcc

  • It gets inserted into the compilation process
  • Adds a ‘coverage collector’ into the generated binary

Records the code coverage generated by taking certain edges in the CFG

Path Coverage

Track the exact sequence of branches taken

  • Same branches, different order = different path
  • Bugs might depend on the exact path

However, leads to path explosion

  • N branches yield up to 2^N possible paths
    • Loops too: every iteration is a new path
Coverage Granularity Spectrum
  • Branch/edge coverage: which branches were taken (AFL)
  • Value profile: how close comparisons are to being satisfied (libFuzzer)
  • Path coverage: exact sequence of branches (most precise, but exponential complexity)

Weigh fine grained coverage vs performance overhead

  • Depends on application

Directed Greybox Fuzzing (DGF)

Plain greybox fuzzing can’t be directed; it explores autonomously
Directed fuzzing has many applications

  • Path testing: reach change statements
  • Crash reproduction: reproduce reported bugs
  • Sensitive functions and APIs
DGF Compiler Instrumentation

Extract call graph (CG) and control-flow graph (CFGs)

  • Compute distance to target locations
  • Instrument program to aggregate distance

CG - Directed graph where:

  • Nodes represent functions
  • Edges represent calls (edge from A to B means function A calls B)

DGF - Directed Greybox Fuzzing:

  • Identify target functions
  • For each function compute the shortest path to target
  • For each run, the fuzzer scores the executed BBs by their shortest distance to the target
    • Keep seeds with lower distance
    • Discard seeds with higher distance
    • Executions progress towards a target
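
The distance computation over the call graph can be sketched as a BFS from the target over reversed call edges (the call graph `cg` below is hypothetical):

```python
from collections import deque

def distances_to_target(call_graph, target):
    """Shortest number of call edges from each function to `target`,
    via BFS over the reversed call graph."""
    reverse = {}
    for caller, callees in call_graph.items():
        for callee in callees:
            reverse.setdefault(callee, []).append(caller)
    dist = {target: 0}
    queue = deque([target])
    while queue:
        node = queue.popleft()
        for pred in reverse.get(node, []):
            if pred not in dist:
                dist[pred] = dist[node] + 1
                queue.append(pred)
    return dist

# Hypothetical call graph: main calls parse and log; parse calls check.
cg = {"main": ["parse", "log"], "parse": ["check"], "log": [], "check": []}
d = distances_to_target(cg, "check")
# d == {"check": 0, "parse": 1, "main": 2}; "log" cannot reach the target
```

The instrumented program aggregates these per-function distances at runtime, and the fuzzer keeps the seeds whose executions score a lower distance to the target.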

Fuzz Testing - Sanitizers

Memory safety bugs: Fuzzers execute the program with random inputs and look for crashes, but bugs may or may not crash the program

  • Undefined Behavior (UB)
Sanitizers

Sanitizers - Oracles for memory safety bugs

  • Adds additional compiler instrumentation to record object metadata
  • Turn silent UB into crashes
  • Popular options
    • Address Sanitizer (ASAN) - can detect buffer overflow
    • Leak Sanitizer (LSAN) - can detect memory leak
    • Thread Sanitizer (TSAN) - can detect data races
  • SIGNIFICANTLY slows down the application, but can detect bugs

Sanitizers - Insert additional instrumentation

  • Implemented as an additional compiler pass in afl-clang

Address Sanitizer
Add code to:

  • Record size of heap buffers
  • Check if an out-of-bounds access occurs

Option 1:

  • Maintain metadata for each object
    • e.g. a hashmap: {obj_addr: len}
  • Instrument heap allocation with hashmap insertion
  • Instrument pointer dereference with hashmap lookup
  • Very expensive

ASAN is a sanitizer used with popular fuzzers

  • Insert “red/poison zones” around memory
  • Accesses to red zones indicate a buffer overflow
  • Shadow memory
    • Record metadata about red and accessible zones
    • Each byte in shadow memory corresponds to 8 bytes in app
      • 0x00 (valid); 0xFF (poisoned range); 0x01-0x07 (partially poisoned)
  • Compiler adds instruction before memory access to check if shadow memory byte is poisoned
    • Checking shadow memory is very fast (know exactly where to look, and what to check once there)
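
The shadow-memory check can be sketched as follows. Real ASAN computes `shadow = (addr >> 3) + offset` as direct address arithmetic; here a Python dict stands in for the shadow region, and `SHADOW_OFFSET` is an illustrative constant, not ASAN's real offset:

```python
SHADOW_OFFSET = 0x10000    # illustrative constant, not ASAN's real offset

shadow = {}                # shadow memory: one byte per 8 application bytes

def shadow_addr(addr):
    """Each shadow byte covers one 8-byte application granule."""
    return (addr >> 3) + SHADOW_OFFSET

def poison(addr, size):
    """Mark [addr, addr + size) as a red zone (0xFF = fully poisoned)."""
    for a in range(addr, addr + size, 8):
        shadow[shadow_addr(a)] = 0xFF

def check_access(addr):
    """The check the compiler inserts before each memory access:
    a single shadow lookup decides whether the access is allowed."""
    if shadow.get(shadow_addr(addr), 0x00) == 0xFF:
        raise RuntimeError(f"ASAN-style violation at {addr:#x}")

# A 16-byte buffer at 0x1000 with an 8-byte red zone placed after it:
poison(0x1000 + 16, 8)
check_access(0x1000)         # in bounds: fine
# check_access(0x1000 + 16)  # overflow into the red zone: would raise
```

The key point is the cost model: one shift, one add, one load, one compare per access, which is why the shadow check is fast even though ASAN as a whole slows the program down.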

Fuzzer success factors:

  • Fuzzer throughput (exec/s)
  • Percentage of valid inputs generated by the mutator
  • Type of coverage recorded
  • Sanitizer vs. no sanitizer
  • Ideally, want a high percentage of valid inputs + fine-grained code and data coverage + sanitizers
  • Unfortunately, each of these reduces throughput
  • Tradeoff: finding the right balance for your use case!

Symbolic Execution

Fuzzer limitations:

  • Fuzzing generates inputs randomly - effective but incomplete
  • Struggles with:
    • Magic constants (x==0xDEADBEEF) - Difficult to hit randomly
    • Deep paths - Reaching code guarded by many conditions
    • Checksums and hash comparisons - random bytes rarely pass validation

Can we systematically generate inputs that comprehensively explore different program paths?

Background - SAT and SMT solvers

SAT - Given a formula in CNF, find a satisfying assignment

  • SAT is NP-complete, worst case exponential
  • SAT solvers are foundation for SMT solvers
    SMT - Generalizes SAT, boolean variables are replaced by predicates over richer theories
  • Input: First order formula over background theories
  • Output: Is the formula satisfiable? If so provide assignments
    Common theories in program analysis:
  • Arithmetic - linear and nonlinear integer arithmetic
  • Arrays - read/write operations on arrays
  • Bit-vectors - fixed width binary arithmetic (models machine integers exactly)
  • Uninterpreted functions - function equality. f(a) = f(b) if a=b
    Example SMT:
    • Linear arithmetic: x + 2y = 10
      • examples of satisfying (x, y): (8, 1), (6, 2), (4, 3)
    • Arrays: write(A, 3, 42) -> A’
      • A’[x] = A[x] where x != 3
      • A’[x] = 42 where x == 3
    • Query: read(write(A, 3, 42), i) = 42
      • Satisfiable: i = 3

Popular SMT solvers:

  • Z3, STP, and more
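
To make the examples concrete, here is a brute-force "SMT-lite" checker in pure Python. Real solvers like Z3 reason symbolically rather than enumerating; this sketch only works for tiny finite domains. Note that the (x, y) pairs listed above all satisfy x + 2y = 10, which is the formula assumed here:

```python
from itertools import product

def smt_lite(formula, variables, domain=range(0, 11)):
    """Brute-force stand-in for an SMT solver over tiny integer domains:
    return a satisfying assignment, or None if unsatisfiable."""
    for values in product(domain, repeat=len(variables)):
        assignment = dict(zip(variables, values))
        if formula(**assignment):
            return assignment
    return None

# Linear arithmetic: x + 2y = 10
print(smt_lite(lambda x, y: x + 2 * y == 10, ["x", "y"]))
# -> {'x': 0, 'y': 5}

# Array theory, modelled with a Python list: read(write(A, 3, 42), i) == 42
A = [0] * 8
print(smt_lite(lambda i: (A[:3] + [42] + A[4:])[i] == 42, ["i"], range(8)))
# -> {'i': 3}
```

The interface mirrors the SMT contract described above: input is a formula, output is either a satisfying assignment or "unsatisfiable".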
Symbolic Execution

Execute with symbolic inputs instead of concrete values:

  • x = 5 turns into x = α (a symbolic variable)

As execution proceeds

  • Variables hold symbolic expressions
  • At each branch, the engine forks into two paths
  • Each path accumulates a path constraint

Workflow

  • Mark inputs as symbolic
  • Engine executes symbolically
  • At branch: query solver
  • Fork into feasible paths
  • At path end: solve constraint
  • Solver returns concrete input (test case)
    Implementations: KLEE, S2E
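
The workflow above can be illustrated with a toy example. The function `f`, the hand-enumerated path constraints, and the 0..15 domain are all assumptions for illustration; a real engine like KLEE derives the constraints automatically and hands them to an SMT solver instead of brute-forcing:

```python
from itertools import product

def symbolic_paths():
    """Hand-enumerated paths of a toy function
        def f(x, y):
            if x > y:
                if x == 10: BUG
    as (path constraint, label) pairs -- what a symbolic engine forks into."""
    return [
        (lambda x, y: x > y and x == 10, "then/then (BUG)"),
        (lambda x, y: x > y and x != 10, "then/else"),
        (lambda x, y: not (x > y),       "else"),
    ]

def solve(constraint, domain=range(0, 16)):
    """Stand-in for the SMT query: one concrete input per path constraint."""
    for x, y in product(domain, repeat=2):
        if constraint(x, y):
            return (x, y)
    return None

for constraint, label in symbolic_paths():
    print(label, "->", solve(constraint))
```

Each path constraint defines an equivalence class of inputs, so one solved input per class suffices as a test case; the bug-triggering class yields a concrete reproducer like (10, 0).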
                Concrete Execution            Symbolic Execution
Variables       Concrete (x=5, y=3)           Symbolic (x=a, y=b, z=2b)
Paths           One path through the program  Explores all paths
Tests           One input at a time           Each path = equivalence class of inputs
Speed           Fast - native execution       Slower - needs SMT solver queries
Types           Fuzzing, unit testing         Symbolic execution, aka whitebox testing
Symbolic execution finding bugs

For each dangerous operation: insert a check and fork

  • Division: Fork on divisor == 0
  • Array access: Fork on index < 0 or index >= length
  • Pointer dereference: Fork on pointer == NULL
  • Assertion(p): Fork on !p
  • Integer overflow: Fork on result exceeding type bounds
    If the fork is satisfiable:
  • Solver provides a concrete bug-triggering input
    Else:
  • Proves bug can’t happen on this path

Challenges:

  • Loops:
    • Loops with a symbolic limit lead to path explosion
      • A DFS exploration strategy can get stuck forever unrolling a single loop
    • One solution: replace the symbolic limit with a concrete one
  • Pointers
    • A symbolic pointer could point to ANY object in the application
    • Solution: Use other static analyses techniques (like alias analysis) to precompute valid pointer targets
      Challenges Revisited:
  • Path explosion (N branches -> 2^N paths)
  • Loops and recursion (can produce infinite trees)
  • Heap and pointer modeling; symbolic addresses, aliasing analysis
  • SMT solver limitations; e.g. floating point and nonlinear arithmetic are hard
  • Environment modeling - system calls, file I/O, network, concurrency
    Symbolic Exec works best on:
  • Smaller, well bounded programs or functions
  • Programs with limited use of external libraries
  • Unit-level testing rather than whole-system testing

Misc Testing Topics

Continuous Integration (CI)

Code changes are automatically built, tested, and integrated into a shared repo multiple times a day
Process:

  • Developer commits code
  • CI pipeline triggers
    • Project build
    • Run automated tests
  • Integrated into GitHub and Gitlab
    • Github Actions

Containers for CI:

  • Provide uniform build system for continuous integration (No $PATH vars, missing binaries etc.)
  • Automate deployment via continuous deployment (CD)
  • Solves work-on-my-machine issues

Github Actions

  • Triggered when an event occurs
    • Push event, pull request open
  • Workflow contains
    • One or more jobs
    • Each job has its own runner on which it is executed (typically a container)
      • Each job has one or more steps
      • A step is either a script, command, or another action (recursion)
  • GitHub Actions are specified in a YAML file
    • defined in .github/workflows/[NAME].yml file
    • define triggers
    • specify jobs and steps (e.g. build, test)
Mock Testing

Simulate the behavior of real objects in a controlled way
Why?

  • Isolate components for unit testing
  • External systems (APIs, Databases)
  • Complex or time-consuming operations

e.g. mock a sendEmail func without actually emailing a client
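
Mockito and EasyMock are Java frameworks; the same pattern in Python uses the standard library's unittest.mock. The `notify_user`/`send_email` names below are hypothetical, sketching the sendEmail example above:

```python
from unittest.mock import Mock

def notify_user(send_email, user, message):
    """Logic under test; the mailer is injected as a dependency."""
    if not user.get("email"):
        return False
    send_email(user["email"], message)
    return True

# Replace the real mailer with a mock: no email is actually sent.
mailer = Mock()
assert notify_user(mailer, {"email": "a@example.com"}, "hi") is True
mailer.assert_called_once_with("a@example.com", "hi")

assert notify_user(mailer, {}, "hi") is False   # mailer not called again
```

The mock both isolates the unit under test from the external email system and records how it was called, so the test can verify the interaction itself.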

Examples:

  • Mockito
  • EasyMock