Overview

I’ve been wanting to do more to automate my binary reverse engineering process. As I really only do in in my spare time for fun, automating the basic parts leaves me with more time to do the more interesting analysis stuff.

So I came up with a simple research ‘question’, that I would have to solve using Ghidra, and its Python-based Scripting API.

Hypothesis

Given a native Windows DLL, you can tell the difference between a malicious and benign file, but comparing the size of code that gets run when the library is loaded.

The idea is that malware authors will either:

  • Do more than the ‘normal’ entrypoint, as they are doing tricky anti-RE checks and downloading 2nd stagers; OR
  • Do less than the ‘normal’ entrypoint, as they only need to launch their payload and bail, and not prepare to be running as a real library that does stuff.

I had no idea prior to starting this if this could be true, or even how to measure size. I settled on the following process:

  1. Grab some legitimate and malicious DLLs
  2. Look in the DLL’s Optional Header and go to the DLL’s entrypoint function
  3. Measure various things from the start of the function to the end, such as:
    1. The number of instructions in the entrypoint function and all sub functions
    2. The number of functions called by the entrypoint and all sub functions
  4. Compare results (spoiler - results negative, but I learn’t how to script Ghidra for future questions)

With an idea of the steps I needed to take, I set about scripting them up.

Ghidra and Ghidra Scripting

My usual tool for binary analysis tool is Ghidra, for a few reasons:

  1. Being free means people don’t have to pay money to share/replicate/improve any work I share using it, unlike IDA Pro
  2. I haven’t found decomilation using plugins to Radare2 or Binary Ninja to work super well, particulary for Windows binaires
  3. Ghidra came with a bunch of tutorial docs on how to use it, which helped get used to it’s buttonology and interfaces

Ghidra also comes with a powerful scripting API, allowing you run scripts to automate analysis and either display them in the UI, or even run headless without a GUI at all.

There are some annoyances with writing Ghidra scripts, mostly due to the fact they have be written in Jython, which is based upon Python 2.7, and the fact Ghidra runs its own intepreter. Some of these things are

  • No f-strings or newer Python 3.6+ features
  • Can’t easily use external libraries to enhance analysis, without doing runtime path trickery.
    • Plus the libraries would have work in Jython/Python 2.7, which most won’t.
  • The Ghidra inbuit functions that your script will use won’t be on you intepreter path, so no tab-complete/docstrings in your IDE

Fortunetly, there are a couple of ways to make development easier:

  • Using Ghidra pyi Generator fixes the tab-complete/docstrings in your IDE
    • This works in VSCode, PyCharm, and probably other IDEs as well
  • Gidra Bridge marshals code from Python3 to Ghidra’s Jython intepreter, which allows you to write scripts in real Python 3, do any external library loads, etc.
    • The mashalling does slow scripts down though, so for simple scripts like mine stick to Jython

Running the Script

The final Ghidra Script I write is here: https://github.com/pathtofile/ghidra_scripting/blob/master/count_entrypoint.py The script finds the entrypoint function, and then recusivly checks this function and all sub functions and counts:

  1. Each time a function is called
  2. The number of ‘addresses’ in each function
    • This isn’t “number of instructions”, but it will be of a similar scale, so for our purposes it fits

It then tallies up:

  1. Total number of addresses
  2. Number of unique addresses
  3. Total number of functions called

To launch the script, I first got a bunch of DLLs from various places, including Windows, Firefox, Metepreter, and some real malware samples.

I then ran the following command over each of them to quickly and headlessly anaylyse them:

<path/to/ghidra/analyzeHeadless.bat> <projects_folder> testproj -postscript countentryponit.py -import "<path/to/library.dll>"

Where:

  • path/to/ghidra/analyzeHeadless.bat is the path the Ghidra installation
  • projects_folder and testproj is the path to where my Ghidra Projects live on disk, and the name of an already-created project
    • Unfortunetly I didn’t see a way to have the project also created via the commandline, so I had to first create it using the UI
  • path/to/library.dll is the DLL to analyse

Results

Here’s the output from the first pass of running the script across various DLLs:

My own MSVC dll with an empty dllmain

{
  "total_addresses": 6114,
  "total_addresses_unique": 3197,
  "total_functions": 140,
  "file_size_kb": 11
}

AMSI.dll

{
  "total_addresses": 2888,
  "total_addresses_unique": 2880,
  "total_functions": 44,
  "file_size_kb": 69.5
}

Kernel32.dll

{
  "total_addresses": 28698591,
  "total_addresses_unique": 24926,
  "total_functions": 176062,
  "file_size_kb": 746.5
}

mozavutil.dll (random firefox DLL)

{
  "total_addresses": 3769,
  "total_addresses_unique": 2912,
  "total_functions": 74,
  "file_size_kb": 191.7
}

msimg32.dll (in sys32)

{
  "total_addresses": 418,
  "total_addresses_unique": 418,
  "total_functions": 12,
  "file_size_kb": 8
}

msys-sqlite3eval-0.dll (git)

{
  "total_addresses": 1223,
  "total_addresses_unique": 1223,
  "total_functions": 9,
  "file_size_kb": 10
}

Meterpreter.dll - Basic

Created with msfvenom -p windows/meterpreter/reverse_tcp lhost=127.0.0.1 lport=1111 -a x86 -f dll -o meterp.dll

{
  "total_addresses": 319,
  "total_addresses_unique": 319,
  "total_functions": 12,
  "file_size_kb": 8
}

Meterpreter.dll - Shikata Ga Nai

Created with: msfvenom -p windows/meterpreter/reverse_tcp lhost=127.0.0.1 lport=1111 -a x86 -f dll -o meterp_enc.dll -e x86/shikata_ga_nai

{
  "total_addresses": 319,
  "total_addresses_unique": 319,
  "total_functions": 12,
  "file_size_kb": 8
}

Malware Sample

From this blog: https://inquest.net/blog/2019/01/29/Carving-Sneaky-XLM-Files

{
  "total_addresses": 3426,
  "total_addresses_unique": 3426,
  "total_functions": 2,
  "file_size_kb": 6
}

So far, the results are far from inconclusive: Windows DLLs might call 10s-100s of functions, but so did metepreter. Metepreter’s total_addresses is very different from Windows DLLs, but not the Mozilla DLL.

Conclusion

It’s back to the drawing board with the hypothis - there might be better qeustions to ask, or to focus on specific types of entrypoints, such as ServiceMain and services. But now I have a platform (Ghidra and its Scripting API) to ask more questions.