30 Years of Decompilation and the Unsolved Structuring Problem: Part 1
TL;DR
Retrospection
As 2024 begins, I’ve taken a moment to think about the last year(s) of decompilation research, since I spent most of my 2023 working on decompilation. I’ve largely spent that time working on a fundamental area of decompilation research called control flow structuring. Although I find this area critical to decompilation, I realize most security people have only heard of it in passing. With that in mind, I wanted to take a moment to recap the (short) history of binary decompilation and structuring, loosely based on my Ohio State talk on the same topic. I also mix in some of my advisor’s, Dr. Fish Wang, NDSS talk.
To clarify, this post will be about binary decompilation, or decompilation on programs produced by compiled languages like C, C++, Rust, and Go. I will focus on the history of decompilation as preliminary information needed to understand the control flow structuring. Luckily (or unluckily?), decompilation as a research area has a relatively small history, even considering non-academic approaches like IDA Pro.
Decompiler Origins
Many researchers credit Dr. Cristina Cifuentes with publishing the first work on decompilation in her 1994 dissertation “Reverse Compilation Techniques”. However, other decompilers also existed at this time, and are documented on program-transformation.org. In Cifuentes work, she reaffirms the idea that after we have a disassembled program, in the form of a Control Flow Graph (CFG), we still have more work to do:
“The control flow graph … has no information on high level language control structures such as if..then..elses and while() loops. Such a graph can be converted into a structured high level language graph by means of a structuring algorithm.”
A structuring algorithm takes this CFG, which may already be lifted to an intermediate language (IL), and converts it into high-level structures. In Cifuentes work, the structuring algorithm matches patterns in the graph, referred to as graph schema, and marks them as structures. Some of those schema can be seen below:
You keep identifying structure, bottom-up, until you run out of structures to identify. Take note that it is always possible to exhaust patterns in the graph and make C code because of the special goto structure. Indeed the goto structure can be constructed from any node with an edge, which must be used sparingly for decent code.
... continue reading