Diff Algorithms - GoKawiil

For software engineers, diffs are a ubiquitous method for representing changes: We use diffs to compare different versions of the same file (e.g., during code review or when trying to understand the history of a file), to visualize the difference of a failing test compared with its expectation, or to apply changes to source files automatically.

Every project I worked on professionally or privately eventually needed a diff to visualize a change or to apply a patch. However, I have never been satisfied with any of the freely available diff libraries. This was never really a problem professionally, but for private projects, I have copied and modified my own library from project to project until I mentioned this to a colleague who set me on the path to publish my Go library (a port of a previous C++ library I used to copy and modify). Boy, did I underestimate how close my library was to publishability!

Anyway, I did it and I learned a whole lot about diff algorithms. You can find my library at znkr.io/diff and what I learned in this article. I am not finished learning yet, so I plan to update this article as my understanding continues to evolve.

Existing Diff Libraries

Let me start by explaining why I am dissatisfied with existing diff libraries. There are a number of attributes that are important to me. Not all of these attributes are important for every use case, but a diff library that I can use for all of my use cases needs to fulfill all of them.

Usually, the input to a diff algorithm is text, and most diff libraries only support that. However, I occasionally have use cases where I need to compare things that are not text. So any diff library that only supports text doesn't meet my needs; instead, I need support for arbitrary sequences.

The resulting diff output is intended to be readable by humans. Quite often, especially for text, a good way to present a diff is in the unified format. However, it's not always the best presentation. A diff library should make it easy to output a diff in unified format, but it should also provide a way to customize the presentation by providing a structured result.

Besides the presentation, the content of a diff should make it easy for humans to understand the diff. This is a somewhat subjective criterion, but there are a number of failure cases that are easily avoided, and there's some research into diff readability to set a benchmark. On the other hand, diffs should be minimal in that they should be as small as possible.

Last but not least, it's important that a diff library has a simple API and provides good performance in both runtime and memory usage, even in worst-case scenarios.

With that, we can evaluate existing diff libraries. For Go, I went through a number of libraries and summarized them.

... continue reading