Skip to content
Tech News
← Back to articles

Hunting a 16-year-old SQLite WAL bug with TLA+

read original more articles
Why This Matters

This article highlights the discovery and analysis of a 16-year-old SQLite WAL bug, emphasizing the importance of formal verification methods like TLA+ in identifying long-standing, hard-to-reproduce issues. It underscores the ongoing need for rigorous testing and modeling in the database industry to ensure data integrity and reliability for consumers and developers alike.

Key Takeaways

This article was written by Marco Manino and Alberto Carretero, dqlite team at Canonical.

1. Anatomy of a SQLite bug

Recently SQLite published a new version with a fix to a long-standing bug in the way that the Write Ahead Log (WAL) is checkpointed that leads to the corruption of the database.

The important aspect of this bug is not its real-world impact (which is very low) but how long it has been in the repository, how difficult it was to find it, and how difficult it was to reproduce it. Indeed the bug has been present since 2010, for 16 years! Also, the crucial question for us, the dqlite team, is: can dqlite be affected by this?

In order to find out, we first need to be able to understand the exact sequence of steps that leads to database corruption. To do that, we will be using TLA+ to model SQLite’s behavior and quickly find a trace that allows us to reason about the bug. Then, we will create a different model that describes how dqlite uses sqlite and we will check whether the bug can happen.

2. Small introduction to WAL and checkpoints in SQLite

SQLite uses WAL mode to allow readers to not be blocked by writers. The way it achieves that is by writing to a special staging area called the Write Ahead Log (WAL). Writers can append to the end of the WAL and readers can ignore the new data until it is stable. Eventually, the staging area is moved to the database; this is called a checkpoint. To prevent the WAL from growing indefinitely, a writer will attempt to “reset” it – i.e. overwriting it – if the previous checkpoint was able to move all the pages. If you are curious to learn more you can find a very clear description in the official documentation.

SQLite orchestrates changes to the WAL using locks and shared memory. For our use-case it is enough to think about writing and checkpointing; as such, we only care about two locks:

The checkpoint lock ( CKPT_LOCK ) which is taken before running a checkpoint to prevent multiple from happening at the same time

) which is taken before running a checkpoint to prevent multiple from happening at the same time The write lock ( WRITE_LOCK ) which is taken before appending new pages to the WAL

... continue reading