Software projects often implement "soft delete", maybe with a deleted boolean or an archived_at timestamp column. If customers accidentally delete their data, they can recover it, which makes work easier for customer support teams. Perhaps archived records are even required for compliance or audit reasons.
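For concreteness, here's a minimal sketch of the archived_at flavor, assuming PostgreSQL and a hypothetical documents table:

```sql
-- Hypothetical table using archived_at for soft delete.
-- NULL means the record is live; a timestamp means it was
-- "deleted" at that time and is merely hidden.
CREATE TABLE documents (
    id          bigint GENERATED ALWAYS AS IDENTITY PRIMARY KEY,
    title       text NOT NULL,
    body        text NOT NULL,
    archived_at timestamptz
);
```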
I've run into some trouble with soft delete designs. I'll cover those problems, and ponder how I'd build this differently in the future.
Adding an archived_at column seems to ooze complexity out into queries, operations, and applications. Recovering deleted records does happen, but 99% of archived records are never going to be read.
So, the database tables will accumulate a lot of dead data. Depending on access patterns, that might even be a significant share of the total. I've seen APIs that didn't work well with Terraform, so Terraform would delete + recreate records on every run, and over time that led to millions of dead rows. Your database can probably handle the extra bytes, and storage is fairly cheap, so it's not necessarily a problem, at first.
Hopefully, the project decided on a retention period from the start, and set up a periodic job to clean up those rows. Unfortunately, I'd bet that a significant percentage of projects did neither – it's really easy to ignore the archived data for a long time.
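The cleanup itself doesn't have to be complicated. A minimal sketch, assuming PostgreSQL, the hypothetical documents table from above, and a 90-day retention period:

```sql
-- Run periodically (e.g. from cron) to enforce the retention period.
-- Deleting in limited batches keeps transactions short; rerun until
-- no matching rows remain.
DELETE FROM documents
WHERE id IN (
    SELECT id
    FROM documents
    WHERE archived_at < now() - interval '90 days'
    LIMIT 1000
);
```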
At some point, someone might want to restore a database backup. Hopefully that's for fun and profit and not because you lost the production database at 11 am. If your project is popular, you might have a giant database full of dead data that takes a long time to recreate from a dump file.
Beyond the dead data, archived_at columns complicate everyday queries, operations, and application code. Applications need to make sure they always filter out the archived data that's sitting right next to the live data. Indexes need to exclude archived rows to stay small and useful. Manual queries run for debugging or analytics are longer and more complicated. There's always a risk that archived data accidentally leaks in when it's not wanted. The complexity grows when there are mapping tables involved.
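Partial indexes can keep archived rows out of the hot path, but they're one more thing to remember. A sketch, again assuming PostgreSQL and the hypothetical documents table:

```sql
-- Index only live rows, so archived data doesn't bloat the index.
CREATE INDEX documents_live_title_idx
    ON documents (title)
    WHERE archived_at IS NULL;

-- Every query that should only see live data has to repeat the
-- predicate; forgetting it is exactly how archived data leaks in.
SELECT id, title
FROM documents
WHERE title = 'Quarterly report'
  AND archived_at IS NULL;
```

Some teams hide that predicate behind a view of live rows, which helps, but it moves the problem around rather than removing it.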
Migrations have to deal with archived data too. They may involve more than just schema changes – perhaps you need to fix a mistake with default values, or add a new column and backfill it. Is that going to work on records from 2 years ago? I've done migrations where these questions were not trivial to answer.
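As an illustration of where that bites, here's a sketch of a backfill migration (PostgreSQL again, with a hypothetical word_count column):

```sql
-- Add a new column and backfill it from existing data.
ALTER TABLE documents ADD COLUMN word_count integer;

-- Decision point: this UPDATE runs over archived rows too. Does the
-- computation still make sense for records written years ago, under
-- older validation rules and defaults?
UPDATE documents
SET word_count = array_length(regexp_split_to_array(body, '\s+'), 1)
WHERE word_count IS NULL;

-- This fails if any row, live or archived, was left unfilled.
ALTER TABLE documents ALTER COLUMN word_count SET NOT NULL;
```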
Restoring an archived record is not always as simple as just running SET archived_at = null – creating a record may involve calls to external systems as well. I've seen complex restoration code that was always a buggy, partial implementation of the "create" API endpoint. In the end, we removed the specialized restoration code and required all restoration to go through the standard APIs. That simplified the server implementation, and ensured that old data which had since become invalid could not be restored incorrectly, because restored records have to pass the current validation rules.
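Even at the SQL level, the naive restore sidesteps everything the create path does. A hypothetical example:

```sql
-- Naive restore: flips the flag and skips every check that the
-- "create" path would have run.
UPDATE documents SET archived_at = NULL WHERE id = 42;
-- Any validation rule added after this row was archived (required
-- fields, format checks, registration with external systems) is
-- bypassed, so the restored row may not satisfy the invariants
-- that every newly created row does.
```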
I'm not a fan of the archived_at column approach. It's simple at first, but in my experience, it's full of pitfalls down the line.