Hello, and thanks in advance for reading.
This is not a bug report. It is a case-study write-up of a recovery effort on a severely corrupted 12 TB multi-device pool, shared in case any of the observations are useful to btrfs-progs development. The intent is constructive, not a complaint.
One-paragraph summary
A hard power cycle on a 3-device pool (data single, metadata DUP, DM-SMR disks) left the extent tree and free space tree in a state that no native repair path could resolve. A subsequent btrfs check --repair run entered a non-terminating loop that accumulated 46,000+ commits with zero net progress, rotating the 4 backup_roots slots past every pre-crash rollback point. Recovery eventually succeeded through a set of 14 custom C tools built against the internal btrfs-progs API, with a final data loss of about 7.2 MB out of 4.59 TB (0.00016 percent). The pool is now fully operational.
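To make the backup_roots exhaustion concrete: the superblock keeps a fixed ring of 4 backup root items, and each committed transaction overwrites the oldest slot. So 4 commits with no net progress are enough to rotate every pre-crash entry out of existence, let alone 46,000. This is a minimal sketch of that ring behavior; the struct and function names are illustrative, not the real superblock layout.

```c
#include <assert.h>
#include <string.h>

#define NUM_BACKUP_ROOTS 4  /* matches BTRFS_NUM_BACKUP_ROOTS on disk */

/* Illustrative stand-ins, not real btrfs-progs types. */
struct backup_root { unsigned long long tree_root_gen; };

struct sb_sketch {
    struct backup_root backups[NUM_BACKUP_ROOTS];
    int next_slot;
};

/* Called once per transaction commit: the oldest slot is overwritten. */
static void record_backup(struct sb_sketch *sb, unsigned long long gen)
{
    sb->backups[sb->next_slot].tree_root_gen = gen;
    sb->next_slot = (sb->next_slot + 1) % NUM_BACKUP_ROOTS;
}

/* Returns 1 if any slot still holds a generation <= pre_crash_gen,
 * i.e. a rollback point from before the repair loop started. */
static int rollback_point_survives(const struct sb_sketch *sb,
                                   unsigned long long pre_crash_gen)
{
    for (int i = 0; i < NUM_BACKUP_ROOTS; i++)
        if (sb->backups[i].tree_root_gen &&
            sb->backups[i].tree_root_gen <= pre_crash_gen)
            return 1;
    return 0;
}
```

Running this forward: one pre-crash commit leaves a usable rollback point, and exactly four post-crash commits are enough to destroy it.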
Full analysis
I wrote the case up in a structured way covering the environment, the timeline, a root-cause classification, the "bulletproof" safety criterion we derived empirically, and nine specific areas where a relatively small upstream change would have eliminated the need for most of the custom tooling.
https://github.com/msedek/btrfs_fixes/blob/main/INCIDENT-ANALYSIS.md
The nine proposed improvement areas, in order of expected impact on operators hitting similar cases:
A. Progress detection in btrfs check --repair, so that 46,000-commit loops abort with a clear message instead of destroying backup_roots.
B. Symmetric handling of BTRFS_ADD_DELAYED_REF in reinit_extent_tree, matching the existing BTRFS_DROP_DELAYED_REF exemption.
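For improvement A, the guard we have in mind is simple: track the remaining-error count across repair commits and abort once some bounded number of consecutive commits fails to reduce it. Nothing like this exists in btrfs check --repair today; the sketch below is a hypothetical shape for it, with all names (check_progress, MAX_STALLED_COMMITS) invented for illustration.

```c
#include <assert.h>

/* Abort repair after this many consecutive no-progress commits
 * (hypothetical threshold; 32 is far below the 46,000 seen here). */
#define MAX_STALLED_COMMITS 32

struct repair_progress {
    long last_errors;     /* error count after the previous commit */
    int stalled_commits;  /* consecutive commits with no improvement */
};

/* Call after each repair commit with the current error count.
 * Returns 0 to keep going, -1 to abort for lack of forward progress. */
static int check_progress(struct repair_progress *p, long errors_left)
{
    if (p->last_errors == 0 || errors_left < p->last_errors)
        p->stalled_commits = 0;   /* first sample, or progress made */
    else
        p->stalled_commits++;     /* error count same or worse */
    p->last_errors = errors_left;
    return p->stalled_commits >= MAX_STALLED_COMMITS ? -1 : 0;
}
```

The key property is that a repair which keeps shrinking the error count can run indefinitely, while a loop stuck at a constant count is cut off after MAX_STALLED_COMMITS commits, well before the backup_roots ring has been rotated thousands of times.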