
Case study: recovery of a corrupted 12 TB multi-device pool

Why This Matters

This case study highlights the challenges and solutions involved in recovering a severely corrupted 12 TB multi-device btrfs pool, emphasizing the importance of robust recovery tools and potential upstream improvements. The successful recovery with minimal data loss demonstrates the resilience of custom tooling and the need for enhanced native safeguards in filesystem management. It underscores the ongoing need for the Linux storage ecosystem to evolve in handling complex data integrity scenarios, benefiting both developers and end-users.

Key Takeaways

Hello, and thanks in advance for reading.

This is not a bug report. It is a case-study write-up of a recovery effort on a severely corrupted 12 TB multi-device pool, shared here in case any of the observations are useful to btrfs-progs development. The goal is to be constructive, not to complain.

One paragraph summary

A hard power cycle on a 3-device pool (data single, metadata DUP, DM-SMR disks) left the extent tree and free space tree in a state that no native repair path could resolve. A subsequent btrfs check --repair run entered an infinite loop of 46,000+ commits with zero net progress, rotating the 4 backup_roots slots past any pre-crash rollback point. Recovery eventually succeeded through a set of 14 custom C tools built against the internal btrfs-progs API, with a final data loss of about 7.2 MB out of 4.59 TB (0.00016 percent). The pool is now fully operational.
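To illustrate why a long commit loop removes every rollback point, here is a minimal sketch of the 4 backup_roots slots modelled as a ring buffer. It is a simplified model, not the btrfs-progs code: the struct and function names are hypothetical, each slot stores only a generation number, and it assumes every commit overwrites exactly one slot (and that last_good_gen is at least 3).

```c
#include <stdbool.h>
#include <stdint.h>

/* The btrfs superblock keeps 4 backup root slots. */
#define NUM_BACKUP_ROOTS 4

/* Hypothetical simplified model: each slot remembers only its generation. */
struct backup_ring {
    uint64_t gen[NUM_BACKUP_ROOTS];
    unsigned next; /* slot the next commit will overwrite */
};

/* Fill the ring as if the last 4 pre-crash commits ended at last_good_gen.
 * Assumes last_good_gen >= NUM_BACKUP_ROOTS - 1. */
static void ring_init(struct backup_ring *r, uint64_t last_good_gen)
{
    for (unsigned i = 0; i < NUM_BACKUP_ROOTS; i++)
        r->gen[i] = last_good_gen - (NUM_BACKUP_ROOTS - 1) + i;
    r->next = 0; /* slot 0 holds the oldest generation */
}

/* Assumption: one commit overwrites exactly one slot, oldest first. */
static void ring_commit(struct backup_ring *r, uint64_t new_gen)
{
    r->gen[r->next] = new_gen;
    r->next = (r->next + 1) % NUM_BACKUP_ROOTS;
}

/* True if at least one slot still points at a pre-crash generation. */
static bool ring_has_rollback_point(const struct backup_ring *r,
                                    uint64_t last_good_gen)
{
    for (unsigned i = 0; i < NUM_BACKUP_ROOTS; i++)
        if (r->gen[i] <= last_good_gen)
            return true;
    return false;
}

/* Simulate `commits` post-crash commits and report whether rollback survives. */
static bool rollback_survives(uint64_t last_good_gen, unsigned commits)
{
    struct backup_ring r;
    ring_init(&r, last_good_gen);
    for (unsigned i = 0; i < commits; i++)
        ring_commit(&r, last_good_gen + 1 + i);
    return ring_has_rollback_point(&r, last_good_gen);
}
```

Under this model a pre-crash slot survives up to 3 post-crash commits and is gone from the 4th onward, so a loop of 46,000+ commits leaves nothing to roll back to.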

Full analysis

I wrote the case up in a structured way that covers environment, timeline, root cause classification, the bulletproof safety criterion we derived empirically, and 9 specific areas where a relatively small upstream change would have prevented the need for most of the custom tooling.

https://github.com/msedek/btrfs_fixes/blob/main/INCIDENT-ANALYSIS.md

The nine proposed improvement areas, in order of expected impact on operators hitting similar cases:

A. Progress detection in btrfs check --repair so 46,000 commit loops abort with a clear message instead of destroying backup_roots.

B. Symmetric handling of BTRFS_ADD_DELAYED_REF in reinit_extent_tree, matching the existing BTRFS_DROP_DELAYED_REF exemption.
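As a rough illustration of what item A asks for, the sketch below shows one possible shape for a progress guard. It is not taken from btrfs-progs or from the linked analysis: the struct, the function names, and the idea of tracking an "outstanding error count" per commit are all assumptions made for the example.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical guard: tracks whether repair commits reduce the error count. */
struct progress_guard {
    uint64_t best_errors; /* lowest outstanding-error count seen so far */
    unsigned stalled;     /* consecutive commits without improvement */
    unsigned max_stalled; /* abort threshold (hypothetical tunable) */
};

static void guard_init(struct progress_guard *g, uint64_t initial_errors,
                       unsigned max_stalled)
{
    g->best_errors = initial_errors;
    g->stalled = 0;
    g->max_stalled = max_stalled;
}

/* Call after each commit. Returns true when the repair loop should abort
 * because max_stalled consecutive commits brought no net progress. */
static bool guard_after_commit(struct progress_guard *g, uint64_t errors_now)
{
    if (errors_now < g->best_errors) {
        g->best_errors = errors_now; /* real progress: reset the counter */
        g->stalled = 0;
        return false;
    }
    g->stalled++;
    return g->stalled >= g->max_stalled;
}
```

If each commit rotates one backup_roots slot, a threshold below 4 would stop the loop while at least one pre-repair slot is still intact. That choice of threshold is a design assumption for this sketch, not a value proposed in the original post.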
