The world, and the system we’re trying to analyse (not necessarily a software system, but often in my case), is a complex web of causal factors that influence each other as well as the outcome. Most people ignore this complexity when trying to analyse accidents, e.g. by doing a root cause analysis. The root cause methodology pretends the world is relatively simple and that problems can be traced back to a single, arbitrary root cause. This straightforward model of the world makes analysis easier. It also guarantees that the result is no longer useful in the real world.
A root cause methodology allows us to find a somewhat easy, convenient target, fix it, and then declare victory. In contrast, a proper analysis will turn up a multitude of problems, only some of which we may be able to fix. This may leave a sour taste, especially in the mouths of managers who will ask why we are wasting time finding problems we can’t fix.
However, this is also what’s nice about a more realistic model of the world: any given accident tries to teach us many lessons. Rarely does only one thing go wrong. Not listening to everything an accident is trying to tell us is a waste of a good crisis.
A thorough analysis takes time, so we won’t be able to deeply analyse every accident. The upshot is that when we aim for quality over quantity, we learn multiple lessons from each accident, and we discover more of the factors that are common to multiple accidents. In other words, we analyse fewer accidents, but what we learn from them can be used to prevent more accidents than a shallow analysis would allow. A shallow analysis tends to merely paper over symptoms.
Something people often forget is that we cannot entirely eliminate accidents from complex systems. In addition to preventing accidents, we need to design systems such that the negative impact of an accident is limited. In my experience, the most reliable systems actually tend to break more often, but their failures are less severe and easier to fix.
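One common way to limit that negative impact is to make components fail fast and stay contained, e.g. with timeouts, bulkheads, or circuit breakers. As a minimal, hypothetical sketch (the circuit breaker is a standard pattern; none of the names below come from this text), in Python it might look like:

```python
# Minimal circuit breaker sketch: after repeated failures we stop
# calling the dependency for a while, so one broken component fails
# quickly and locally instead of dragging the whole system down.
# All names and thresholds here are illustrative.
import time


class CircuitBreaker:
    def __init__(self, max_failures=3, reset_after=30.0):
        self.max_failures = max_failures  # failures before opening
        self.reset_after = reset_after    # seconds before a retry
        self.failures = 0
        self.opened_at = None

    def call(self, fn, *args, **kwargs):
        # While open, fail fast instead of piling load on a sick service.
        if self.opened_at is not None:
            if time.monotonic() - self.opened_at < self.reset_after:
                raise RuntimeError("circuit open: failing fast")
            # Cool-down elapsed: allow one trial call ("half-open").
            self.opened_at = None

        try:
            result = fn(*args, **kwargs)
        except Exception:
            self.failures += 1
            if self.failures >= self.max_failures:
                self.opened_at = time.monotonic()  # open the circuit
            raise
        else:
            self.failures = 0  # healthy call: reset the count
            return result
```

The point is not this particular pattern but the design goal: when the protected call breaks, it breaks quickly, visibly, and in one place, which is exactly the kind of frequent-but-low-severity breakage described above.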