It’s the third time I’ve fallen into the Bayesian rabbit hole. It always goes like this: I find some cool article about it, it feels like magic, whoever is writing about it is probably a little smug about how much cooler than frequentism it is (and I don’t blame them), and yet I still leave confused about what exactly is happening. This post is a cathartic attempt to force myself to make sense of everything I’ve read so far, and hopefully it will also be useful to the legions out there who surely feel the same way I do.1
Bayesian vs. frequentist statistics: the story of a feud
The frequentist approach is so dominant that when you learn statistics, it’s not named as such, it just is statistics. The Bayesian approach, on the other hand, is this weird niche that only a few people seem reeeeally into. It’s the Haskell of statistics. And just like its programming counterpart, this little tribe of Bayesians is actually right to love it so much.
At its heart, the difference between Bayesian and frequentist statistics is about the philosophical role that probability plays in the framework. In both frameworks, you have parameters (usually some unknown quantities which determine how things behave) and you have data (or observations), which are things you’ve measured.
A simple example would be if you roll a die a bunch of times. The parameter here is the number of faces $n$ (intuitively, we all know the more faces, the less likely a given face will appear), while the data is just the collected faces you see as you roll the die. Let me tell you right now that for my example to make any sense whatsoever, you have to make the scenario a bit more convoluted. So let’s say you’re playing DnD or some dice-based game, but your game master is rolling the die behind a curtain. So you don’t know how many faces the die has (maybe the game master is lying to you, maybe not), all you know is it’s a die, and the values that are rolled. A frequentist in this situation would tell you the parameter $n$ is fixed (although unknown), and the data is just randomly drawn from the uniform distribution $X \sim \mathcal{U}(n)$. A Bayesian, on the other hand, would say that the parameter $n$ is itself a random variable drawn from some other distribution $P$, with its own uncertainty, and that the data tells you what that distribution truly is.
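To make the setup concrete, here is a minimal sketch of the hidden-die scenario. The true number of faces, the number of rolls, and the seed are all made up for illustration:

```python
import random

random.seed(0)
true_n = 12  # the hidden parameter: the game master's die (we pretend not to know this)
rolls = [random.randint(1, true_n) for _ in range(20)]  # the data: faces we observe

# A natural frequentist point estimate: n cannot be smaller than the largest roll.
n_hat = max(rolls)
print("observed rolls:", rolls)
print("point estimate of n:", n_hat)
```

Note how the point estimate alone says nothing about how sure we should be — if the largest roll is 11, the die could have 12 faces, or 20.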
I’m going to pause here for you to take a breath and yell at your screen that it makes no sense. Of course the number of faces is fixed, it’s a die! What Bayesian statistics quantifies with the distribution $P$ is not how random the number of faces is, but how uncertain you are about it. This is the crucial difference and the whole reason why Bayesian statistics is so powerful. In frequentist approaches, uncertainty is often an afterthought, something you just tack on using some sample-to-population formula after the fact. Maybe if you feel fancy you use some bootstrapping method. And whatever interval you get from this is a confidence interval: it doesn’t tell you how likely the parameter is to be within it, but how often intervals constructed this way will contain the parameter. This is a confusing point which makes confidence intervals a very misunderstood concept. In Bayesian statistics, on the other hand, the parameter is not a point but a distribution. The spread of that distribution already accounts for the uncertainty you have about the parameter, and the credible interval you get from it actually tells you how likely the parameter is to be within it.
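The confidence-interval guarantee above is easy to check by simulation. This sketch (all numbers illustrative) repeats an experiment many times — estimating the mean of a normal distribution — and counts how often the standard 95% interval contains the true, fixed parameter:

```python
import random
import statistics

random.seed(1)
true_mean = 10.0  # fixed, unknown-in-principle parameter
trials = 2000
covered = 0

for _ in range(trials):
    sample = [random.gauss(true_mean, 2.0) for _ in range(30)]
    m = statistics.mean(sample)
    se = statistics.stdev(sample) / 30 ** 0.5
    lo, hi = m - 1.96 * se, m + 1.96 * se  # the usual 95% interval
    if lo <= true_mean <= hi:
        covered += 1

print("coverage:", covered / trials)  # roughly 0.95, as promised
```

The ~95% is a statement about the procedure across repetitions, not about any single interval you happen to compute.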
On a more mathematical note, the difference between the two approaches lies in Bayes’ famous theorem, which tells you how conditional probabilities relate to each other:
$$P(A|B)\,P(B) = P(B|A)\,P(A)~.$$
That’s it! If you take this equation and you stick in it the parameters $\theta$ and the data $X$, you get

$$P(\theta|X) = \frac{P(X|\theta)\,P(\theta)}{P(X)}~,$$

which is the cornerstone of Bayesian inference. This may not seem immediately useful, but it truly is. Remember that $X$ is just a bunch of observations, while $\theta$ is what parametrizes your model. So $P(X|\theta)$, the likelihood, is just how likely it is to see the data you have for a given realization of the parameters. Meanwhile, $P(\theta)$, the prior, is some intuition you have about what the parameters should look like. I will get back to this, but it’s usually something you choose. Finally, you can just think of $P(X)$ as a normalization constant, and one of the main things people do in Bayesian inference is literally whatever they can so they don’t have to compute it! The goal is of course to estimate the posterior distribution $P(\theta|X)$, which tells you what distribution the parameter takes. The posterior distribution is useful because
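For the die example, the whole update can be done by brute force over a handful of candidate dice. This is a sketch, with made-up rolls and an assumed uniform prior over a few standard die sizes:

```python
rolls = [3, 1, 6, 4, 2, 6, 5, 1]       # observed faces (illustrative)
candidates = [4, 6, 8, 10, 12, 20]     # possible values of n

prior = {n: 1 / len(candidates) for n in candidates}  # P(n): uniform belief

def likelihood(data, n):
    """P(X|n): each roll has probability 1/n; impossible if any roll exceeds n."""
    if max(data) > n:
        return 0.0
    return (1 / n) ** len(data)

# Bayes' theorem, term by term: posterior ∝ likelihood × prior.
unnormalized = {n: likelihood(rolls, n) * prior[n] for n in candidates}
evidence = sum(unnormalized.values())  # P(X), the dreaded normalization constant
posterior = {n: p / evidence for n, p in unnormalized.items()}

for n in candidates:
    print(f"P(n={n} | data) = {posterior[n]:.3f}")
```

With a discrete parameter, computing $P(X)$ is just a sum, which is why this toy case avoids all the machinery the post alludes to. A d4 is ruled out outright (we saw a 6), and the d6 dominates because every extra face makes the observed rolls less likely.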
- it gives you a clear idea of your uncertainty on the model parametrization,
- you can use it to build the posterior predictive distribution

  $$P(Y|X) = \int P(Y|\theta)\,P(\theta|X)\,\mathrm{d}\theta~,$$

  where $Y$ is new data.
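In the die example the parameter is discrete, so the integral above becomes a sum over candidate dice. A self-contained sketch (same made-up rolls and candidates as before):

```python
rolls = [3, 1, 6, 4, 2, 6, 5, 1]       # observed faces (illustrative)
candidates = [4, 6, 8, 10, 12, 20]     # possible values of n

def likelihood(data, n):
    return 0.0 if max(data) > n else (1 / n) ** len(data)

# Posterior over n, with a uniform prior (as in the earlier computation).
unnorm = {n: likelihood(rolls, n) / len(candidates) for n in candidates}
evidence = sum(unnorm.values())
posterior = {n: p / evidence for n, p in unnorm.items()}

def predictive(y):
    """P(Y=y | X) = sum_n P(Y=y | n) P(n | X): per-die probabilities of the
    next roll, averaged with weights given by our belief in each die."""
    return sum((1 / n if y <= n else 0.0) * posterior[n] for n in candidates)

print("P(next roll = 3):", predictive(3))    # every candidate die can show a 3
print("P(next roll = 15):", predictive(15))  # only possible on the d20
```

Note that `predictive(15)` is small but nonzero: the posterior still assigns some belief to the d20, and the predictive distribution carries that parameter uncertainty along instead of committing to a single best-guess die.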