
Why does a least squares fit appear to have a bias when applied to simple data?


I used Python to generate a correlated data set for testing, and then plotted a basic linear least-squares fit. The result looked a bit strange to me, because the line doesn't really seem to pass "centrally" through the data - it looks a little "tilted":

So instead I diagonalized the covariance matrix to obtain the eigenvector that gives the direction of maximum variance. This is shown by the black arrow in the figure below, and it does indeed point in the direction I would expect:

So, I am looking for a bit of an intuitive explanation of what is going on here. I know that the two are not measuring the same thing: the fit minimizes the sum of the squared "errors", i.e. the vertical distances between the measured values and the fitted model, whereas the eigenvector is chosen to maximize the variance along its direction.
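
To make that explicit, the two objectives can be written as follows (my own notation, with $\hat\Sigma$ denoting the sample covariance matrix of $(x, y)$):

$$\hat\beta_0,\ \hat\beta_1 \;=\; \arg\min_{\beta_0,\,\beta_1}\ \sum_{i=1}^{n}\bigl(y_i - \beta_0 - \beta_1 x_i\bigr)^2,
\qquad
\mathbf{u}_1 \;=\; \arg\max_{\|\mathbf{u}\|=1}\ \mathbf{u}^\top \hat\Sigma\,\mathbf{u}.$$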

But I guess I am surprised by the result - I would've expected the fit to go through the center of the data cluster and not have this bias. Does it have something to do with the fact that minimizing the vertical distances breaks the symmetry between the x and y directions? Is it inappropriate to apply a basic linear fit to such data?

Here is the code:
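
In outline (the sample size, covariance values, and arrow scaling below are illustrative placeholders rather than the exact values used):

```python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)

# Simulate a correlated 2D Gaussian data set
# (mean, covariance, and sample size are illustrative choices)
mean = [0.0, 0.0]
cov = [[1.0, 0.8],
       [0.8, 1.0]]
x, y = rng.multivariate_normal(mean, cov, size=500).T

# Ordinary least-squares fit of y on x (minimizes vertical residuals)
slope, intercept = np.polyfit(x, y, deg=1)

# Direction of maximum variance: leading eigenvector of the sample covariance matrix
sigma = np.cov(x, y)
eigvals, eigvecs = np.linalg.eigh(sigma)
u1 = eigvecs[:, np.argmax(eigvals)]  # eigenvector for the largest eigenvalue

# Plot the data, the least-squares line, and the principal-axis direction (black arrow)
plt.scatter(x, y, s=5, alpha=0.5)
xs = np.linspace(x.min(), x.max(), 100)
plt.plot(xs, intercept + slope * xs, color="red", label="least-squares fit")
plt.arrow(x.mean(), y.mean(), 2 * u1[0], 2 * u1[1],
          width=0.03, color="black", length_includes_head=True)
plt.legend()
plt.gca().set_aspect("equal")
plt.show()
```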