Geometric Intuition Of Variance Estimators
When dealing with samples $x_1, \dots, x_n$ drawn from a stochastic source that produces independent and identically normally distributed values with unknown mean $\mu$ and variance $\sigma^2$, we naturally want to estimate the mean and variance of the distribution.
The mean can be straightforwardly estimated using the sample mean:

$$\bar{x} = \frac{1}{n} \sum_{i=1}^{n} x_i.$$
This estimator is unbiased, meaning that the expected value of $\bar{x}$ is equal to the true mean $\mu$. In other words, while any single estimate $\bar{x}$ might be off due to the randomness in the samples, $\bar{x}$ does not systematically overestimate or underestimate the true mean.
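A minimal Monte Carlo sketch illustrates this (Python with NumPy; the parameter values and seed are arbitrary choices):

```python
import numpy as np

rng = np.random.default_rng(0)
mu, sigma, n, trials = 3.0, 2.0, 5, 200_000

# Draw many independent samples of size n from N(mu, sigma^2).
samples = rng.normal(mu, sigma, size=(trials, n))
xbar = samples.mean(axis=1)

print(xbar[:3])     # individual estimates scatter around mu = 3
print(xbar.mean())  # averaged over many trials: ~3.0, no systematic bias
```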
When it comes to the variance $\sigma^2$, one might initially think to compute it directly from the samples as

$$\hat{\sigma}^2 = \frac{1}{n} \sum_{i=1}^{n} (x_i - \bar{x})^2.$$
However, this estimator tends to underestimate the true variance. This is because $\sum_{i=1}^{n} (x_i - m)^2$ is minimized exactly at $m = \bar{x}$, as can be seen by differentiating with respect to $m$. So we need to account for the fact that our estimated mean is perfectly centered among the samples (it minimizes the sum of squared distances), while the true mean most probably is not.
On the other hand, if we know the true mean $\mu$, then $\frac{1}{n} \sum_{i=1}^{n} (x_i - \mu)^2$ is indeed an unbiased estimator of the variance.
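Both claims are easy to check numerically. A sketch in the same Monte Carlo style as above (sample size and seed arbitrary): averaging squared distances to the sample mean comes out low, while averaging squared distances to the true mean does not.

```python
import numpy as np

rng = np.random.default_rng(1)
mu, sigma, n, trials = 3.0, 2.0, 5, 200_000

samples = rng.normal(mu, sigma, size=(trials, n))
xbar = samples.mean(axis=1, keepdims=True)

# Squared distances to the *sample* mean: systematically too small.
print(((samples - xbar) ** 2).mean(axis=1).mean())  # ~3.2 = (n-1)/n * sigma^2
# Squared distances to the *true* mean: unbiased.
print(((samples - mu) ** 2).mean(axis=1).mean())    # ~4.0 = sigma^2
```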
The unbiased estimator for the variance in the case of an unknown mean is, somewhat surprisingly,

$$s^2 = \frac{1}{n-1} \sum_{i=1}^{n} (x_i - \bar{x})^2$$
(which works only for $n \geq 2$, for obvious reasons). This adjustment might seem arbitrary at first, but it can be understood intuitively by considering a geometric perspective.
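Numerically, the correction does what it promises. In NumPy this is the `ddof` parameter of `np.var` (`ddof=0` divides by $n$, `ddof=1` by $n-1$); a quick sketch:

```python
import numpy as np

rng = np.random.default_rng(2)
mu, sigma, n, trials = 3.0, 2.0, 5, 200_000
samples = rng.normal(mu, sigma, size=(trials, n))

print(np.var(samples, axis=1, ddof=0).mean())  # ~3.2, biased low
print(np.var(samples, axis=1, ddof=1).mean())  # ~4.0, matches sigma^2 = 4
```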
Instead of thinking of these samples individually, it helps to view them collectively as a single point $x = (x_1, \dots, x_n)$ in $\mathbb{R}^n$. Remember that the projection of any vector $v$ onto a unit vector $u$ is given by $(v \cdot u)\,u$. So the projection of $x$ onto the diagonal unit vector $d = \frac{1}{\sqrt{n}}(1, \dots, 1)$ yields exactly $\bar{x} \cdot (1, \dots, 1)$, the point whose every coordinate is the estimated mean of $x$. This makes sense: we said before that the mean minimizes the sum of squared distances $\sum_{i=1}^{n} (x_i - m)^2$, and this is exactly the squared distance from $x$ to the vector $(m, \dots, m)$, which the projection operator minimizes.
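A quick numerical sanity check of this projection identity (the sample values here are arbitrary):

```python
import numpy as np

x = np.array([1.0, 4.0, 2.0, 7.0])  # the samples, viewed as one point in R^4
n = len(x)
d = np.ones(n) / np.sqrt(n)         # diagonal unit vector

print((x @ d) * d)                  # projection (x . d) d = [3.5 3.5 3.5 3.5]
print(x.mean())                     # 3.5, the sample mean in every coordinate
```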
The unbiased estimator $\frac{1}{n} \sum_{i=1}^{n} (x_i - \mu)^2$, which we would like to mimic without knowing the true $\mu$, corresponds to one $n$-th of the squared distance between $x$ and $\mu \cdot (1, \dots, 1)$. We can now ask: if we know the distance of $x$ from the diagonal, but we don't know where the true mean lies along that diagonal, how far away would we expect $\mu \cdot (1, \dots, 1)$ to be from the projection of $x$ onto the diagonal? In other words, what fraction of the squared distance to the true mean do we expect to point away from the diagonal? After all, that perpendicular component is the only value we know.
By the properties of the normal distribution, we know that $x$ might deviate from the true mean point in any direction with equal probability. That means if we only measured the squared distance to the true mean along one axis, we could already use that as an estimator of the variance, because we can assume that the deviations along all the other axes are independent and, on average, just the same as the one we measured. If we measured along two axes, we would simply take the average of the two squared distances as an estimator for the variance, and so on.
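As a sketch of that claim (again Monte Carlo, arbitrary parameters): measuring the squared deviation from the true mean along just the first $k$ axes and averaging over those axes already yields $\sigma^2$, for any $k$.

```python
import numpy as np

rng = np.random.default_rng(3)
mu, sigma, n, trials = 3.0, 2.0, 6, 200_000
samples = rng.normal(mu, sigma, size=(trials, n))

# Average squared deviation from the true mean over only the first k axes.
for k in (1, 2, n):
    print(k, ((samples[:, :k] - mu) ** 2).mean(axis=1).mean())  # ~4.0 each
```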
Compare that with the case where we do know the true mean but only sample one or two variables. In that situation, taking the average squared distance as an estimator is perfectly fine. Crucially, this doesn't just work along the standard coordinate axes; virtually any orthonormal coordinate system would do. That's because of the rotational invariance of the normal distribution: sampling points and plotting them in $\mathbb{R}^n$ leaves no trace of the coordinate system we used to plot.
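The rotational invariance can itself be checked numerically. A sketch: rotate the deviations from the true mean by a random orthogonal matrix (obtained here via a QR decomposition of a Gaussian matrix) and the per-axis squared deviations still average to $\sigma^2$.

```python
import numpy as np

rng = np.random.default_rng(4)
sigma, n, trials = 2.0, 4, 200_000

deviations = rng.normal(0.0, sigma, size=(trials, n))  # x - mu, per axis
Q, _ = np.linalg.qr(rng.normal(size=(n, n)))           # a random orthogonal matrix
rotated = deviations @ Q.T                             # same points, rotated axes

print((deviations ** 2).mean(axis=0))  # ~[4 4 4 4] in the standard axes
print((rotated ** 2).mean(axis=0))     # ~[4 4 4 4] in the rotated axes too
```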
Now, the distance of $x$ to the diagonal is simply a measurement in $n-1$ axes: we could extend the diagonal to an orthonormal coordinate system, so "distance to the diagonal" just means we know all the coordinates except for the one along the diagonal. By the same reasoning we used before, the average of those $n-1$ squared coordinates is an unbiased estimator of the variance, and that is exactly where the division by $n-1$ in the estimator comes from!
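Finally, a sketch that decomposes the squared distance from $x$ to the true mean point into the part along the diagonal and the part perpendicular to it: the perpendicular part averages to $(n-1)\sigma^2$, which is exactly what the division by $n-1$ compensates for.

```python
import numpy as np

rng = np.random.default_rng(5)
mu, sigma, n, trials = 3.0, 2.0, 5, 200_000

samples = rng.normal(mu, sigma, size=(trials, n))
xbar = samples.mean(axis=1, keepdims=True)

# Perpendicular part: squared distance from x to the diagonal,
# i.e. the sum of squared deviations from the sample mean.
perp = ((samples - xbar) ** 2).sum(axis=1)
# Part along the diagonal: squared distance from the projection
# xbar*(1,...,1) to the true mean point mu*(1,...,1).
along = n * (xbar[:, 0] - mu) ** 2

print(perp.mean())            # ~(n-1) * sigma^2 = 16
print(along.mean())           # ~sigma^2 = 4, one axis' worth
print(perp.mean() / (n - 1))  # ~4.0: dividing by n-1 recovers sigma^2
```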