In statistics, there is the concept of the sum of squares, which is a way of finding out, so to speak, how far your data is dispersed from the average. In short, you take your data set, calculate the average, subtract the average from each item in the data set, square each difference, and add the squares up.

For example, the data set 3, 5, 7 has an average of 5. The sum of squares is (3 – 5)^2 + (5 – 5)^2 + (7 – 5)^2, which equals (-2)^2 + 0^2 + 2^2, or 4 + 0 + 4, or 8.

Now take the data set 2, 5, 8. Both data sets have the same average, but the second has a much larger sum of squares (18), meaning its data is more dispersed.
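The recipe can be sketched in a few lines of Python. This is only a minimal illustration; the function name `sum_of_squares` is made up for the example and takes each difference as the item minus the average.

```python
def sum_of_squares(data):
    """Return the sum of squared differences between each item and the average."""
    mean = sum(data) / len(data)
    return sum((x - mean) ** 2 for x in data)


print(sum_of_squares([3, 5, 7]))  # 8.0
print(sum_of_squares([2, 5, 8]))  # 18.0
```

Same average, different spread: the second data set scores more than twice as high.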

Why do we always square the differences, and not just add up the results of the subtraction? So that no term is ever negative. If we didn’t square the differences, the first data set would have a sum of (-2) + 0 + 2, or 0. Likewise, the second set would have a sum of (-3) + 0 + 3, or 0. By squaring, we prevent them from canceling out.
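To see the canceling in action, here is a small Python sketch that prints the raw differences, their plain sum, and their squared sum for both data sets. The helper name `deviations` is hypothetical, and each difference is taken as the item minus the average.

```python
def deviations(data):
    """Return each item's difference from the average (item minus average)."""
    mean = sum(data) / len(data)
    return [x - mean for x in data]


for data in ([3, 5, 7], [2, 5, 8]):
    diffs = deviations(data)
    plain_sum = sum(diffs)                   # differences cancel out
    squared_sum = sum(d * d for d in diffs)  # squares cannot cancel
    print(data, diffs, plain_sum, squared_sum)
```

The plain sum comes out as zero for both data sets, so it tells you nothing about the spread; only the squared sum tells them apart.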

In short, the sum of squares is never negative. Here endeth the lesson.

So, with this brief primer on the sum of squares out of the way, pretend you are a statistics programmer, charged with writing small programs to compute statistics of various data sets, and your sum of squares occasionally gives a result that is very, very negative. What’s the problem?

If you answered “arithmetic overflow”, you are correct, and are clearly overqualified to work in the field of statistics, programming, or both.
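Here is a hedged sketch of how that can happen. Python's own integers never overflow, so the example emulates a 32-bit signed accumulator that wraps around (as Java's `int` does) using `ctypes.c_int32`. The 32-bit width, the function name `sum_of_squares_int32`, and the sample data are all assumptions made for illustration; the story above doesn't say what language or integer size was involved.

```python
import ctypes


def sum_of_squares_int32(data):
    """Sum of squares accumulated in an emulated 32-bit signed integer.

    Assumes integer data whose average is also an integer, so the
    arithmetic stays exact and only the accumulator can misbehave.
    """
    mean = sum(data) // len(data)
    total = 0
    for x in data:
        # c_int32 keeps only the low 32 bits, emulating wraparound overflow.
        total = ctypes.c_int32(total + (x - mean) ** 2).value
    return total


print(sum_of_squares_int32([3, 5, 7]))            # 8
print(sum_of_squares_int32([0, 40_000, 80_000]))  # -1094967296
```

The true answer for the second data set is 3,200,000,000, which is larger than the biggest 32-bit signed value (2,147,483,647), so the total wraps around into negative territory. A wider accumulator, such as a 64-bit integer or a floating-point type, would push the problem much further out.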
