Tuesday, September 18, 2012

Standard Deviation and Variance - Statistics

I've already discussed several of the ways that you can represent the center of a distribution.  I've also introduced the concepts of variability and spread: in my last post I explained percentiles (and by extension, quartiles), and how these points can be used together with the center and extreme values to give you a five-number summary of your distribution (shown in a box and whiskers plot).  However, a far more common way of describing the spread of your data is by reporting the standard deviation.

The standard deviation provides a different view of the spread of the distribution.  Instead of reporting certain points within the data set (as with the box and whiskers plot), the standard deviation is essentially a measure of how far, on average, each data point sits from the mean of the data set.  Quite literally, it indicates how far the data "deviates" from the mean, or how close the data is to the mean.  Think about what this is saying, and hopefully you can see why it is a good way to describe how widely the data is spread.  For any normal (bell-shaped) distribution, a smaller standard deviation (SD) means a narrower peak, since most of the data is close to the mean.  A larger SD means that the distribution is fatter and more spread out.

Unfortunately, these calculations involve rather tedious formulas if you do the math by hand.  Thankfully, calculators often have shortcuts, but I will leave those for you to explore on your own calculator.

To begin, I first need to make a comment about the notation that I am going to use.  Much like what I said in the discussion of sample mean vs population mean, the standard deviation and variance can be calculated either for a smaller sample drawn from a population or for the entire population.  Because of that, there are different notations to show which one is being analyzed:
  • When talking about a population, standard deviation is denoted by "σ" (sigma). 
  • When talking about a sample, standard deviation is denoted by "s". 
To calculate the standard deviation (manually), you must first find the mean, then find the variance of the data set, and from there you only have one final, simple operation.  Here is a general definition to mathematically describe the variance and standard deviation: The variance is equal to the average squared deviation from the mean, and the standard deviation is simply the square root of the variance.  Sounds simple enough, right?  Here is the formula you use to calculate the sample variance, which (logically enough) has the symbol s²:
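
Written out term by term, the sample variance would look something like the expression below.  The symbols are the usual conventions rather than anything defined earlier in this post: x_1 through x_n are assumed to be the n data values, and x̄ ("x-bar") is assumed to be their sample mean.

```latex
s^2 = \frac{(x_1 - \bar{x})^2 + (x_2 - \bar{x})^2 + \cdots + (x_n - \bar{x})^2}{n - 1}
```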

Notice how this is similar to how you would calculate an average.  You sum up "n" values, and then divide by the number of values.  In this case, however, there is a very important point to make.  When calculating these statistics for a SAMPLE, you divide by "n-1" instead of "n".  This is essentially a correction factor: the deviations are measured from the sample's own mean rather than from the true population mean, so on average they come out a little too small, and dividing by the smaller number "n-1" compensates for that.  When finding a POPULATION variance and standard deviation, it is a truer average, because in that case you do divide by the total number of values, "n".
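
If you would like to see the "n-1" versus "n" difference in numbers, here is a small Python sketch.  The data set is made up purely for illustration, and I am assuming Python's built-in statistics module, where variance/stdev divide by n-1 (sample) and pvariance/pstdev divide by n (population).

```python
import statistics

# Hypothetical data set, invented so the arithmetic is easy to check:
# the mean is 5 and the squared deviations add up to 32.
data = [2, 4, 4, 4, 5, 5, 7, 9]

# Sample versions divide the summed squared deviations by n - 1.
sample_variance = statistics.variance(data)
sample_sd = statistics.stdev(data)

# Population versions divide by n.
population_variance = statistics.pvariance(data)
population_sd = statistics.pstdev(data)

print("sample variance:    ", sample_variance)
print("sample SD:          ", sample_sd)
print("population variance:", population_variance)
print("population SD:      ", population_sd)
```

For this data the population variance works out to 32/8 = 4 (so the SD is 2), while the sample variance is 32/7, or about 4.57.  The sample value is always the larger of the two, and the gap shrinks as n grows.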

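Since the numerator of this expression is just a large sum, we can also rewrite the formula using the shorthand "sigma" notation that I demonstrated previously in my post about means (that is the capital sigma, Σ, for a sum, not the lowercase σ used for the population standard deviation):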
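
In that shorthand, the same sample variance would read as follows.  The index i is just a counter that I am assuming runs over the data values from 1 to n, and x̄ is again assumed to be the sample mean.

```latex
s^2 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1}
```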

As I said above, if you have done all this hard work to find the variance, the standard deviation is a simple step away at this point.  All you need to do is take the square root of both sides.
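
Taking that square root gives something like the expressions below.  The first line is the sample standard deviation; the second is the population version, where I am assuming the common convention of μ for the population mean and dividing by n rather than n-1.

```latex
s = \sqrt{s^2} = \sqrt{\frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1}}

\sigma = \sqrt{\sigma^2} = \sqrt{\frac{\sum_{i=1}^{n} (x_i - \mu)^2}{n}}
```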


I know... these expressions and concepts look messy and complicated.  And if you have a large data set, it can get very cumbersome to do this math by hand.  In those cases, calculators are extremely helpful, or better yet, a spreadsheet program like Excel.  However, as I've said several times, it is very beneficial for you to understand how to do things like this the long way before you learn to take the shortcuts later.  It ensures that you have a solid grasp on the math concepts at work.

In this case, it may help to not think of the formulas as they are written, but rather as what they are trying to do.  In general, you are just adding up the squared distances of each data point from the mean, and then dividing by the number of data points (n for a population, or n-1 for a sample).  By simply doing that, you are finding the variance first, and its square root then gives you the standard deviation.
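
Here is that "long way" spelled out as a short Python sketch, one step per line.  The function name and the test scores are hypothetical, invented just for this illustration; it is one way to follow the steps, not the only one.

```python
import math

def variance_and_sd(values, sample=True):
    """Return (variance, standard deviation), computed the long way."""
    n = len(values)
    mean = sum(values) / n                                 # 1. find the mean
    squared_distances = [(x - mean) ** 2 for x in values]  # 2. square each distance from the mean
    divisor = n - 1 if sample else n                       # 3. n - 1 for a sample, n for a population
    variance = sum(squared_distances) / divisor            # 4. variance = summed squares / divisor
    return variance, math.sqrt(variance)                   # 5. SD = square root of the variance

# Hypothetical test scores (mean = 85, squared distances add up to 500).
scores = [70, 80, 85, 90, 100]

sample_var, sample_sd = variance_and_sd(scores, sample=True)
pop_var, pop_sd = variance_and_sd(scores, sample=False)

print("sample:    ", sample_var, round(sample_sd, 2))
print("population:", pop_var, round(pop_sd, 2))
```

With these scores, treating them as a sample gives a variance of 500/4 = 125 and an SD of about 11.18; treating them as a whole population gives 500/5 = 100 and an SD of exactly 10.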

But you are now probably asking "just what does the standard deviation tell me?"  Beyond the general definition I gave above, here is a bit more information that you will find useful in understanding these concepts better.  If you have calculated the standard deviation as outlined above, you have solved for s (that is, for one standard deviation).  For data that is roughly normal (bell-shaped), about 68% of the data set falls within one standard deviation of the mean, about 95% falls within two SDs (2 times s), and about 99.7% falls within three SDs.  From this, you can now also relate the standard deviation to the general shape of your distribution (i.e. is it wide or narrow?).  I found that Robert Niles' website has quite a helpful discussion on this topic.
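
If you want to check those percentages yourself, here is a quick Python sketch.  It assumes an ideal normal distribution and uses the NormalDist class from Python's statistics module (Python 3.8 or later); the helper function name is my own invention.

```python
from statistics import NormalDist

# An idealized bell curve with mean 0 and standard deviation 1.
standard_normal = NormalDist(mu=0, sigma=1)

def share_within(k):
    """Fraction of a normal distribution lying within k SDs of the mean."""
    return standard_normal.cdf(k) - standard_normal.cdf(-k)

for k in (1, 2, 3):
    print(f"within {k} SD of the mean: {share_within(k):.1%}")
```

This should print roughly 68.3%, 95.4% and 99.7%.  Real data is never perfectly normal, so treat these as rules of thumb rather than guarantees.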

So, hopefully you now have a better understanding of these common statistics!  You will use them countless times in your mathematics studies, so I really hope that I have done a decent job explaining what is going on, or at the very least how to go about doing these analyses!  Remember, I really appreciate the social shares if you found any value in my posts!  Thanks!
