To better understand why knowing your data's center isn't sufficient, and why we need to know about the spread of the data as well, consider this point. If you have a data set A composed of the values 36, 38, 40, 42, 44, you can easily determine that the mean (center) of this set is 40. Now consider a data set B that has the values 0, 20, 40, 60, 80. This set also has a mean of 40. However, even though these two data sets have the same center, they are obviously very different! Set A has a much tighter and narrower distribution, whereas set B has a much broader range. This is where an analysis of the spread of the data becomes important for understanding your distribution. Reporting both the center and the variability of your data distribution is one of the simplest and most useful analyses there is.
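To make that comparison concrete, here is a minimal Python sketch that computes the mean and the range (highest value minus lowest value) for both sets. The helper name `center_and_range` is my own choice for illustration, not anything standard.

```python
from statistics import mean

set_a = [36, 38, 40, 42, 44]
set_b = [0, 20, 40, 60, 80]

def center_and_range(values):
    """Return (mean, range), where range is the highest value minus the lowest."""
    return mean(values), max(values) - min(values)

# Both sets share a mean of 40, but the ranges are 8 and 80.
print("Set A:", center_and_range(set_a))
print("Set B:", center_and_range(set_b))
```

The same center with a tenfold difference in range is exactly why the center alone does not describe a distribution.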
To start this discussion, I will begin with where we have already been: the median. I explained that the median is the midpoint of your ordered data set, so 50% of your data points are below your median and 50% are above it. From here, let me introduce the concept of quartiles. Personally, when I think of quartiles, I immediately think of quarters. If the median is the point 50% of the way through your data, the quartiles are the points that fall at each quarter of your set, or (roughly) every 25%. So, the first quartile (called Q1) is the point 25% of the way along your ordered set of data points (that is, it is larger than 25% of the data, but smaller than the other 75%). More precisely, it is the median of the data on the left side of your overall median. On the other end of your data is the third quartile (Q3), representing the 75% mark of your data: it is larger than 75% of your data, while the remaining 25% is greater still. Similarly, you find it by taking the median of the data to the right of the overall median. And for those astute enough to catch that I skipped over the second quartile, it is true that the second quartile is the same as the median (M), at the 50% mark. So, from this, you can hopefully already see that you can get a good idea of the spread of your data with this brief analysis. (Technically, I refer to these points as the 25% and 75% marks, though I recognize that these are approximations. It is more appropriate to think of them as the medians of each half of the data, and if you do the math to express them as the x-th value out of n values, you will find that they are around 25% or 75% of the way through, but likely not exactly. I say this here not to confuse you, but because I'm sure some of my readers will point it out to me!)
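The "median of each half" idea can be sketched in a few lines of Python. This is only an illustration: the function name `quartiles` is my own, the sample values 1 through 8 are made up, and I am assuming that when the set has an odd number of values the overall median is left out of both halves (one common convention; statistical software often uses slightly different rules and can give slightly different answers).

```python
from statistics import median

def quartiles(values):
    """Return (Q1, M, Q3) using the median-of-halves method.

    Assumption: with an odd number of values, the overall median is
    excluded from both halves. Needs at least two values.
    """
    ordered = sorted(values)
    n = len(ordered)
    half = n // 2
    lower = ordered[:half]                                   # data left of the median
    upper = ordered[half + 1:] if n % 2 else ordered[half:]  # data right of the median
    return median(lower), median(ordered), median(upper)

# Made-up example: eight values, so each half holds four of them.
print(quartiles([1, 2, 3, 4, 5, 6, 7, 8]))  # (2.5, 4.5, 6.5)
```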
In many cases, to get a good overall view of your data, you can report the median of your data alongside the first and third quartiles. In addition to this, it is also good practice to report the lowest and highest (extreme) data points.
With this information, you can construct a box plot (also known as a box and whiskers plot) that conveys all of it visually, at a glance. Box plots are extremely helpful, especially when you can present data sets side by side for easy visual comparison. A box plot has a mark to denote the median, inside a box that spans the range from the first to the third quartile. You can alternately think of it as a box of the quartiles, with one edge on the first quartile, the other edge on the third quartile, and a line inside the box to denote the second quartile. Extending past the edges are lines that end on the extreme values. Think of the box as representing the bulk of your data (more technically, the middle 50% of your data), which is why it's thicker, while the lines on the ends represent only a thinner amount of your data, or the tails of your distribution.
Follow along below to see an example of what a box and whisker plot looks like. This hopefully demonstrates all that I've talked about above. Note that the center, quartiles, and extremes are all easily seen on the box plot graph, and how easy they are to compare when presented like this.
Consider the following two data sets, and then follow along with the analysis:
Set A: 32, 5, 8, 12, 2, 6, 2, 35, 15, 18, 25, 22
Set B: 33, 1, 8, 33, 32, 26, 1, 18, 1, 30, 28, 29
The first thing you have to do for this statistical analysis is arrange the data from lowest to highest.
Set A: 2, 2, 5, 6, 8, 12, 15, 18, 22, 25, 32, 35
Set B: 1, 1, 1, 8, 18, 26, 28, 29, 30, 32, 33, 33
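Now, you can just count off the important points. Here, we have a relatively small number of points in each set, so it is easy to find these important marks. Each sorted list has 12 values, so the median falls halfway between the 6th and 7th values, and each quartile falls halfway between the two middle values of its half:
Set A: lowest 2, Q1 5.5, median 13.5, Q3 23.5, highest 35
Set B: lowest 1, Q1 4.5, median 27, Q3 31, highest 33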
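If you would rather not count by hand, the same marks can be computed from the two sorted lists given above. This is a rough sketch using the median-of-halves approach described earlier; the function name `five_number_summary` and the box-width calculation (Q3 minus Q1) are my own additions for illustration.

```python
from statistics import median

# The two sorted 12-value lists from the example above.
set_a = [2, 2, 5, 6, 8, 12, 15, 18, 22, 25, 32, 35]
set_b = [1, 1, 1, 8, 18, 26, 28, 29, 30, 32, 33, 33]

def five_number_summary(values):
    """Return (lowest, Q1, median, Q3, highest) via median-of-halves.

    Assumption: with an odd count, the overall median is left out of both halves.
    """
    ordered = sorted(values)
    n = len(ordered)
    half = n // 2
    lower = ordered[:half]
    upper = ordered[half + 1:] if n % 2 else ordered[half:]
    return ordered[0], median(lower), median(ordered), median(upper), ordered[-1]

for name, data in (("A", set_a), ("B", set_b)):
    low, q1, med, q3, high = five_number_summary(data)
    # The box width (Q3 - Q1) is how wide the box will be on the plot.
    print(f"Set {name}: low={low} Q1={q1} M={med} Q3={q3} high={high} box width={q3 - q1}")
```

For these lists the box comes out 18 units wide for Set A and 26.5 units wide for Set B.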
It is a good method to draw yourself a number line, and then use it to construct your box and whiskers right above it once you've identified your important points. Here is how you could progress through the creation of this plot, along with the final result.
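As a stand-in for drawing on paper, here is a small sketch that prints a crude text version of a box plot along a number line from 0 to 40, using one character per unit. Everything about it is an assumption made for illustration: the function name `box_plot_row`, the choice of characters, and the fact that fractional positions are simply truncated to the column on their left. The summaries are the ones counted off by hand from the sorted lists above.

```python
def box_plot_row(summary, low=0, high=40):
    """Render one box plot as a line of text, one character per unit.

    summary is (lowest, Q1, median, Q3, highest). Fractional positions are
    truncated, so this is only a rough picture, not a precise chart.
    """
    lowest, q1, med, q3, highest = summary

    def col(x):
        return int(x - low)  # column index on the number line

    row = [" "] * (high - low + 1)
    for i in range(col(lowest), col(highest) + 1):
        row[i] = "-"                                # whiskers out to the extremes
    for i in range(col(q1), col(q3) + 1):
        row[i] = "="                                # box: the middle 50% of the data
    row[col(lowest)] = row[col(highest)] = "|"      # extreme values
    row[col(med)] = "M"                             # median mark
    return "".join(row)

# Five-number summaries counted off by hand from the sorted lists above.
summary_a = (2, 5.5, 13.5, 23.5, 35)
summary_b = (1, 4.5, 27, 31, 33)

print("A " + box_plot_row(summary_a))
print("B " + box_plot_row(summary_b))
```

Printing the two rows one above the other gives the same side-by-side comparison a drawn box plot would, just at lower resolution.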
As you can see, you can gain a lot of information about your distributions very quickly! You can see that Set A has a lower center and a narrower spread (less variability). Compare this to Set B, which has a higher center and a much broader box. Note that the extremes are very similar in each case, so despite having similar ranges, these distributions are quite different. One is lower and tighter, the other is higher but broader. Hopefully that explains quartiles for you!
Now I will briefly extend this concept to percentiles. Technically, quartiles are only a subset of the percentiles. They represent the 25%, 50%, and 75% (roughly) marks of your set. Percentiles, on the other hand, can represent any position in your data. You can talk about the 95th percentile, which indicates the point that is greater than 95% of your data, or the 7th percentile, which is greater than only 7% of your data (and, by implication, smaller than the other 93%!). You just have to determine which of your data points represents the percentage you want, and you have it! If you compare this to how I explained the quartiles above, you can see that it is the exact same concept applied to any point that you want! The quartiles have a special name simply because they have historically been used the most often.
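One simple way to "determine which of your data points represents the percentage you want" is the nearest-rank method, sketched below. To be clear, this is a convention I am choosing for illustration, not the only definition: there are several ways to compute percentiles, and they can disagree slightly on small data sets. The function name `percentile_nearest_rank` is hypothetical, and the data is the sorted Set B from the example above.

```python
import math

def percentile_nearest_rank(values, p):
    """Return the p-th percentile (0 < p <= 100) using the nearest-rank method.

    Assumption: the percentile is the value at 1-based position
    ceil(p * n / 100) in the sorted data, so it is always an actual data point.
    """
    if not values:
        raise ValueError("values must not be empty")
    if not 0 < p <= 100:
        raise ValueError("p must be in the interval (0, 100]")
    ordered = sorted(values)
    rank = math.ceil(p * len(ordered) / 100)  # 1-based position
    return ordered[rank - 1]

set_b = [1, 1, 1, 8, 18, 26, 28, 29, 30, 32, 33, 33]

print(percentile_nearest_rank(set_b, 95))  # 33
print(percentile_nearest_rank(set_b, 7))   # 1
print(percentile_nearest_rank(set_b, 50))  # 26
```

Notice that the 50th percentile here is 26, while the median counted off earlier is 27 (halfway between 26 and 28). Both are reasonable answers; the gap simply shows how different conventions treat a position that falls between two data points.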
So, with that information, you hopefully can now understand the importance of measuring the variability of your data in addition to measuring its center. You can gain a lot of useful information from this simple data analysis, and it provides a great place to start when you are performing any amount of statistical analysis on a distribution of data.
In my next post, I would like to extend my discussion of data variability to include one of the most well-known statistical measures: the standard deviation!