# Percentiles, Quartiles, and Measuring Spread - Statistics

I recently explained several math concepts relating to measuring the center of data: what the mean, median, and mode are, how they differ, and how to obtain each for a set of data.  While these statistics are indispensable when assessing where a data set is centered, they tell you nothing about how that data is spread out, that is, its variability.  To get this information, percentiles and quartiles are frequently used to supplement what you have determined about the center.

To better understand why knowledge about your data's center isn't sufficient, and why we need to know about the spread of the data as well, consider this point.  If you have a data set A composed of values 36, 38, 40, 42, 44, you can easily determine that the mean (center) of this set is 40.  Now consider a data set B that has the values 0, 20, 40, 60, 80.  This set also has a mean of 40.  However, even though these two data sets have the same center, they are obviously very different!  Set A has a much tighter and narrower distribution, whereas set B has a much broader range.  This is where an analysis of the spread of the data is important to better understand your distribution of data.  Reporting both the center and the variability of your data distribution is one of the simplest and most useful analyses there is.
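To make this concrete, here is a minimal Python sketch that computes the mean and the range (highest value minus lowest value) for the two sets above. The `describe` helper is a name invented for this illustration, and the range is used only because it is the simplest possible measure of spread.

```python
from statistics import mean

set_a = [36, 38, 40, 42, 44]
set_b = [0, 20, 40, 60, 80]


def describe(values):
    """Return (mean, range) for a list of numbers (hypothetical helper)."""
    return mean(values), max(values) - min(values)


mean_a, range_a = describe(set_a)
mean_b, range_b = describe(set_b)

print(f"Set A: mean={mean_a}, range={range_a}")  # mean 40, range 8
print(f"Set B: mean={mean_b}, range={range_b}")  # mean 40, range 80
```

Both sets report the same center, but the range of set B is ten times that of set A.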

In many cases, to get a good overall view of your data, you can report the median of your data alongside the first and third quartiles.  In addition to this, it is also good practice to report the lowest and highest (extreme) data points.

With this information, you can construct a box plot (also known as a box and whiskers plot) that conveys all of this information visually, at a glance.  Box plots are extremely helpful, especially when you can present data sets side by side for easy visual comparison.  A box plot has a mark to denote the median, inside a box that represents the range from the first to third quartile.  You can alternatively think of it as being a box of the quartiles, with one edge on the first quartile, the other edge on the third quartile, and a line inside the box to denote the second quartile.  Extending past the edges are lines that end on the extreme values.  Think of the box as representing the bulk of your data (more technically, the middle 50% of your data), which is why it's thicker, while the lines on the ends represent only a thinner amount of your data, or the tails of your distribution.

Follow along below to see an example of what a box and whisker plot looks like.  This hopefully demonstrates all that I've talked about above.  Note that the center, quartiles, and extremes are all easily seen on the box plot graph, and how easy they are to compare when presented like this.

Consider the following two data sets, and then follow along with the analysis:
Set A: 32, 5, 8, 12, 2, 6, 2, 35, 15, 18, 25, 22
Set B: 33, 1, 8, 33, 32, 26, 1, 18, 1, 30, 28, 29

The first thing you have to do for this statistical analysis is arrange the data from lowest to highest.

Set A: 2, 2, 5, 6, 8, 12, 15, 18, 22, 25, 32, 35
Set B: 1, 1, 1, 8, 18, 26, 28, 29, 30, 32, 33, 33

Now, you can just count off the important points.  Here, we have a relatively small number of points in each set, so it is easy to find these important marks.  I will colour them below:

Set A: 2, 2, 5, 6, 8, 12, 15, 18, 22, 25, 32, 35
Set B: 1, 1, 1, 8, 18, 26, 28, 29, 30, 32, 33, 33
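
If you would rather let a computer do the counting, the sketch below finds the five important points (lowest value, first quartile, median, third quartile, highest value) for the two sorted sets. It assumes one common convention: each quartile is taken as the median of the lower or upper half of the sorted data. Other conventions exist and can give slightly different quartiles. The function name `five_number_summary` is chosen just for this example.

```python
from statistics import median


def five_number_summary(values):
    """Return (lowest, Q1, median, Q3, highest).

    Assumption: Q1 and Q3 are the medians of the lower and upper halves
    of the sorted data; for an odd count the middle value is left out
    of both halves.
    """
    ordered = sorted(values)
    n = len(ordered)
    half = n // 2
    lower = ordered[:half]
    upper = ordered[half:] if n % 2 == 0 else ordered[half + 1:]
    return (ordered[0], median(lower), median(ordered), median(upper), ordered[-1])


set_a = [2, 2, 5, 6, 8, 12, 15, 18, 22, 25, 32, 35]
set_b = [1, 1, 1, 8, 18, 26, 28, 29, 30, 32, 33, 33]

summary_a = five_number_summary(set_a)
summary_b = five_number_summary(set_b)

print("Set A:", summary_a)  # (2, 5.5, 13.5, 23.5, 35)
print("Set B:", summary_b)  # (1, 4.5, 27.0, 31.0, 33)
```

With twelve points in each set, every quartile falls between two data points, so under this convention it is reported as the average of its two neighbours.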

It is a good method to draw yourself a number line, and then use it to construct your box and whiskers right above it once you've identified your important points. Here is how you could progress through the creation of this plot, along with the final result.
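
As a rough stand-in for the hand-drawn version, here is a small Python sketch that draws each box plot as one line of text above a shared number line running from 0 to 40. The helper `ascii_box_plot` is invented for this illustration, and the five-number summaries fed into it assume that each quartile is the median of one half of the sorted data, which is only one of several conventions. The axis limits must cover every value in the summary.

```python
def ascii_box_plot(summary, axis_min, axis_max, width=41):
    """Draw one box plot as a single line of text (hypothetical helper).

    summary is (lowest, Q1, median, Q3, highest); all five values must
    lie between axis_min and axis_max.
    """
    low, q1, med, q3, high = summary

    def column(value):
        # Map a data value onto a character column of the number line.
        return round((value - axis_min) * (width - 1) / (axis_max - axis_min))

    row = [" "] * width
    for i in range(column(low), column(high) + 1):
        row[i] = "-"          # whiskers, from lowest to highest value
    for i in range(column(q1), column(q3) + 1):
        row[i] = "="          # box: the middle 50% of the data
    row[column(low)] = "|"    # lowest value
    row[column(high)] = "|"   # highest value
    row[column(med)] = "M"    # median
    return "".join(row)


# Assumed five-number summaries (quartiles taken as medians of each half).
summary_a = (2, 5.5, 13.5, 23.5, 35)
summary_b = (1, 4.5, 27.0, 31.0, 33)

plot_a = ascii_box_plot(summary_a, axis_min=0, axis_max=40)
plot_b = ascii_box_plot(summary_b, axis_min=0, axis_max=40)
print("Set A " + plot_a)
print("Set B " + plot_b)
```

Printing the two rows one above the other gives the same side-by-side comparison as a drawn box plot: the `=` run is the box, `M` marks the median, and the `-` runs are the whiskers.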

As you can see, you can gain a lot of information about your distributions very quickly!  You can see that Set A has a lower center and a narrower spread (less variability).  Compare this to Set B, which has a higher center and a much wider gap between its first and third quartiles.  Note that the extremes are very similar in each case, so despite having similar overall ranges, these distributions are quite different.  One is centered lower and is tighter; the other is centered higher but is broader.  Hopefully that explains quartiles for you!

Now I will briefly extend this concept to percentiles.  Technically, quartiles are only a subset of the percentiles. They represent the 25%, 50%, and 75% (roughly) marks of your set.  Percentiles, on the other hand, can represent any position in your data.  You can talk about the 95th percentile, which indicates the point that is greater than 95% of your data, or the 7th percentile, which is greater than only 7% of your data (and, by implication, smaller than 93% of your data!).  You just have to determine which of your data points represents the percentage you want, and you have it!  If you compare this to how I explained the quartiles above, you can see that it is the exact same concept applied to any point that you want!  The quartiles have a special name simply because they have historically been used the most often.
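
One simple way to "determine which of your data points represents the percentage you want" is the nearest-rank rule: sort the data, multiply the percentage by the number of points, round up, and take the point at that position. The sketch below assumes that rule. It is only one of several percentile conventions, and the function name `percentile_nearest_rank` is made up for this example.

```python
import math


def percentile_nearest_rank(values, p):
    """Return the p-th percentile (0 < p <= 100) using the nearest-rank rule."""
    if not values:
        raise ValueError("values must not be empty")
    if not 0 < p <= 100:
        raise ValueError("p must be in the range (0, 100]")
    ordered = sorted(values)
    # 1-based position of the chosen point in the sorted data.
    rank = math.ceil(p * len(ordered) / 100)
    return ordered[rank - 1]


set_b = [1, 1, 1, 8, 18, 26, 28, 29, 30, 32, 33, 33]

print(percentile_nearest_rank(set_b, 95))  # 33
print(percentile_nearest_rank(set_b, 50))  # 26
print(percentile_nearest_rank(set_b, 7))   # 1
```

Note that this rule always returns an actual data point, so its 50th percentile for Set B (26) differs slightly from a median that averages the two middle values (27). That difference between conventions is why the quartile marks are only "roughly" 25%, 50%, and 75%.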

So, with that information, you hopefully can now understand the importance of measuring the variability of your data in addition to measuring its center.  You can gain a lot of useful information from this simple data analysis, and it provides a great place to start when you are performing any amount of statistical analysis on a distribution of data.

In my next post, I would like to extend my discussion of data variability to include one of the most well-known statistical functions: the standard deviation!