Sunday, August 26, 2012

Central Tendency - Statistics

In this first post of my new Statistics series, I am going to give a refresher on the most common statistics you have ever done (and perhaps didn't realize were actually statistics): measures of central tendency.  I've previously done posts that briefly described the functions that measure center (e.g., mean, median, mode), but here I am going to compile them all together in one place and provide perhaps a better explanation of these statistics concepts.


The first statistic that I include here is the most common statistic with which you have likely ever worked.  You probably know it by the name "average," but in the field of statistics, you will find it referred to as the "mean," "arithmetic mean," or "arithmetic average."  It probably doesn't need much of an explanation, as most students learn how to calculate averages very early on in school!  It represents a calculated measure of the center of a distribution of values, obtained simply by adding up all of the values and then dividing that sum by the number of values you added together.  (It is important to be aware that there are different types of means in statistics: sample and population means.  I describe these in more detail in a separate post.  For the sake of demonstration, consider the math in this post to describe samples instead of populations.)

There are a couple of important points to make about the notation involved in calculating means.  The first is regarding the actual mathematical symbol for mean (because you don't want to always have to write down the word "mean" in your solutions!).  The symbol for mean is written as an x (or whatever variable you are using) with a small horizontal bar over it, like this:

$$\bar{x}$$

You say this symbol as "x bar."  You can use and will see this notation wherever an arithmetic mean value is being used in statistical analysis and calculations.  It is extraordinarily common, yet it can appear confusing at first to a student who is new to statistics, because it looks like nothing they have ever dealt with before.

In addition to this, there is a second notation that you will see that may need an explanation first.  This notation is used to describe the arithmetic mean formula.  I explained the concept and process of calculating a mean above, but here is one way in which you could write this down in your work:

$$\bar{x} = \frac{x_1 + x_2 + \cdots + x_n}{n}$$

Mathematically, this simply says that the mean is equal to the sum of all your values (x1 all the way up to xn, the last value) divided by n, the total number of values that you are adding up.  This average formula could also be represented in another way, like this:

$$\bar{x} = \frac{1}{n} \sum_{i=1}^{n} x_i$$

This formula for mean is saying the same thing as the previous one.  The 1/n part is the same in both equations (in the first, dividing by n is the same as multiplying by 1/n).  The fancy capital E-looking thing is the Greek capital letter sigma (which is not equivalent to E, but rather to S), and in math, it means to "sum up all of the terms that follow."  And the xi part represents each of the values of x in turn.  So the sigma would start with x1, then add x2, then add x3, and so on, for all n values of x.  (I will do a separate post on sigma notation to perhaps explain this a bit better, with more examples.)
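To make the sigma notation a little more concrete, here is a minimal Python sketch that spells the summation out as a loop.  The function name `mean_by_sigma` and the sample values are my own illustrative assumptions, not part of any library.

```python
def mean_by_sigma(values):
    """Arithmetic mean, written to mirror the (1/n) * sigma(x_i) formula."""
    n = len(values)
    if n == 0:
        raise ValueError("the mean of an empty dataset is undefined")
    total = 0
    for x_i in values:  # sigma: start with x1, then add x2, then x3, ...
        total += x_i
    return total / n  # dividing by n is the same as multiplying by 1/n


sample = [2, 4, 6]  # hypothetical data, chosen only for illustration
print(mean_by_sigma(sample))  # (2 + 4 + 6) / 3
```

The loop is exactly the "start with x1, then add x2" process described above; in everyday code you would more likely write `sum(values) / len(values)`.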

An important concept to understand about the mean is just what exactly it represents, and how it can be influenced by its dataset.  For a collection of values that are similar, the mean will provide a fairly reasonable measure of the center of this data.  However, if you consider the inclusion of any extreme values, you can see how this would cause the arithmetic average to be biased in their direction.  The more extreme the outliers are, the greater their effect on the mean.  Try for yourself to see what I mean.  Consider the dataset of values 1, 2, 3, 4, 5, and then consider the dataset of 1, 2, 3, 4, 20.  The first has a mean of 3, while the second has a mean of 6, so you can see that the mean is pulled in the direction of the outlier.  This is simply a result of how the mean is calculated, and is one of its flaws as a statistical tool.  Similarly, if you have a distribution of values in your dataset that is "skewed" (that is, if you graph the values out, you will see that the graph isn't symmetrical, and it has a tail on one end), the long tail will tend to bias the measurement of the mean in its direction.  Because of these characteristics, the mean is considered not to be a resistant measure (in that it can't resist being pulled by extreme data).  However, despite these points, the mean is an incredibly useful tool for statistics, if for no other reason than that it is so simple to use, and provides a very quick evaluation of how the dataset is centered.
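If you would rather check the outlier effect with code than by hand, a short sketch like the following does it.  The helper `mean` is my own illustrative function; the two datasets are the ones from the paragraph above.

```python
def mean(values):
    """Arithmetic mean: the sum of the values divided by how many there are."""
    return sum(values) / len(values)


similar = [1, 2, 3, 4, 5]
with_outlier = [1, 2, 3, 4, 20]  # same data, but the last value is extreme

print(mean(similar))       # 3.0
print(mean(with_outlier))  # 6.0
```

Changing a single value from 5 to 20 doubles the mean, even though four of the five data points are untouched.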


The median is the second of the three measures of center that I want to talk about here.  Conceptually, I think that it is probably even simpler to understand than the mean.  It's also much easier to calculate.  Whereas the arithmetic mean requires you to perform the calculation I described above (or really keen people know how to use their calculator's mean calculation function!), to determine the median, you don't have to do any mathematical operations at all (beyond, at most, averaging two values)!  Quite basically, the median represents the midpoint of your dataset, the point where half of the data is larger and the other half is smaller.  You don't have to calculate it; you just have to identify it.

To do this, all you need to do is take your dataset, and arrange all of the values in increasing order.  The value in the center is your median, often represented by the capital letter M.  When you have an odd number of values in your dataset, you will be able to find the median very easily.  You can identify it through a quick calculation to find which is the center value, which is simply the (n+1)/2 value in your order, where n is the total number of values in your dataset.  Note that this median formula only tells you where in the order your median is located, not the value of the median.  If you have an even number of values, then your median is the mean of the two center values (using the same calculation to determine the location, you'll get a location of 4.5, for example, indicating that the median is the mean of the values at locations 4 and 5).  So, in this case, your median does not necessarily have to be one of your data points; it is instead the average of the middle two.
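The procedure above can be sketched in Python as follows.  This is only one way to write it; the function name `median_by_position` and the example data are my own assumptions.  Note that the (n+1)/2 location counts from 1, while Python lists count from 0.

```python
def median_by_position(values):
    """Median found by sorting and locating the (n + 1) / 2 position."""
    ordered = sorted(values)  # arrange the values in increasing order
    n = len(ordered)
    if n == 0:
        raise ValueError("the median of an empty dataset is undefined")
    position = (n + 1) / 2  # 1-based location of the median
    if n % 2 == 1:
        # Odd n: the location is a whole number, so the median is a data point.
        return ordered[int(position) - 1]
    # Even n: the location ends in .5 (e.g. 4.5), so average its two neighbours.
    lower = ordered[int(position) - 1]  # e.g. the value at location 4
    upper = ordered[int(position)]      # e.g. the value at location 5
    return (lower + upper) / 2


print(median_by_position([7, 1, 5, 3, 9]))            # 5
print(median_by_position([1, 2, 3, 4, 5, 6, 7, 8]))   # 4.5
```

Python's standard library also offers `statistics.median`, which gives the same answers for these inputs.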

Determining the median can be a very tedious process if you have a very large dataset.  In these cases, spreadsheet software will come in extremely handy!  With it, you can automatically sort your values, and then identify the one(s) you require.  For small datasets, on the other hand, it takes very little effort to sort through and rearrange the values, making the median another very simple and useful statistical tool to evaluate central tendency.

There are a few differences to consider when comparing the mean and the median.  Since the mean uses every actual data value in its calculation, it is influenced more by extreme or skewed data.  The median depends only on the value(s) in the middle of the order, so for such data it will usually represent a better estimate of the center of the distribution.  In this sense, the median can be considered to be a more resistant measure than the mean.  If you have a symmetric distribution of data, the mean and the median will be very similar.  However, when you have a skewed distribution, the mean will be located more in the long tail of the distribution, further away from the median.  Consider a set of prices in a dataset: if you double the highest price, the median will be the same in both cases, though the doubled price point will push the mean much further towards that extreme end of the distribution.  The mean and the median provide differing assessments of the central tendency of a distribution, but both functions are extremely useful in statistical analysis.
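Here is a small sketch of the price example, using Python's built-in `statistics` module.  The prices themselves and the helper `double_highest` are made up purely for illustration.

```python
from statistics import mean, median


def double_highest(prices):
    """Return a sorted copy of prices with the highest price doubled."""
    ordered = sorted(prices)
    ordered[-1] = ordered[-1] * 2  # only the largest value changes
    return ordered


prices = [10, 12, 15, 18, 20]     # hypothetical prices
doubled = double_highest(prices)  # [10, 12, 15, 18, 40]

print("median:", median(prices), "->", median(doubled))
print("mean:  ", mean(prices), "->", mean(doubled))
```

With these numbers, the median stays at 15 in both cases, while the mean moves from 15 to 19.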


The mode is the third statistical function used to evaluate the center of a dataset.  It is just as easy to determine as the median.  Once again, there is no mathematical operation needed to determine the mode.  That is because the mode is quite simply the value that is most common in your dataset.  If you have arranged all of your data points in increasing order to assess the median, as described above, then it is quite easy to find the mode.  For example, in the dataset 1, 2, 3, 4, 4, 4, 5, the mode is 4 because it is repeated the most often.  See?  Easy!

There are two points to keep in mind: first, you can have more than one mode (if you have a dataset of 1, 2, 2, 3, 4, 4, then you have two modes, 2 and 4); and second, if no value is repeated at all, then there is no mode to the dataset.
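Both of those points can be captured in a short sketch.  The function name `modes` is my own, and returning an empty list for "no mode" is simply how I chose to follow the convention described above (other sources and libraries treat that case differently).

```python
from collections import Counter


def modes(values):
    """Return every mode of the dataset, or an empty list if nothing repeats."""
    counts = Counter(values)  # how many times each value appears
    if not counts:
        return []
    highest = max(counts.values())
    if highest == 1:
        return []  # no value is repeated, so there is no mode (this post's convention)
    return sorted(value for value, count in counts.items() if count == highest)


print(modes([1, 2, 3, 4, 4, 4, 5]))  # [4]
print(modes([1, 2, 2, 3, 4, 4]))     # [2, 4]
print(modes([1, 2, 3]))              # []
```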

It is also good to know that of the three statistical functions I've covered so far, the mode is the only one that can be applied to non-numerical datasets.  For example, the mode could be used to say which shirt colour is the most commonly worn in an office.  The dataset for this could read like: red, white, blue, green, blue, yellow, blue, blue, red, black, and so the mode is blue.  You can't arrange these in increasing order to find a median.  You cannot apply the mean formula.  These concepts don't make any sense when you consider this data!  However, the mode makes perfect sense and is very easy to determine.
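As a quick illustration with the shirt colours listed above, Python's `collections.Counter` can tally non-numerical values just as easily as numbers.  The variable names here are my own.

```python
from collections import Counter

shirts = ["red", "white", "blue", "green", "blue",
          "yellow", "blue", "blue", "red", "black"]

counts = Counter(shirts)                 # tally each colour
colour, count = counts.most_common(1)[0]  # the single most frequent colour

print(colour, count)  # blue 4
```

Nothing in this code needs the colours to have an order or a numeric value, which is exactly why the mode works here while the mean and median do not.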


While we're considering these methods of measuring central tendency, it would also be useful to mention range.  Technically, range doesn't provide any measure of center.  However, it is very useful in evaluating the spread of the data, or how close the values are to the distribution's center.  Range is another simple concept, but it may not be exactly what you would think it should be.  If you have a dataset, or a graph of a distribution, you would be incorrect to say that the range is the low value to the high value.  (This would be similar to the definition of range in graphing, where range is all the y-values on the curve.)  However, range is slightly different in statistics.  Range is the DIFFERENCE between the high and low values of your dataset.  So, for the dataset 5, 6, 7, 10, 13, the range is 13-5, which is 8 (not 5 to 13, as you may be thinking).
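In code, the statistical range is a one-line calculation.  The function name `value_range` is my own choice (picked to avoid clashing with Python's built-in `range`).

```python
def value_range(values):
    """Statistical range: the difference between the highest and lowest values."""
    return max(values) - min(values)


print(value_range([5, 6, 7, 10, 13]))  # 13 - 5 = 8
```

Note that the result is a single number, 8, and not the pair of endpoints 5 and 13.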

So, that is mean, median, mode, and range.  These are some of the most basic and common statistical operations that you will encounter.  Quite likely, you have already used some of these before, and may not have realized that you were actually doing statistical analysis of a dataset.  They are all different, and so they provide different assessments of how your data behaves.  They are all useful in their own way, and each shows strength in analyzing different types of data.  Therefore, it is extremely important that you learn what each stat means, and how to evaluate them.

I hope that this post has been informative and helpful for you!


  1. I like how it all depends on your definition of "center." The mean is the "center" if you want to know where the most "normal" or "average" (in the English sense) case would be. Like if you wanted to know what an average person would've got on the ACT.

    The median is the "center" if you want to find somewhere that will divide your data in half (50% below, 50% above). Like if you wanted to know if you were above or below half of the other test-takers with your own ACT score.

    The mode is the "center" if you want to find the most popular data value. Like if you wanted to know what "most people got" on the ACT.

  2. I think it's important to mention that x-bar is the SAMPLE mean or else it could get confused later with the population mean (mu).


