Tuesday, September 18, 2012

Standard Deviation and Variance - Statistics

I've already discussed several of the ways that you can represent the center of a distribution. I've also already presented to you the concepts of variability and spread of the data; I explained in my last post percentiles (and by extension, quartiles), and then how these points can be used together with the center and extreme values to provide you with a five-point overview of your distribution (shown in the box and whiskers graphs). However, a far more common way of describing the spread of your data is by reporting the standard deviation.

The standard deviation provides a different view of the spread of the distribution. Instead of reporting certain points within the data set (as with the box and whiskers plot), the standard deviation is essentially an average measure of how far away each data point is from the mean of the data set. Quite literally, it indicates how far the data "deviates" from the mean, or, how close all the data is to the mean. Think about what this is saying, and hopefully you can see how this also appears to be a good way to describe how wide the data is spread. For any normal, or bell-shaped distribution, a smaller standard deviation means that you will have a narrower peak, since most of the data is close to the mean. Alternatively, a large SD would mean that the distribution is fatter and more spread out.

Unfortunately, these calculations require rather tedious formulas, if doing the math by hand. Thankfully, calculators often have shortcuts, but I will leave that with you to explore on your own calculator.

To begin, I first need to make a comment about the notation that I am going to use. Much like what I said in the discussion of sample mean vs population mean, the standard deviation and variance statistical functions can be used when talking about either smaller samples of populations, or entire populations. Being so, they also have differing notations to designate which distribution is being analyzed:

When talking about a population, standard deviation is denoted by "σ" (sigma).
When talking about a sample, standard deviation is denoted by "s".

To calculate the standard deviation (manually), you must first find the mean, and then you find the variance of the data set, and from there you only have one final, simple operation. Here is a general definition to mathematically describe the variance and standard deviation: The variance is equal to the average squared deviation from the mean, and the standard deviation is simply the square root of the variance. Sounds simple enough, right? Here is the formula you use to calculate the sample variance, which (logically enough) has the symbol of s²:

s² = [(x₁ - x̄)² + (x₂ - x̄)² + ... + (xₙ - x̄)²] / (n - 1)

Notice how this is similar to how you would calculate an average. You sum up "n" values, and then divide by the number of values. In this case, however, there is a very important point to make. When calculating these statistics for a SAMPLE, you use the term "n-1" instead of "n". This is essentially a correction factor that is used, to account for the fact that not the entire population was used in the analysis. When finding a POPULATION variance and standard deviation, it is a truer average because in that case, you do divide by the total number of units.
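To make the n versus n-1 distinction concrete, here is a small Python sketch. The data values are made up for illustration (they are not from this post), and the last two lines simply cross-check the hand calculation against Python's built-in `statistics.pvariance` and `statistics.variance` functions.

```python
import statistics

data = [2, 4, 4, 4, 5, 5, 7, 9]  # hypothetical values, chosen only for illustration

mean = statistics.mean(data)
squared_deviations = [(x - mean) ** 2 for x in data]

# Population variance: a true average, dividing by n
population_variance = sum(squared_deviations) / len(data)

# Sample variance: same numerator, but dividing by n - 1
sample_variance = sum(squared_deviations) / (len(data) - 1)

print(f"population variance: {population_variance:.3f}")  # 4.000
print(f"sample variance:     {sample_variance:.3f}")      # 4.571

# Cross-check against the standard library
print(statistics.pvariance(data), statistics.variance(data))
```

Notice that the sample variance is always a little larger than the population variance for the same numbers, because the same sum is divided by a smaller value.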

Since the numerator of this expression is just a large sum, we can also rewrite the formula using the shorthand "sigma" notation that I demonstrated previously in my post about means:

s² = [1 / (n - 1)] × Σ (x_i - x̄)², where the sum runs over i = 1 to n

As I said above, if you have done all this hard work to find the variance, the standard deviation is a simple step away at this point. All you need to do is take the square root of both sides.

s = √( [1 / (n - 1)] × Σ (x_i - x̄)² )
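The whole procedure (mean, then variance, then square root) can be sketched in a few lines of Python. The function name `sample_std` and the data values are assumptions made up for this illustration; the comparison with `statistics.stdev` is only a sanity check.

```python
import math
import statistics

def sample_std(values):
    """Sample standard deviation, following the steps described above."""
    n = len(values)
    if n < 2:
        raise ValueError("need at least two values to divide by n - 1")
    mean = sum(values) / n                                       # step 1: the mean
    variance = sum((x - mean) ** 2 for x in values) / (n - 1)    # step 2: sample variance
    return math.sqrt(variance)                                   # step 3: square root

data = [3, 5, 7, 9, 11]  # hypothetical values; mean 7, sample variance 10

print(f"by hand:        {sample_std(data):.4f}")          # 3.1623
print(f"statistics lib: {statistics.stdev(data):.4f}")    # 3.1623
```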

I know... these expressions and concepts look messy and complicated. And if you have a large data set, it can get very cumbersome to perform this math by hand. In those cases, calculators are extremely helpful, or better yet, a spreadsheet program like Excel. However, as I've said several times, it is very beneficial for you to understand how to do things like this the long way before you learn to take the shortcuts later. It ensures that you have a solid grasp on the math concepts at work.

In this case, it may help to not think of the formulas as they are written, but rather as what they are trying to do. In general, you are just adding up the squared distances of each data point from the mean, and then dividing by the number of data points (n for a population, n - 1 for a sample). By simply doing that, you are finding the variance first, which you can then easily use to determine the standard deviation.

But you now are probably asking "just what does the standard deviation tell me?" Beyond the general definition I gave above, here is a bit more information that you will find useful in understanding these concepts better. If you have calculated the standard deviation as outlined above, you have solved for s (or, for 1 times s). For a normal (bell-shaped) distribution, the range within one standard deviation of the mean contains about 68% of the data set, two SDs (or 2 times s) encompass about 95%, and three SDs include about 99.7%. From this, you can now also relate the standard deviations to the general shape of your distribution (i.e. is it wide or narrow?). I found that Robert Niles' website has quite a helpful discussion on this topic.
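For a normal distribution, these percentages can be computed rather than memorized. The sketch below uses the standard relationship between the normal curve and the error function, `math.erf`; the function name `fraction_within` is made up for this illustration. Keep in mind that these figures apply to normal distributions only, so real data will only roughly follow them.

```python
import math

def fraction_within(k):
    """Fraction of a normal distribution within k standard deviations of the mean."""
    return math.erf(k / math.sqrt(2))

for k in (1, 2, 3):
    print(f"within {k} SD: {fraction_within(k):.1%}")
# within 1 SD: 68.3%
# within 2 SD: 95.4%
# within 3 SD: 99.7%
```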

So, now hopefully you have a better understanding of these common statistics! You will use these stats countless times in your mathematics studies, so I really hope that I have done a decent job explaining what is going on, if not at least just how to go about doing these analyses! Remember, I really appreciate the social shares if you found any value in my posts! Thanks!

Saturday, September 8, 2012

Percentiles, Quartiles, and Measuring Spread - Statistics

I recently explained several math concepts relating to the measurement of the center of data. I explained the differences between the mean, median, and mode, and how to obtain each for a set of data. While these statistical functions are indispensable when assessing how a data set is centered, they tell you nothing about how that data is spread out, or the variability. To get this information, percentiles and quartiles are frequently used to supplement what you have determined about the center.

To better understand why knowledge about your data's center isn't sufficient, and why we need to know about the spread of the data as well, consider this point. If you have a data set A composed of values 36, 38, 40, 42, 44, you can easily determine that the mean (center) of this set is 40. Now consider a data set B that has the values 0, 20, 40, 60, 80. This set also has a mean of 40. However, even though these two data sets have the same center, they are obviously very different! Set A has a much tighter and narrower distribution, whereas set B has a much broader range. This is where an analysis of the spread of the data is important to better understand your distribution of data. Having information about both center and variability of your data distribution is one of the simplest and most useful analyses there is.
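If you want to check this for yourself, here is a quick Python sketch using the two sets above. It reports the mean and the distance between the lowest and highest values; the variable names are just for illustration.

```python
import statistics

set_a = [36, 38, 40, 42, 44]
set_b = [0, 20, 40, 60, 80]

for name, values in (("A", set_a), ("B", set_b)):
    mean = statistics.mean(values)
    spread = max(values) - min(values)   # highest value minus lowest value
    print(f"Set {name}: mean = {mean}, high - low = {spread}")
# Set A: mean = 40, high - low = 8
# Set B: mean = 40, high - low = 80
```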

To start this discussion, I will begin with where we have already been: the median. I explained that the median is the midpoint of your ordered data set. So, 50% of your data points are below your median, and 50% are above it. Now, from here, let me introduce the concept of quartiles first. Personally, when I think of quartiles, I think of quarters immediately. If a median is at the point representing 50% of the way through your data, the quartiles represent the points that are quarters through your set, or (roughly) every 25%. So, the first quartile (called Q1) is at the point 25% of the way along your ordered set of data points (or, it is larger than 25% of the data, but still smaller than 75%). It is better to say this is the median of the data on the left side of your overall median. On the other end of your data is the third quartile (Q3), representing the 75% mark of your data, where it is larger than 75% of your group but 25% is greater still. Similarly, you would find this by finding the median of the data that is to the right of the overall median. And for those astute enough to catch that I skipped over the second quartile, it is true that the second quartile is the same as the median (M), at the 50% mark. So, from this, you can hopefully already see that you can get a good idea of the spread of your data with this brief analysis. (Technically, I refer to these points as 25% and 75% marks, though I recognize that these are approximations. It is more appropriate to think of them as medians of each half of the data, and if you do the math to say they are the x-th value out of n-values, you will find that they are around the 25% or 75% position, but likely not exactly. I say this here not to confuse you, but because I'm sure some of my readers will point it out to me!)

In many cases, to get a good overall view of your data, you can report the median of your data alongside the first and third quartiles. In addition to this, it is also good practice to report the lowest and highest (extreme) data points.

With this information, you can construct a box plot (also known as a box and whiskers plot) that conveys all of this information visually, at a glance. Box plots are extremely helpful, especially when you can present data sets side by side for easy visual comparison. A box plot has a mark to denote the median, inside a box that represents the range from the first to third quartile. You can alternately think of it as being a box of the quartiles, with one edge on the first quartile, the other edge on the third quartile, and a line inside the box to denote the second quartile. Extending past the edges are lines that end on the extreme values. Think of the box as representing the bulk of your data (more technically, the middle 50% of your data), which is why it's thicker, and the lines on the ends represent only a thinner amount of your data, or the tails of your distribution.

Follow along below to see an example of what a box and whisker plot looks like. This hopefully demonstrates all that I've talked about above. Note that the center, quartiles, and extremes are all easily seen on the box plot graph, and how easy they are to compare when presented like this.

Consider the following two data sets, and then follow along with the analysis:
Set A: 32, 5, 8, 12, 2, 6, 2, 35, 32, 15, 18, 25, 22
Set B: 33, 1, 8, 33, 32, 26, 1, 18, 1, 30, 28, 29

The first thing you have to do for this statistical analysis is arrange the data from lowest to highest.

Set A: 2, 2, 5, 6, 8, 12, 15, 18, 22, 25, 32, 32, 35
Set B: 1, 1, 1, 8, 18, 26, 28, 29, 30, 32, 33, 33

Now, you can just count off the important points. Here, we have a relatively small number of points in each set, so it is easy to find these important marks. I will list them below:

Set A: low = 2, Q1 = 5.5, M = 15, Q3 = 28.5, high = 35
Set B: low = 1, Q1 = 4.5, M = 27, Q3 = 31, high = 33

It is a good method to draw yourself a number line, and then use it to construct your box and whiskers right above it once you've identified your important points. Here is how you could progress through the creation of this plot, along with the final result.

As you can see, you can gain a lot of information about your distributions very quickly! You can see that Set A has a lower center, and a narrower spread (less variability). Compare this to Set B, which has a higher center and a much broader range. Note that the extremes are very similar in each case, so despite having similar ranges, these distributions are quite different. One is smaller and tighter, the other is higher but broader. Hopefully that explains quartiles for you!
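The five points for each set can also be found with a few lines of Python. This is only a sketch: the function name `five_number_summary` is made up for this illustration, and it uses the "median of each half" rule for quartiles described earlier (other conventions exist and can give slightly different quartiles). The data are Set A and Set B exactly as first listed above.

```python
import statistics

def five_number_summary(values):
    """Return (low, Q1, median, Q3, high) using the median-of-halves rule."""
    ordered = sorted(values)
    n = len(ordered)
    if n < 2:
        raise ValueError("need at least two values")
    half = n // 2
    lower = ordered[:half]                                     # left of the overall median
    upper = ordered[half + 1:] if n % 2 else ordered[half:]    # right of the overall median
    return (
        ordered[0],
        statistics.median(lower),
        statistics.median(ordered),
        statistics.median(upper),
        ordered[-1],
    )

set_a = [32, 5, 8, 12, 2, 6, 2, 35, 32, 15, 18, 25, 22]
set_b = [33, 1, 8, 33, 32, 26, 1, 18, 1, 30, 28, 29]

print(five_number_summary(set_a))  # (2, 5.5, 15, 28.5, 35)
print(five_number_summary(set_b))  # (1, 4.5, 27.0, 31.0, 33)
```

When a set has an odd number of values, the overall median is left out of both halves; that is one common choice, not the only one.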

Now I will briefly extend this concept to percentiles. Technically, quartiles are only a subset of the percentiles. They represent the 25%, 50%, and 75% (roughly) marks of your set. Percentiles, on the other hand, can represent any position in your data. You can talk about the 95th percentile, which indicates the point that is greater than 95% of the rest of your data, or the 7th percentile where it is greater than only 7% of your data (and by extension, smaller than 93% of your data!). You just have to determine which of your data points represents the percentage you want, and you have it! If you compare this to how I explained the quartiles above, you can see that it is the exact same concept applied to any point that you want! The quartiles have a special name simply because they have historically been used the most often.
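Here is one way this "count along the ordered data" idea could look in Python. It is a sketch under an assumption: it uses the so-called nearest-rank rule, which is just one of several conventions for percentiles, and the function name `percentile_nearest_rank` is made up for this illustration. The data are Set B from the example above.

```python
import math

def percentile_nearest_rank(values, p):
    """Value at the p-th percentile (p a whole number from 1 to 100), nearest-rank rule."""
    if not 0 < p <= 100:
        raise ValueError("p must be between 1 and 100")
    ordered = sorted(values)
    rank = math.ceil(p * len(ordered) / 100)   # 1-based position in the ordered data
    return ordered[rank - 1]

set_b = [33, 1, 8, 33, 32, 26, 1, 18, 1, 30, 28, 29]

print(percentile_nearest_rank(set_b, 7))    # 1
print(percentile_nearest_rank(set_b, 50))   # 26
print(percentile_nearest_rank(set_b, 75))   # 30
print(percentile_nearest_rank(set_b, 95))   # 33
```

Notice that the 50th and 75th percentiles here (26 and 30) are close to, but not the same as, the median and third quartile found with the median-of-halves rule (27 and 31). That is the "roughly" in action: different counting rules land on slightly different points.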

So, with that information, you hopefully can now understand the importance of measuring the variability of your data in addition to measuring its center. You can gain a lot of useful information from this simple data analysis, and it provides a great place to start when you are performing any amount of statistical analysis on a distribution of data.

In my next post, I would like to extend my discussion of data variability to include one of the most well-known statistical functions: the standard deviation!

Thursday, August 30, 2012

Sample Mean vs. Population Mean

I did not mention this in my last post that outlined some basic statistical functions related to central tendency, but when studying stats, it is important to understand that you will be dealing with two kinds of means: a sample mean, and a population mean. Conceptually, they both do the same kind of thing, though their meanings are slightly different. It is a very good idea to know when to use sample mean vs population mean, and I will try to go over these concepts and their uses in this post.

Mathematically, I already explained how you determine a mean value. It is what you have likely always known as an average value, and you can very easily find it using the mean formula. Without actually writing out the equation, you already know that the mean is the sum of all your values, divided by the total number of your values. This is straightforward and nothing new. Here is where I am going to make a distinction that you need to be aware of.

In statistics, you deal with populations. Populations are complete groups of people, of things, of measurements. As an example, you likely know of the population of the planet Earth. That refers to all of the people on the planet. Or, you could have a population of bald eagles in a nesting ground, or a population of Ferrari sports cars manufactured in 2011. Populations refer to the whole group of whatever you are talking about. However, in many cases, you don't have access to data about the entire population. You only have access to a subset of that population... a sample of the population. So, a sample can be considered to be a small part of the population that is representative of that population as a whole.

A sample could also be looked at as only an estimate of the larger population. Samples are frequently sufficient to work with, since having data for an entire population could involve a very complicated and long set of data. The closer your sample size is to your population size, the more accurate this estimate becomes. This is why people tend to question things that are only based upon a few observations... error is higher when sample size is smaller. More observations means less error.

So then, with those definitions in mind, you should hopefully be able to understand what is meant by population mean and sample mean. Literally, a population mean is the average of the entire population, whereas the sample mean is the average of a sample (which represents a larger population). Of course, since this is mathematics, we have different ways to write the notation for these two statistics concepts.

When we are talking about a population mean, where we have data about all of the subjects or measurements of a given population, we represent that data by the Greek letter mu, which looks like a fancy lower-case u:

μ

This is calculated by summing all of the values in the entire population, and then dividing by the total number of values in that population, which is denoted by a capital N for a population.

On the other hand, when we are dealing with a sample mean (a subset that is representative of a whole population), we denote this function by the aforementioned symbol, x-bar:

x̄

As before, we find this by summing all of the values in your set, and then dividing by the total number of values in your set, in this case, the number being denoted by a lower-case n for a sample.

As I mentioned, calculating these values means essentially doing the same thing. However, in stats, it is wise to pay attention to the group that you are analyzing. Making a mistake at this point could lead to much larger errors in any further statistical analysis. Keep in mind that a sample mean is an approximation of a population mean, and that approximation becomes more accurate as the size of your sample (n values) approaches the size of your whole population (N values).
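A small simulation can show the sample mean closing in on the population mean. Everything in this sketch is an assumption for illustration: the "population" is simply the whole numbers 1 to 1000, and the seed is arbitrary. Because the samples are random, a particular larger sample is not guaranteed to be closer than a smaller one, but when n reaches N the two means must agree exactly.

```python
import random
import statistics

population = list(range(1, 1001))        # hypothetical population, N = 1000
mu = statistics.mean(population)         # population mean

rng = random.Random(2012)                # fixed seed so the run is repeatable
for n in (10, 100, 1000):
    sample = rng.sample(population, n)   # n values drawn without replacement
    x_bar = statistics.mean(sample)      # sample mean
    print(f"n = {n:4d}: x-bar = {x_bar:.2f}, mu = {mu:.2f}")
```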

Sunday, August 26, 2012

Central Tendency - Statistics

In this first post of my new Statistics series, I am going to refresh the most common statistics you have ever done (and perhaps didn't realize were actually statistics): measures of central tendency. I've previously done posts that briefly described the functions that measure center (e.g. mean, median, mode), but here I am going to compile them all together in one place, and provide perhaps a better explanation of these statistics concepts.

Mean

The first statistic that I include here is the most common statistic with which you have likely ever worked. You probably know it by the name "average" but in the field of statistics, you will find it referred to by "mean," "arithmetic mean," or "arithmetic average." It probably doesn't need much of an explanation, as most students learn how to calculate averages very early on in school! It represents a calculated measure of the center of a distribution of values, simply obtained by adding up all of the values and then dividing that sum by the number of values you added together. (It is important to be aware that there are different types of means in statistics: sample and population means. I describe these in more detail in a separate post. For the sake of demonstration, consider the math in this post to describe samples instead of populations.)

There are a couple of important points to make about the notation involved in calculating means. The first is regarding the actual mathematical symbol for mean (because you don't want to always have to write down the word "mean" in your solutions!). The symbol for mean is written as an x (or whatever variable you are using) with a small horizontal bar over it, like this:

x̄

You say this symbol as "x bar." You can use and will see this notation wherever an arithmetic mean value is being used in statistical analysis and calculations. It is extraordinarily common, yet can appear confusing at first to a student who is new to statistics, because it looks like nothing they have ever dealt with before.

In addition to this, there is a second notation that you will see that may need an explanation first. This notation is used to describe the arithmetic mean formula. I explained the concept and process of calculating a mean above, but here is one way in which you could write this down in your work:

x̄ = (x₁ + x₂ + ... + xₙ) / n

Mathematically, this simply says that the mean is equal to the sum of all your values (x₁ all the way up to x_whatever) divided by the total number of values that you are adding up. This average formula could also be represented in another way, like this:

x̄ = (1/n) × Σ x_i

This formula for mean is saying the same thing as the previous one. The 1/n part is the same in both equations (in the first, dividing by n is the same as multiplying by 1/n). The fancy capital E-looking thing is the Greek capital letter sigma (which is not equivalent to E, but rather to S), and in math, it means to "sum up everything in the following equation." And the x_i part represents all the values of x. So the sigma would start with x₁, then add x₂, then add x₃, and so on, for all the values of x. (I will do a separate post on sigma notation to perhaps explain this a bit better, with more examples.)

An important concept to understand about the mean is just what exactly it represents, and how it can be influenced by its dataset. For a collection of values that are similar, the mean will provide a fairly reasonable measure of the center of this data. However, if you consider the inclusion of any extreme values, you can see how this would cause the arithmetic average to be biased in its direction. The more extreme the outliers are, the greater their effect on the mean. Try for yourself to see what I mean. Consider the dataset of values 1, 2, 3, 4, 5, and then consider the dataset of 1, 2, 3, 4, 20. You can see that the mean is pulled in the direction of the outlier. This is simply a result of how the mean is calculated, and is one of the flaws of it as a statistical tool. Similarly, if you have a distribution of values in your dataset that are "skewed" (that is, if you graph them out, you will see that the graph isn't symmetrical, and it has a tail on one end), the long tail will tend to bias the measurement of the mean in its direction. Because of these characteristics, the mean is considered not to be a resistant measure (in that it can't resist being pulled by extreme data). However, despite these points, the mean is an incredibly useful tool for statistics, if for no other reason than that it is so simple to use, and provides a very quick evaluation of how the dataset is centered.
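If you would rather let a computer do the adding, here is the mean written out in Python, applied to the two datasets from the paragraph above. The function simply mirrors the "add everything up, then divide by how many there are" recipe; it is a sketch for illustration, and Python's built-in `statistics.mean` does the same job.

```python
def mean(values):
    """Arithmetic mean: the sum of the values divided by how many there are."""
    if not values:
        raise ValueError("cannot take the mean of an empty dataset")
    total = 0
    for x in values:          # the sigma step: add x_1, then x_2, and so on
        total += x
    return total / len(values)

print(mean([1, 2, 3, 4, 5]))    # 3.0
print(mean([1, 2, 3, 4, 20]))   # 6.0
```

Changing a single value from 5 to 20 doubles the mean, even though the other four values did not move at all.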

Median

The median is the second of the three measures of center that I want to talk about here. Conceptually, I think that it is probably even simpler to understand than the mean. It's also much easier to calculate. Whereas the arithmetic mean requires you to perform the calculation I described above (or really keen people know how to use their calculator's mean calculation function!), to determine the median, you don't have to do any mathematical operations at all! Quite basically, the median represents the midpoint of your dataset, the point where half of the data is larger and the other half is smaller. You don't have to calculate it, you just have to identify it.

To do this, all you need to do is take your dataset, and arrange all of the values in increasing size. The value in the center is your median, often represented by the capital letter M. When you have an odd number of values in your dataset, you will be able to find the median very easily. You can identify it through a quick calculation to find which is the center value, which is simply the (n+1)/2 value in your order, where n is the total number of values in your dataset. Note that this median formula only tells you where in the order your median is located, not the value of the median. If you have an even number of values, then your median is represented by the mean of the two center values (using the same calculation above to determine the location, you'll get a location of 4.5, for example, indicating that the median is the mean of the values at locations 4 and 5). So, in this case, your median does not necessarily have to be one of your data points, but instead the average of the middle two.
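The location rule above translates quite directly into Python. This is a sketch for illustration (the example values are made up); in practice, the built-in `statistics.median` does the same thing.

```python
def median(values):
    """Median found by sorting, then using the (n + 1) / 2 location rule."""
    if not values:
        raise ValueError("cannot take the median of an empty dataset")
    ordered = sorted(values)
    n = len(ordered)
    location = (n + 1) / 2                    # 1-based position in the ordered data
    if location.is_integer():                 # odd n: a single center value
        return ordered[int(location) - 1]
    lower = ordered[int(location) - 1]        # e.g. location 4.5 -> position 4
    upper = ordered[int(location)]            # ... and position 5
    return (lower + upper) / 2                # even n: mean of the middle two

print(median([7, 1, 5, 3, 9]))                # 5   (location 3 of 5)
print(median([7, 1, 5, 3, 9, 11, 13, 15]))    # 8.0 (location 4.5: mean of 7 and 9)
```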

Determining the median can be a very tedious process if you have a very large dataset. In these cases, the use of spreadsheet software will come in extremely handy! Then, you can automatically sort your values, and then identify the one(s) you require. For small data sets, on the other hand, it takes very little effort to sort through and rearrange the values, making the median another very simple and useful statistical tool to evaluate central tendency.

There are a few differences to consider when comparing the mean and the median. Since the mean uses the actual data values in its calculation, it is influenced more by extreme or skewed data. In those cases, the median will often represent a better estimate of the center of the distribution. In this sense, the median can be considered to be a more resistant measure than the mean. So, if you have a symmetric distribution of data, the mean and the median will be very similar. However, when you have skewed distributions, the mean will be located more in the long tail of the distribution, further away from the median. Consider, if you have a set of prices in a data set, and then you double the highest price, the median will be the same in both cases, though the doubled price point will push the mean much further away and more towards that extreme end of the distribution. The mean and the median provide differing assessments of the central tendency of a distribution, but both functions are extremely useful in statistical analysis.
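The price example is easy to try out. In this sketch the prices are made-up values chosen for illustration; the only thing that changes between the two lists is that the highest price is doubled.

```python
import statistics

prices = [10, 12, 15, 18, 20]                 # hypothetical prices
doubled = prices[:-1] + [prices[-1] * 2]      # same prices, highest one doubled

for label, values in (("original", prices), ("highest doubled", doubled)):
    print(f"{label}: mean = {statistics.mean(values)}, "
          f"median = {statistics.median(values)}")
# original: mean = 15, median = 15
# highest doubled: mean = 19, median = 15
```

The median stays put at 15 while the mean moves from 15 to 19, which is exactly what "resistant" means here.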

Mode

The mode is the third statistical function used to evaluate the center of a dataset. It is just as easy to determine as the median. Once again, there is no mathematical operation needed to determine the mode. That is because the mode is quite simply the value that is most common in your dataset. If you have arranged all of your data points in increasing order to assess the median, as described above, then it is quite easy to find the mode. For example, in the dataset 1, 2, 3, 4, 4, 4, 5, the mode is 4 because it is repeated the most often. See? Easy!

There are two points to keep in mind: for one, you can have more than one mode (if you have a dataset of 1, 2, 2, 3, 4, 4, then you have two modes, 2 and 4); and second, if no term is repeated at all, then there is no mode to the dataset.

It is also good to know that of these three statistical functions I've covered already, the mode is the only stat that can be applied to non-numerical datasets. For example, the mode could be used to say what colour shirt is the most commonly worn shirt in an office. The dataset for this could read like: red, white, blue, green, blue, yellow, blue, blue, red, black, and so the mode is blue. You can't arrange these in increasing size to find a median. You cannot apply the mean formula. These concepts don't make any sense when you consider this data! However, the mode makes perfect sense and is very easy to determine.
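Counting how often each value appears is a natural job for Python's `collections.Counter`. The function below is a sketch (the name `modes` is made up for this illustration) that follows the conventions described above: it can return more than one mode, and it returns an empty list when no value repeats. Be aware that other tools may treat the "no repeats" case differently; Python's own `statistics.multimode`, for instance, returns every value in that situation.

```python
from collections import Counter

def modes(values):
    """Return the most common value(s), or [] if nothing is repeated."""
    if not values:
        return []
    counts = Counter(values)
    highest = max(counts.values())
    if highest == 1:
        return []                 # no value repeats: no mode, per the convention above
    return [value for value, count in counts.items() if count == highest]

print(modes([1, 2, 3, 4, 4, 4, 5]))   # [4]
print(modes([1, 2, 2, 3, 4, 4]))      # [2, 4]
print(modes([1, 2, 3]))               # []

shirts = ["red", "white", "blue", "green", "blue",
          "yellow", "blue", "blue", "red", "black"]
print(modes(shirts))                  # ['blue']
```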

Range

While we're considering these methods of measuring central tendency, it would also be useful to mention range. Technically, range doesn't provide any sense of measure of center. However, it is very useful in evaluating the spread of the data, or how close the values are to the distribution's center. Range is another simple concept, but it may not be exactly what you would think it should be. If you have a dataset, or a graph of a distribution, you would be incorrect to say that the range is the low value to the high value. (This would be similar to the definition of range in graphing, where range is all the y-values on the curve.) However, range is slightly different in statistics. Range is the DIFFERENCE between the high and low values of your dataset. So, for the dataset 5, 6, 7, 10, 13, the range is 13-5, which is 8 (not 5 to 13, as you may be thinking).
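In Python, the statistical range is a one-liner. The function name `value_range` is made up for this illustration (chosen so that it does not clash with Python's built-in `range`, which does something unrelated).

```python
def value_range(values):
    """Statistical range: the highest value minus the lowest value."""
    return max(values) - min(values)

print(value_range([5, 6, 7, 10, 13]))   # 8
```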

So, that is mean, median, mode, and range. These are some of the most basic and common statistical operations that you will encounter. Quite likely, you have already used some of these before, and may not have realized that you were actually doing statistical analysis of a dataset. They are all different, and so they provide different assessments of how your data behaves. They are all useful in their own way, and each shows strength in analyzing different types of data. Therefore, it is extremely important that you learn what each stat means, and how to evaluate them.

I hope that this post has been informative and helpful for you!

Monday, August 20, 2012

A New School Year is Coming...

With the new school year rapidly approaching (again), I am thinking that it might be a good time to (once again) (try to) begin a cohesive series of posts to discuss a particular unit. I have been finding that a lot of my readers actually arrive on my site through various searches of statistical functions and concepts. I have previously introduced the concepts of mean, median, and mode, and provided a brief discussion about these measures of central tendency, and the situations in which you would use each function. But I think it might be useful to several of my visitors to provide more than just these brief introductions.

Admittedly, I do not have a statistics background. I am much stronger in the various mathematics concepts that I have already covered on my site (e.g. algebra and calculus), and I only ever studied basic stats and probability in university. However, if you consider that I will be explaining these functions essentially as I am learning them, hopefully my explanations will be at the perfect level for others who are just learning them as well. It will be like a study group where we learn together, and can discuss ideas back and forth so that everyone can benefit. I welcome, appreciate, and encourage feedback on all of my posts, and I can say that I hope to get even more feedback about the coming posts. I might also add that I am open to considering guest posts by other math writers, maybe from those who know the subject of statistics better than I do! If anyone is interested, please don't hesitate to contact me to discuss.

My posts will start shortly, as I find available time to put them together. These may take more time to prepare than I usually require, so please have patience, and come back soon. Hopefully, I'll be able to put together a nice package to get Statistics Explained for you!