Calculating Standard Deviation

Most standard books on Biostatistics give two formulae for calculating the standard deviation (SD) of a set of observations.

The population SD (for a finite population) is defined as \sqrt{\frac{\sum_{i=1}^{N}\left ( X_{i} -\mu \right )^{2}}{N}}, where X1, X2, …, XN are the N observations and μ is the mean of the finite population.

The sample SD is defined as \sqrt{\frac{\sum_{i=1}^{n}\left ( x_{i} -\bar{x} \right )^{2}}{n-1}}, where x1, x2, …, xn are the n observations in the sample and \bar{x} is the sample mean.

While teaching a Biostatistics course, it becomes a little difficult to explain to students about why there are two different formulae, and why do we subtract ‘1’ from n when we calculate the SD using sample observations. These are students who are (or will be) collecting samples, performing experiments, obtaining sample data, and analyzing data. Hence it is important that they understand the concept from a practitioners point, rather than a theoretical point of view.

In my experience, the best explanation has been to avoid a theoretical discussion of unbiasedness etc. and to state that if we want to calculate the SD from sample data, we also need to know the true (population) mean. Since the population mean is unknown, we use the same n observations in the sample to first calculate the sample mean \bar{x} as an estimate of the population mean μ. Our sample observations however tend to be more close to the sample mean rather than the population mean and hence the SD calculated using the first formula will underestimate the true variability. To correct for this underestimation we divide by n – 1 instead of n. Then the question of why n – 1? We have already used the n observations to calculate one quantity, i.e., \bar{x} . We are then left with only n – 1degrees of freedom’ for further calculations that use \bar{x} .

I also state three more points to complete the topic:

  1. That the quantity n – 1 is referred to as the degrees of freedom for computing the SD and we say that 1 degree of freedom has been used up while estimating μ using sample data.
  2. For most situations that the students would encounter, the complete population is not available and they would have to work with sample data. So in most (if not all) situations, they should be using the formula with n – 1 to calculate the SD.
  3. As the sample size increases, the differences arising out of using n or n – 1 should be of little practical importance. But it is still conceptually correct to use n – 1.

When the students were then asked as to how they would define degrees of freedom, one of them came up with a definition as ‘number of observations minus number of calculations’ which is pretty much what it is, as far as a practitioner is concerned!

Advertisements

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s