Measures of Dispersion

Mean, median and mode describe the center of a dataset, but they say nothing about how spread out the values are.

For that we use measures of dispersion, the most common of which are:

  • Range
  • Standard Deviation
  • Variance

Range

The range is simply the difference between the largest and smallest values. If we were measuring the heights of a sample of men, the data might be:

6.5, 6.0, 5.2, 5.0, 5.5, 5.6, 6.2, 5.4

In the sample above, the range is the maximum value of 6.5 minus the minimum value of 5.0, which is 1.5.

To quickly get the min. and max. values from a dataset in Python:

import numpy as np
np.min([6.5, 6.0, 5.2, 5.0, 5.5, 5.6, 6.2, 5.4])
np.max([6.5, 6.0, 5.2, 5.0, 5.5, 5.6, 6.2, 5.4])
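
The range itself is just the difference between those two values. As a minimal sketch (the variable names here are only illustrative), it can be computed by hand or with NumPy's np.ptp ("peak to peak") function:

```python
import numpy as np

heights = [6.5, 6.0, 5.2, 5.0, 5.5, 5.6, 6.2, 5.4]

# Range computed by hand: largest value minus smallest value
manual_range = np.max(heights) - np.min(heights)

# np.ptp ("peak to peak") returns the same max-minus-min difference
ptp_range = np.ptp(heights)

print(manual_range, ptp_range)
```

Both values should agree with the 6.5 minus 5.0 calculation described above.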

Standard Deviation

The standard deviation is the square root of the variance (see below).

The sample standard deviation is calculated by taking the square root of the sum of (x – the sample mean)^2, divided by (n – 1), where n is the sample size:

s = sqrt( Σ (x – sample mean)^2 / (n – 1) )

In Python, we can calculate the standard deviation with NumPy.

(Note: np.std does not divide by n-1 by default, but rather by N, so it returns the population standard deviation, not the sample standard deviation):

np.std([6.5, 6.0, 5.2, 5.0, 5.5, 5.6, 6.2, 5.4]) # Population Std. Deviation

To get the sample standard deviation (meaning divide by (n-1) instead of N), pass the optional ddof=1 parameter in the function call:

np.std([6.5, 6.0, 5.2, 5.0, 5.5, 5.6, 6.2, 5.4], ddof=1) # Sample Std. Deviation
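
To see what ddof actually changes, here is a minimal sketch that computes both versions by hand and compares them with NumPy's results. The variable names are only illustrative, and the same heights are treated first as a whole population and then as a sample:

```python
import numpy as np

heights = np.array([6.5, 6.0, 5.2, 5.0, 5.5, 5.6, 6.2, 5.4])
n = heights.size

# Sum of squared deviations from the mean
squared_devs = np.sum((heights - heights.mean()) ** 2)

population_std = np.sqrt(squared_devs / n)      # divide by N
sample_std = np.sqrt(squared_devs / (n - 1))    # divide by n - 1

# Compare the hand-computed values with NumPy's results
print(np.isclose(population_std, np.std(heights)))
print(np.isclose(sample_std, np.std(heights, ddof=1)))
```

The only difference between the two is the divisor, so the sample standard deviation always comes out slightly larger than the population one for the same data.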

Note: If you use R instead of Python, sd() computes the sample standard deviation (dividing by n-1), NOT the population standard deviation:

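# Code adapted from:
# https://stats.stackexchange.com/questions/25956/what-formula-is-used-for-standard-deviation-in-r
# x and n are not defined in the original snippet; the heights from above are used here
x <- c(6.5, 6.0, 5.2, 5.0, 5.5, 5.6, 6.2, 5.4)
n <- length(x)

# sd in R
sd1 <- sd(x)

# self-written sd
sd2 <- sqrt(sum((x - mean(x))^2) / (n - 1))

# comparison: both values should match
c(sd1, sd2)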

Variance

To calculate the variance we first need to know the mean. The variance is the average of the squared distances from the mean: subtract the mean from each point, square the result, and then average those squares.

Var(X) = E[(X - μ)^2]

Here μ is the population mean.

In Python, the variance can be calculated with the NumPy library like so:

import numpy as np
np.var([6.5, 6.0, 5.2, 5.0, 5.5, 5.6, 6.2, 5.4])  # Population Variance
np.var([6.5, 6.0, 5.2, 5.0, 5.5, 5.6, 6.2, 5.4], ddof=1)  # Sample Variance

Sample vs Population Variance

Sample variance is the sum of (x – sample mean)^2 divided by (n – 1), where n is the sample size.

Population variance is the sum of (x – population mean)^2 divided by N, the population size.
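
The two formulas can be checked against NumPy directly. The following is a minimal sketch, with illustrative variable names, that treats the heights from earlier first as a full population and then as a sample:

```python
import numpy as np

heights = np.array([6.5, 6.0, 5.2, 5.0, 5.5, 5.6, 6.2, 5.4])
n = heights.size

# Squared deviation of each point from the mean
squared_devs = (heights - heights.mean()) ** 2

population_var = squared_devs.sum() / n        # divide by N
sample_var = squared_devs.sum() / (n - 1)      # divide by n - 1

# Compare the hand-computed values with NumPy's results
print(np.isclose(population_var, np.var(heights)))
print(np.isclose(sample_var, np.var(heights, ddof=1)))

# The standard deviation is the square root of the matching variance
print(np.isclose(np.sqrt(sample_var), np.std(heights, ddof=1)))
```

Note that the deviations are squared before they are summed. Without the square, the positive and negative deviations from the mean would cancel out.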
