The mean, median and mode describe the center of a dataset, but they tell us nothing about how spread out the values are.

Instead, we use measures of dispersion. The most common are:

  • Range
  • Standard Deviation
  • Variance

Range

This is simply the difference between the largest and smallest values. If we were measuring the heights of a sample of men, the data might be:

6.5, 6.0, 5.2, 5.0, 5.5, 5.6, 6.2, 5.4

For the sample above, the range is the maximum value of 6.5 minus the minimum value of 5.0, which is 1.5.

To quickly get the min. and max. values from a dataset in Python:

import numpy as np
np.min([6.5, 6.0, 5.2, 5.0, 5.5, 5.6, 6.2, 5.4])  # smallest value: 5.0
np.max([6.5, 6.0, 5.2, 5.0, 5.5, 5.6, 6.2, 5.4])  # largest value: 6.5
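
The range itself is then one subtraction away. As a small sketch (the variable names `heights`, `data_range` and `data_range_ptp` are just illustrative choices, not anything NumPy requires), we can either subtract the two values ourselves or let NumPy's `np.ptp` ("peak to peak") do it in a single call:

```python
import numpy as np

# The sample heights from above (variable name is illustrative)
heights = [6.5, 6.0, 5.2, 5.0, 5.5, 5.6, 6.2, 5.4]

# Range computed by hand: largest value minus smallest value
data_range = np.max(heights) - np.min(heights)

# np.ptp computes the same max-minus-min quantity in one call
data_range_ptp = np.ptp(heights)

print(data_range, data_range_ptp)
```

Both approaches should give 1.5 for this sample.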

Standard Deviation

The standard deviation is the square root of the variance (see below).

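For a sample, it is calculated by summing the squared differences between each value x and the sample mean x̄, dividing that sum by n - 1 (where n is the sample size), and taking the square root:

s = sqrt( Σ(x - x̄)^2 / (n - 1) )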
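
To make the calculation concrete, here is a rough step-by-step version in plain Python, using the heights from the Range section. The variable names are my own, and this is only meant to mirror the description above, not to replace a library function:

```python
import math

# The sample heights from the Range section
heights = [6.5, 6.0, 5.2, 5.0, 5.5, 5.6, 6.2, 5.4]

n = len(heights)                # sample size
sample_mean = sum(heights) / n  # the sample mean

# Squared difference between each value and the sample mean
squared_diffs = [(x - sample_mean) ** 2 for x in heights]

# Divide the sum by (n - 1), then take the square root
sample_std = math.sqrt(sum(squared_diffs) / (n - 1))

print(round(sample_std, 4))  # 0.5148
```

Python's built-in `statistics.stdev` also divides by n - 1, so it should agree with this result.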

In Python, we can calculate the standard deviation with NumPy.

(Note: by default, NumPy's np.std does not divide by n-1. It divides by N, which gives the population standard deviation rather than the sample standard deviation):

np.std([6.5, 6.0, 5.2, 5.0, 5.5, 5.6, 6.2, 5.4]) # Population Std. Deviation

To get the sample standard deviation (dividing by n-1 instead of N), pass the optional ddof=1 parameter in the function call:

np.std([6.5, 6.0, 5.2, 5.0, 5.5, 5.6, 6.2, 5.4], ddof=1) # Sample Std. Deviation

Note: If you use R instead of Python, the built-in sd() function calculates the sample standard deviation (dividing by n-1), NOT the population standard deviation. It works as follows:

# Code adapted from:
# https://stats.stackexchange.com/questions/25956/what-formula-is-used-for-standard-deviation-in-r
x <- c(6.5, 6.0, 5.2, 5.0, 5.5, 5.6, 6.2, 5.4)
n <- length(x)

# sd in R
sd1 <- sd(x)

# self-written sd
sd2 <- sqrt(sum((x - mean(x))^2) / (n - 1))

# comparison: both values should match
c(sd1, sd2)

Variance

To calculate the variance we first need to know the mean. Variance is the average of the squared distances from the mean: subtract the mean from each point, square the result, and then average those squares.

Var(X) = E[(X - μ)^2]

Remember that μ is the population mean.

In Python, the variance can be calculated with the NumPy library like so:

import numpy as np
np.var([6.5, 6.0, 5.2, 5.0, 5.5, 5.6, 6.2, 5.4])  # Population Variance
np.var([6.5, 6.0, 5.2, 5.0, 5.5, 5.6, 6.2, 5.4], ddof=1)  # Sample Variance
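
Since the standard deviation is the square root of the variance, squaring the output of `np.std` should give back the output of `np.var`, as long as the same `ddof` is used for both. A quick sanity check along those lines (the variable names are just illustrative):

```python
import numpy as np

# The sample heights used throughout this post
heights = [6.5, 6.0, 5.2, 5.0, 5.5, 5.6, 6.2, 5.4]

pop_var = np.var(heights)             # divides by N
sample_var = np.var(heights, ddof=1)  # divides by n - 1

# Squaring the standard deviation should recover the variance
pop_matches = np.isclose(np.std(heights) ** 2, pop_var)
sample_matches = np.isclose(np.std(heights, ddof=1) ** 2, sample_var)

print(pop_matches, sample_matches)
```

np.isclose is used instead of == because floating-point results can differ in the last few digits.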

Sample vs Population Variance

Sample variance is the sum of (x - sample mean)^2 divided by n - 1, where n is the sample size.

Population variance is the sum of (x - population mean)^2 divided by N, where N is the population size.
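
The two formulas differ only in the divisor, which is easy to see in code. In this sketch the same eight heights are treated first as an entire population and then as a sample. That is purely for illustration, and the variable names are my own:

```python
# The same heights, treated first as a population and then as a sample
heights = [6.5, 6.0, 5.2, 5.0, 5.5, 5.6, 6.2, 5.4]

n = len(heights)
mean = sum(heights) / n

# Sum of squared differences from the mean (shared by both formulas)
sum_of_squares = sum((x - mean) ** 2 for x in heights)

population_variance = sum_of_squares / n    # divide by N
sample_variance = sum_of_squares / (n - 1)  # divide by n - 1

print(round(population_variance, 4), round(sample_variance, 4))  # 0.2319 0.265
```

Because n - 1 is smaller than n, the sample variance is always a little larger than the population variance computed from the same numbers, and the gap shrinks as the dataset grows.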
