To get an idea of how spread out the data is, Mean, Median and Mode are not enough on their own, because they only describe the center of the data.
Instead, the Measures of Dispersion are used:
- Range
- Standard Deviation
- Variance
The Range is simply the literal range: the largest value minus the smallest value. If we were measuring heights of a sample of men, the data might be:
6.5, 6.0, 5.2, 5.0, 5.5, 5.6, 6.2, 5.4
In the sample above, the range is the maximum value of 6.5 minus the minimum value of 5.0, which is 1.5.
To quickly get the min. and max. values from a dataset in Python:
import numpy as np
np.min([6.5, 6.0, 5.2, 5.0, 5.5, 5.6, 6.2, 5.4])
np.max([6.5, 6.0, 5.2, 5.0, 5.5, 5.6, 6.2, 5.4])
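With the minimum and maximum in hand, the range is a single subtraction. The following is a minimal sketch using the same sample; the variable names are only illustrative, and np.ptp ("peak to peak") is NumPy's shortcut for the same max-minus-min calculation:

```python
import numpy as np

heights = np.array([6.5, 6.0, 5.2, 5.0, 5.5, 5.6, 6.2, 5.4])

# Range = largest value minus smallest value
data_range = np.max(heights) - np.min(heights)

# np.ptp computes max - min in a single call
data_range_ptp = np.ptp(heights)

print(data_range, data_range_ptp)
```

Both values should agree, since they are computed from the same two extremes.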
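The Standard Deviation is the square root of the variance (see below).
The sample standard deviation is calculated by summing (x – the sample mean)^2 over every value x, dividing that sum by n – 1 (where n is the sample size), and taking the square root of the result:
s = sqrt( Σ (x – x̄)^2 / (n – 1) )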
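To make the formula concrete, here is a rough step-by-step sketch that computes the sample standard deviation by hand for the heights data. The variable names are assumptions chosen for readability, and the final line cross-checks the result against np.std with ddof=1, which selects the n – 1 divisor:

```python
import numpy as np

heights = np.array([6.5, 6.0, 5.2, 5.0, 5.5, 5.6, 6.2, 5.4])

n = heights.size                              # sample size
sample_mean = heights.mean()                  # the sample mean
squared_devs = (heights - sample_mean) ** 2   # (x - mean)^2 for each value

# Sum the squared deviations, divide by n - 1, then take the square root
sample_std = np.sqrt(squared_devs.sum() / (n - 1))

# Cross-check against NumPy's own calculation with the n - 1 divisor
matches_numpy = np.isclose(sample_std, np.std(heights, ddof=1))

print(sample_std, matches_numpy)
```

For this data the result should come out at roughly 0.515.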
In Python, using NumPy, we can calculate the standard deviation.
(Note: np.std does not divide by n-1 by default, but by N. In other words, it treats the data as the whole population, not as a sample):
np.std([6.5, 6.0, 5.2, 5.0, 5.5, 5.6, 6.2, 5.4]) # Population Std. Deviation
To get the sample standard deviation (dividing by (n-1) instead of N), pass the optional ddof parameter in the function call:
np.std([6.5, 6.0, 5.2, 5.0, 5.5, 5.6, 6.2, 5.4], ddof=1) # Sample Std. Deviation
Note: If you use R instead of Python, the sd() function calculates the Sample standard deviation (dividing by n-1), NOT the Population standard deviation, as the following comparison shows:
# Code from:
# https://stats.stackexchange.com/questions/25956/what-formula-is-used-for-standard-deviation-in-r
# sd in R
sd1 <- sd(x)

# self-written sd
sd2 <- sqrt(sum((x - mean(x))^2) / (n - 1))

# comparison
c(sd1, sd2) #:-)
To calculate the variance we first need to know the mean. Variance is the average squared distance from the mean: take the difference between each point and the mean, square it, sum the squares, and divide by the number of points (or by n-1 for a sample).
Var(X) = E[(X-μ)^2]
Remember that μ is the population mean.
Using Python to calculate the variance can be done with the NumPy library like so:
import numpy as np
np.var([6.5, 6.0, 5.2, 5.0, 5.5, 5.6, 6.2, 5.4]) # Population Variance
np.var([6.5, 6.0, 5.2, 5.0, 5.5, 5.6, 6.2, 5.4], ddof=1) # Sample Variance
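Since the standard deviation is the square root of the variance, the two NumPy functions should agree once a square root is applied. A small sketch to check this, with illustrative variable names:

```python
import numpy as np

heights = [6.5, 6.0, 5.2, 5.0, 5.5, 5.6, 6.2, 5.4]

pop_var = np.var(heights)              # divides by N
pop_std = np.std(heights)              # divides by N
sample_var = np.var(heights, ddof=1)   # divides by n - 1
sample_std = np.std(heights, ddof=1)   # divides by n - 1

# The square root of each variance should match the corresponding std
pop_agrees = np.isclose(np.sqrt(pop_var), pop_std)
sample_agrees = np.isclose(np.sqrt(sample_var), sample_std)

print(pop_agrees, sample_agrees)
```

The comparison only holds when the same ddof is used on both sides; mixing a population variance with a sample standard deviation will not match.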
Sample vs Population Variance
Sample variance is the sum of (x – sample mean)^2 divided by n – 1, where n is the sample size.
Population variance is the sum of (x – population mean)^2 divided by N, where N is the population size.
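The only difference between the two is the divisor, which the following sketch tries to show using plain Python. The variable names are illustrative, and the standard library's statistics.pvariance and statistics.variance are used purely as a cross-check:

```python
import math
import statistics

heights = [6.5, 6.0, 5.2, 5.0, 5.5, 5.6, 6.2, 5.4]

n = len(heights)
mean = sum(heights) / n

# Sum of squared distances from the mean (shared by both formulas)
sum_sq = sum((x - mean) ** 2 for x in heights)

population_variance = sum_sq / n        # divide by N
sample_variance = sum_sq / (n - 1)      # divide by n - 1

# Cross-check against the standard library
pop_ok = math.isclose(population_variance, statistics.pvariance(heights))
sample_ok = math.isclose(sample_variance, statistics.variance(heights))

print(population_variance, sample_variance, pop_ok, sample_ok)
```

Because n – 1 is smaller than n, the sample variance is always the larger of the two, here by a factor of 8/7.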