Building on the article below, I want to continue the discussion of right and left skewness, specifically how to calculate skewness in Python and how to visualize asymmetry.
With right (positive) skewness, the mean is typically larger than the median. The tail is longer towards the right, which is where the outliers are (on the right side of the graph).
Left (negative) skewness is the opposite. The tail is longer towards the left, which is where the outliers are, and the mean is typically smaller than the median.
No skewness is a state where the curve is balanced well and has equal tails on both sides.
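As a rough illustration of these three cases, the sketch below computes the mean, the median, and a moment-based skewness coefficient for three small made-up samples. The sample values and the helper name `moment_skewness` are assumptions for illustration and are not part of the suicide dataset used later.

```python
import statistics


def moment_skewness(values):
    """Moment-based (Fisher-Pearson) skewness: m3 / m2**1.5, no bias correction."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((x - mean) ** 2 for x in values) / n  # second central moment
    m3 = sum((x - mean) ** 3 for x in values) / n  # third central moment
    return m3 / m2 ** 1.5


# Made-up samples, chosen only to show the three shapes.
samples = {
    "right-skewed": [1, 2, 2, 3, 3, 4, 5, 40],
    "left-skewed": [-40, -5, -4, -3, -3, -2, -2, -1],
    "symmetric": [1, 2, 3, 4, 5, 6, 7],
}

for name, values in samples.items():
    print(
        name,
        "mean =", statistics.mean(values),
        "median =", statistics.median(values),
        "skew =", round(moment_skewness(values), 3),
    )
```

A positive coefficient goes with a tail to the right, a negative one with a tail to the left, and a value near zero with a roughly symmetric shape. The pandas skew method applies a sample-size correction, so its number will differ slightly from this uncorrected version.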
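import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import scipy.stats as stats

# groupby and sum will total the row counts by country.
df = pd.read_csv('/Users/bwarner/Downloads/master.csv').groupby('country').sum()

h = np.asarray(df['suicides_no'])
h = sorted(h)
# Normal curve with the same mean and standard deviation, for comparison.
fit = stats.norm.pdf(h, np.mean(h), np.std(h))
plt.plot(h, fit, '--', linewidth=2)
# density=True replaces normed=True, which newer Matplotlib releases no longer accept.
plt.hist(h, density=True, bins=100)
plt.show()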
Going back to the suicide data from previous posts, I created a numpy array (using asarray) of the dataframe column ‘suicides_no’ (which is a count of suicides per country.)
We have a lot of skewness here. Three main outliers on the right are pulling the tail of the plot to the right. The data is heavily right skewed.
# check for skewness:
print('Skew =', df['suicides_no'].skew())
# check for kurtosis (tail heaviness relative to a normal distribution):
print('Kurtosis =', df['suicides_no'].kurt())
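To check for skewness I used the skew method on this data frame column, which returned a value of 4.80: very skewed to the right.
Kurtosis is also checked above. It describes how heavy the tails of the distribution are compared with a normal distribution (it is often loosely described as peakedness). The pandas kurt method reports excess kurtosis, where a normal distribution scores roughly 0. Here the result is 24.84, so the data is far more heavy-tailed than a normal curve, which fits the outliers seen in the plot.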
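To make a kurtosis number easier to interpret, here is a minimal sketch of excess kurtosis computed from central moments. The helper `excess_kurtosis` and the two sample lists are assumptions for illustration. The pandas kurt method uses a bias-corrected estimator, so it will not match this uncorrected version exactly.

```python
def excess_kurtosis(values):
    """Uncorrected excess kurtosis: m4 / m2**2 - 3 (0 for a normal distribution)."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((x - mean) ** 2 for x in values) / n  # second central moment
    m4 = sum((x - mean) ** 4 for x in values) / n  # fourth central moment
    return m4 / m2 ** 2 - 3


flat = [1, 2, 3, 4, 5, 6, 7]              # evenly spread, no outliers
with_outlier = [1, 2, 2, 3, 3, 4, 5, 40]  # one extreme value in the right tail

print("flat:", excess_kurtosis(flat))  # -1.25
print("with outlier:", round(excess_kurtosis(with_outlier), 2))
```

The evenly spread sample comes out negative (lighter tails than a normal curve), while the single extreme value pushes the second sample well above zero.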
I mentioned variance in the other article linked at the top of this one. In this post I wanted to dig a bit deeper. I had previously neglected to offer the formula for population and sample variance.
Before getting to the formula, we can define variance as the dispersion of data around the mean.
Population Variance Formula
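In standard notation, the population variance is the average squared deviation from the mean. The symbols are assumptions of this write-up: $N$ is the population size, $x_i$ the individual values, and $\mu$ the population mean.

```latex
\sigma^2 = \frac{1}{N} \sum_{i=1}^{N} (x_i - \mu)^2
```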
To calculate this in Python, we make use of Numpy’s var method:
# By default, numpy variance assumes a population variance calculation:
np.var(df['suicides_no'])
In this case, the value returned is quite large:
Using another example (of a much smaller dataset), we can get a better idea of variance:
gdp = [45, 56, 29, 33, 50, 42, 28, 44, 50, 33, 39]
np.var(gdp).round(decimals=2)
>>> 77.97
The formula for sample variance is:
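Written in standard notation (assumed symbols: $n$ for the sample size, $x_i$ for the individual values, $\bar{x}$ for the sample mean), the sum of squared deviations is divided by $n - 1$ rather than $n$:

```latex
s^2 = \frac{1}{n - 1} \sum_{i=1}^{n} (x_i - \bar{x})^2
```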
In Python we calculate sample variance with Numpy as well. The call is the same, but we pass the ddof parameter with a value of 1:
# Sample variance is done by using the ddof parameter and setting it to the value of 1:
np.var(gdp, ddof=1)
>>> 85.76363636363637
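To show what ddof changes, the following sketch recomputes both variances by hand with plain Python. The helper name `variance` is an assumption for illustration; it only mirrors the idea of dividing by n - ddof and is not Numpy's implementation.

```python
import statistics

gdp = [45, 56, 29, 33, 50, 42, 28, 44, 50, 33, 39]


def variance(values, ddof=0):
    """Sum of squared deviations from the mean, divided by (n - ddof)."""
    n = len(values)
    mean = sum(values) / n
    squared_deviations = sum((x - mean) ** 2 for x in values)
    return squared_deviations / (n - ddof)


population_var = variance(gdp)      # divides by n, like np.var(gdp)
sample_var = variance(gdp, ddof=1)  # divides by n - 1, like np.var(gdp, ddof=1)

print(round(population_var, 2))  # 77.97
print(round(sample_var, 2))      # 85.76

# The standard library offers the same two calculations.
print(round(statistics.pvariance(gdp), 2), round(statistics.variance(gdp), 2))
```

With only 11 values, the n - 1 divisor makes a visible difference. On a large dataset the two results are nearly identical.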
Problems of Variance
On its own, variance can give us a gigantic number that is well beyond the scale we're working with, because it is expressed in squared units. Standard deviation is often a better measure of dispersion, since it is back in the same units as the data.
Like Variance, Standard Deviation has both a population and sample formula.
Population Standard Deviation
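For reference, in standard notation (assumed symbols: $N$ for the population size, $x_i$ for the individual values, $\mu$ for the population mean):

```latex
\sigma = \sqrt{\frac{1}{N} \sum_{i=1}^{N} (x_i - \mu)^2}
```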
The formula above is the square root of the population variance.
# Population Standard deviation:
np.std(gdp)
>>> 8.829889135700421
Sample Standard Deviation
Very similar to the above, we simply take the square root of the Sample variance:
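In standard notation (assumed symbols: $n$ for the sample size, $x_i$ for the individual values, $\bar{x}$ for the sample mean):

```latex
s = \sqrt{\frac{1}{n - 1} \sum_{i=1}^{n} (x_i - \bar{x})^2}
```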
# Sample Standard Deviation:
np.std(gdp, ddof=1)
>>> 9.260865853884093
The above Python (Numpy) std method will calculate the Sample Standard Deviation (note the ddof=1, which sets this to calculate the sample standard deviation.)
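As a quick cross-check, both standard deviations can be recovered as square roots of the matching variances. This is a minimal sketch using only the standard library, assuming the same gdp list as above.

```python
import math
import statistics

gdp = [45, 56, 29, 33, 50, 42, 28, 44, 50, 33, 39]

# Standard deviation is the square root of the matching variance.
population_std = math.sqrt(statistics.pvariance(gdp))  # like np.std(gdp)
sample_std = math.sqrt(statistics.variance(gdp))       # like np.std(gdp, ddof=1)

print(round(population_std, 4))  # 8.8299
print(round(sample_std, 4))      # 9.2609
```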
np.std(df['suicides_no'])
>>> 181256.89619224408
However, the value above is still pretty hard to manage. To make it even more useful we need to consider the Coefficient of Variation (CV.)
Coefficient of Variation (CV)
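In standard notation, the coefficient of variation is the standard deviation divided by the mean. The symbols are assumed here: $\sigma$ and $\mu$ for the population, $s$ and $\bar{x}$ for a sample.

```latex
\begin{aligned}
c_v &= \frac{\sigma}{\mu} && \text{(population)} \\
\hat{c}_v &= \frac{s}{\bar{x}} && \text{(sample)}
\end{aligned}
```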
The coefficient of variation has a population form and a sample form. In both, the standard deviation is divided by the mean.
This formula is also called the Relative Standard Deviation.
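The practical appeal of a relative measure is that it does not depend on the units. The sketch below illustrates this with the earlier gdp list and a hypothetical rescaled copy of it. The helper name `coefficient_of_variation` and the factor of 1000 are assumptions for illustration.

```python
import statistics

gdp = [45, 56, 29, 33, 50, 42, 28, 44, 50, 33, 39]


def coefficient_of_variation(values):
    """Population standard deviation divided by the mean."""
    return statistics.pstdev(values) / statistics.mean(values)


# Hypothetical rescaling: the same figures expressed in units 1000 times smaller.
gdp_rescaled = [x * 1000 for x in gdp]

# The standard deviation grows with the units...
print(round(statistics.pstdev(gdp), 2), round(statistics.pstdev(gdp_rescaled), 2))
# ...but the coefficient of variation stays the same.
print(round(coefficient_of_variation(gdp), 4))
print(round(coefficient_of_variation(gdp_rescaled), 4))
```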
gbp = np.array([76000.0, 85000.0, 80000.0, 92000.0, 59000.0], "float")
eur = np.multiply(gbp, 1.17)
print("GBP Prices: ", gbp)
print("EUR Prices: ", eur)
>>> GBP Prices: [76000. 85000. 80000. 92000. 59000.]
>>> EUR Prices: [ 88920. 99450. 93600. 107640. 69030.]
For illustration, I made two lists. The first holds the prices of 5 products sold in British pounds (GBP). The other holds the same products priced in euros. The ratio between the prices is a constant 1.17 (the exchange rate).
If we didn't know that the pricing was consistent (and we wondered whether price gouging might be hidden in the exchange rate), we might ask how these two data sets compare. How are they related? Could one price be more than the exchange rate implies?
combined_list = np.array([gbp, eur])
np.std(combined_list)
Combining the two lists into one array and taking the standard deviation gives us a number, but what does it mean? What does the output of 13772.067528152771 tell us?
A better approach is to get the coefficient of variation. SciPy's stats module has a function for this called variation (not to be confused with variance). By default it works along axis 0, so for the combined array it compares the two prices of each product and shows any relative change:
import scipy.stats as stats
stats.variation(combined_list)
>>> array([0.07834101, 0.07834101, 0.07834101, 0.07834101, 0.07834101])
The output is an array with one coefficient of variation per product, each comparing that product's GBP price with its EUR price. All five coefficients are identical, which tells us the exchange rate was applied consistently to every price.
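To see where 0.0783 comes from, the sketch below repeats the calculation by hand with the standard library. It assumes the defaults of SciPy's variation function (column by column, population standard deviation), which is consistent with the output above. The name `per_product_cv` is only illustrative.

```python
import statistics

gbp = [76000.0, 85000.0, 80000.0, 92000.0, 59000.0]
eur = [price * 1.17 for price in gbp]

# For each product, take the pair (GBP price, EUR price) and divide the
# population standard deviation of the pair by its mean.
per_product_cv = [
    statistics.pstdev(pair) / statistics.mean(pair)
    for pair in zip(gbp, eur)
]

print([round(cv, 8) for cv in per_product_cv])  # every entry is about 0.0783
```

For a pair (x, 1.17x) the mean is 1.085x and the population standard deviation is 0.085x, so the ratio is 0.085 / 1.085, roughly 0.0783, whatever the price.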
What if we had some weird divergence in the two prices?
gbp = np.array([76000.0, 85000.0, 80000.0, 92000.0, 59000.0], "float")
other = np.array([36000.0, 15000.0, 30000.0, 99000.0, 99000.0], "float")
bad_list = np.array([gbp, other])
stats.variation(bad_list)
>>> array([0.35714286, 0.7 , 0.45454545, 0.03664921, 0.25316456])
The output of this second calculation shows coefficients that differ from product to product: the first is 0.36, then 0.7, 0.455, and so on. Right away we can see that the two price lists are not related by a constant factor.