One way of checking the relationship between two variables is with covariance. It indicates whether the two variables tend to vary together, and in which direction.
A related post is that of Dispersion and Variance:
The formula for population covariance is:
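In standard textbook notation (the symbols here are my choice and may differ from other sources), with N observations and population means \mu_x and \mu_y, it can be written as:

```latex
\sigma_{xy} = \frac{1}{N} \sum_{i=1}^{N} (x_i - \mu_x)(y_i - \mu_y)
```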
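If the covariance value (between variables) is greater than 0, we get an idea that the variables move together (they are positively correlated).
If the covariance value is less than 0, then the two variables tend to move in opposite directions (they are negatively correlated).
If the covariance is equal to 0, then we say that the variables have no linear relationship. Note that this is weaker than independence: independent variables have a covariance of 0, but a covariance of 0 does not guarantee independence.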
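To make the three cases concrete, here is a minimal sketch that computes the population covariance by hand. The helper `population_cov` and the arrays are made up for illustration and are not part of the dataset used in this post. One caveat worth keeping in mind: the last array is completely determined by `x`, yet its covariance with `x` is 0, because covariance only picks up linear relationships.

```python
import numpy as np

def population_cov(x, y):
    """Population covariance: mean of the products of deviations (hypothetical helper)."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    return np.mean((x - x.mean()) * (y - y.mean()))

# Made-up data, purely for illustration
x = np.array([-2.0, -1.0, 0.0, 1.0, 2.0])
rising = 3 * x + 1     # moves with x
falling = -3 * x + 1   # moves against x
parabola = x ** 2      # determined by x, but not linearly

cov_rising = population_cov(x, rising)      # positive
cov_falling = population_cov(x, falling)    # negative
cov_parabola = population_cov(x, parabola)  # zero, despite the dependence

print(cov_rising, cov_falling, cov_parabola)

# np.cov uses the sample (N - 1) normalization by default;
# ddof=0 switches it to the population version computed above
print(np.cov(x, rising, ddof=0)[0, 1])
```

The last line is only a cross-check: with `ddof=0`, the off-diagonal entry of `np.cov` should agree with the hand-rolled value.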
import numpy as np
# Stack the two columns as rows so np.cov treats each one as a variable
x = np.array([df['suicides_no'], df['population']])
np.cov(x).round()
In the above case, I plotted the suicide counts of countries against their populations. As one would expect, the larger the population, the greater the number of suicides. The scatter plot suggests this, and so does the covariance value of 33182603041.0 (the off-diagonal entry of the matrix that np.cov returns). While it’s greater than 0, its magnitude depends on the units and scale of both variables, so the number on its own says little about how strong the relationship is.
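The scale problem is easy to reproduce. The sketch below uses invented height and weight measurements (not the suicide dataset) and shows that merely switching units multiplies the covariance, while dividing by both standard deviations gives a number that does not care about units.

```python
import numpy as np

# Hypothetical measurements, made up for illustration
height_m = np.array([1.60, 1.70, 1.75, 1.80, 1.90])   # metres
weight_kg = np.array([55.0, 65.0, 72.0, 80.0, 90.0])  # kilograms
height_cm = height_m * 100                             # same heights in centimetres

# np.cov returns a 2x2 matrix; [0, 1] is the covariance between the two inputs
cov_m = np.cov(height_m, weight_kg)[0, 1]
cov_cm = np.cov(height_cm, weight_kg)[0, 1]

# Dividing by both standard deviations (ddof=1 matches np.cov's default)
# removes the dependence on units
scaled_m = cov_m / (height_m.std(ddof=1) * weight_kg.std(ddof=1))
scaled_cm = cov_cm / (height_cm.std(ddof=1) * weight_kg.std(ddof=1))

print(cov_m, cov_cm)        # the second is 100 times the first
print(scaled_m, scaled_cm)  # equal, and between -1 and 1
```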
Just like the Coefficient of Variation mentioned in a previous post, there’s also a Correlation Coefficient. It rescales the covariance to a value from -1 to 1: values near 1 indicate a strong positive relationship, values near -1 a strong negative one, and values near 0 little or no linear relationship.
The above formula is modified for population and samples as follows:
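In the usual notation (again, the symbols are my choice), the population version divides the population covariance by the two population standard deviations, and the sample version does the same with the sample estimates, which use n - 1 in place of N:

```latex
\rho_{xy} = \frac{\sigma_{xy}}{\sigma_x \, \sigma_y}
\qquad
r_{xy} = \frac{s_{xy}}{s_x \, s_y},
\quad
s_{xy} = \frac{1}{n - 1} \sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})
```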
To calculate the Correlation Coefficient in Python, we make use of Numpy like so:
np.corrcoef(df['suicides_no'], y=df['population'])
>>> array([[1.        , 0.83806479],
           [0.83806479, 1.        ]])
NumPy produces a correlation coefficient matrix, which is a bit more than I need. The 1s on the diagonal are each variable against itself (suicides to suicides, population to population), which obviously correlate perfectly. The off-diagonal value of 0.83806479 is the meat of the test: the correlation between the number of suicides and the population size. As it’s close to 1, it is a strong positive correlation.
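If only the single coefficient is wanted, it can be picked out of the matrix by index. A minimal sketch, using two made-up arrays in place of the DataFrame columns:

```python
import numpy as np

# Made-up values standing in for two DataFrame columns
a = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
b = np.array([2.0, 4.1, 5.9, 8.2, 9.8])

matrix = np.corrcoef(a, b)  # 2x2 and symmetric, with 1s on the diagonal
r = matrix[0, 1]            # correlation between a and b

print(matrix)
print(r)
```

Since the matrix is symmetric, `matrix[1, 0]` holds the same value.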
As an example of weakly related variables, let’s consider population vs. GDP per capita. In other words: the more population a country has, does it have more GDP per capita?
np.corrcoef(df['population'], df['gdp_per_capita ($)'])
>>> array([[1.        , 0.20428262],
           [0.20428262, 1.        ]])
Remember, the important part of that matrix is the off-diagonal value, in this case 0.20428262. This is fairly close to 0, so we’d say there is only a weak positive linear relationship between the two variables.
Pandas Correlation Coefficient
NumPy gives us a matrix result, which I haven’t yet seen the usefulness of. We get the answer, but also some other data that doesn’t seem too useful.
Pandas offers its own method called “corr”, and it produces a single result:
df['population'].corr(df['suicides_no']).round(decimals=2)
>>> 0.84
This makes more sense for me… perhaps there’s a use case for the numpy matrix output, but for my uses Pandas offers a more direct answer.
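One possible use case for the matrix layout is comparing more than two columns at once. The sketch below uses a small invented frame (`df_demo` and its numbers are hypothetical; only the column names echo the ones above) to show that `Series.corr` and the off-diagonal of `np.corrcoef` agree, and that `DataFrame.corr` returns every pairwise coefficient in one go.

```python
import numpy as np
import pandas as pd

# Small made-up frame; the numbers are invented for illustration
df_demo = pd.DataFrame({
    "population": [1_000, 5_000, 20_000, 50_000, 120_000],
    "suicides_no": [2, 7, 30, 60, 170],
    "gdp_per_capita": [30_000, 12_000, 45_000, 8_000, 25_000],
})

# Series.corr gives one number (Pearson by default)
single = df_demo["population"].corr(df_demo["suicides_no"])

# The same number sits off the diagonal of NumPy's matrix
from_numpy = np.corrcoef(df_demo["population"], df_demo["suicides_no"])[0, 1]

# DataFrame.corr compares every pair of columns at once,
# which is where a matrix layout becomes convenient
all_pairs = df_demo.corr()
print(all_pairs.round(2))
```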