import pandas as pd

car_df = pd.DataFrame({
    'Sold': [10, 5, 30, 23, 45],
    'Car': ['Audi i8', 'BMW i8', 'Chevy Corvette', 'Dodge Viper', 'Maserati GT'],
    'Top Speed': [180, 183, 145, 165, 158],
    'Price': [138, 180, 68, 120, 138],  # in thousands
    'MPG': [15, 18, 8, 12, 18],  # stored as numbers so the column can be plotted
    'Deaths per Year': [15, 22, 200, 100, 20]
})

In the above code, I’ve got a data frame of some fake car data. I have five cars, with column values for Sold, Top Speed, Price, MPG and Deaths per Year.

To plot this as a bar graph, we can combine the column data into one plot using Pandas:

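top_speed = car_df['Top Speed']
price = car_df['Price']
# .to_numpy() stops pandas from re-aligning the values against the new car-name index (which would yield NaN)
new_df = pd.DataFrame({'Top Speed': top_speed.to_numpy(), 'Price (in thousands)': price.to_numpy()}, index=['Audi i8', 'BMW i8', 'Chevy Corvette', 'Dodge Viper', 'Maserati GT'])
# Plot every numeric column of car_df; the x ticks are the row positions 0 to 4
ax = car_df.plot.bar(rot=0)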

I’ve not labeled the tick marks; for simplicity’s sake I just wanted to show a simple bar graph. Tick 0 is the first car in my data, the Audi i8; the BMW i8 is next, and so on.
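If labeled ticks are wanted, one option is to move the Car column into the index before plotting, since pandas uses the index for the bar labels. The snippet below is only a sketch of that idea: the name labeled_df is my own, and the Agg backend line is an assumption so the code runs without a display (drop it in a notebook).

```python
import matplotlib
matplotlib.use("Agg")  # assumption: headless backend; remove when plotting interactively
import pandas as pd

# Same fake car data as above, trimmed to the two columns being compared
car_df = pd.DataFrame({
    "Car": ["Audi i8", "BMW i8", "Chevy Corvette", "Dodge Viper", "Maserati GT"],
    "Top Speed": [180, 183, 145, 165, 158],
    "Price": [138, 180, 68, 120, 138],
})

# labeled_df is a hypothetical name: the car names become the index,
# and pandas draws the index values as the x tick labels
labeled_df = car_df.set_index("Car").rename(columns={"Price": "Price (in thousands)"})
ax = labeled_df.plot.bar(rot=0)
ax.set_ylabel("Value")
```

With rot=0 the car names are drawn horizontally; a larger rot value may read better if the names overlap.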

Since price is expressed in thousands, it falls in roughly the same range as top speed, so we can plot both variables together. In fact, I could put all the variables in there, as they are all on a similar scale. This way we can easily see the number of exotic cars sold at this dealership, the top speed per car, the price per car and the deaths associated with each car, per year.

Scatter Plots

Scatter plots are a great way to see how individual data points are spread across two variables. This is especially useful for linear regression and machine learning models.

We can quickly see if there is a correlation between two variables by seeing which direction the scattered data moves.
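The direction seen by eye can also be checked with a number. As a rough sketch, Series.corr returns the Pearson correlation coefficient between two columns; here it is applied to the fake car data from earlier rather than to real measurements, and the variable name r is my own.

```python
import pandas as pd

# Fake car data from the bar graph example
car_df = pd.DataFrame({
    "Top Speed": [180, 183, 145, 165, 158],
    "Price": [138, 180, 68, 120, 138],
})

# Pearson correlation: close to +1 means the points rise together,
# close to -1 means one falls as the other rises, near 0 means no linear trend
r = car_df["Top Speed"].corr(car_df["Price"])
print(f"Correlation between top speed and price: {r:.2f}")
```

A positive value matches a scatter that climbs from lower left to upper right. With only five made-up rows this says nothing about real cars; it only shows the call.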

import pandas as pd
import matplotlib.pyplot as plt

# groupby and sum total the numeric columns per country.
df = pd.read_csv('/Users/bwarner/Downloads/master.csv').groupby('country').sum(numeric_only=True)
suicide_count = df['suicides_no']
p = df['population']
plt.scatter(p, suicide_count, color='r')
plt.xlabel('Population Size')
plt.ylabel('Number of Suicides')
plt.show()

In the Python above, I’m loading some data on suicides around the world. In this case I’m checking whether the number of suicides correlates with population size. That is, if the population is larger, do we get more suicides? It would seem that we should, but what does the data say?

Running the above scatter plot does show a correlation between population size and the number of suicides:

It also shows a cluster forming. Suicides seem to stay in a low range until a certain population size is exceeded; past that point, a move of one tick on the population axis (3 to 4) comes with a leap in the number of suicides.

Filtering Data with Pandas

Looking at that suicide data a bit closer, I should have also filtered by year. The plot above shows cumulative suicides from 1987 to 2016. While we could compare suicides by year, I wanted to show how we could filter a dataframe for specific data:

df = pd.read_csv("/Users/bwarner/Downloads/master.csv")
df_2016 = df[df['year'] == 2016].groupby(['year','country']).sum()
df_2016.head()

The notation df[df['year'] == 2016] tells pandas to filter the data frame for the column year, where year is 2016. Now our data frame will ONLY have results for the year 2016.
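To see what happens inside the brackets, it can help to split the filter into two steps. This sketch uses the fake car data from earlier instead of the suicide CSV; the names mask and fast_cars and the 160 cutoff are my own choices for illustration.

```python
import pandas as pd

# Fake car data from the bar graph example
car_df = pd.DataFrame({
    "Car": ["Audi i8", "BMW i8", "Chevy Corvette", "Dodge Viper", "Maserati GT"],
    "Top Speed": [180, 183, 145, 165, 158],
})

# Step 1: the comparison produces one True/False value per row
mask = car_df["Top Speed"] > 160
print(mask.tolist())  # [True, True, False, True, False]

# Step 2: indexing with that boolean Series keeps only the True rows
fast_cars = car_df[mask]
print(fast_cars)
```

Writing car_df[car_df["Top Speed"] > 160] on one line does the same thing; the inner expression is just the mask.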

Appending .groupby(['year','country']).sum() aggregates the data. Technically I don’t have to include year, because we’re only looking at one year (2016), but it gives insight on how to group multiple columns in Pandas.
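Grouping by two columns is easier to picture on a tiny table. The frame below is a stand-in I made up so the snippet runs on its own: the column names mirror the ones used above, but the countries and every number are placeholders, not values from master.csv.

```python
import pandas as pd

# Placeholder rows (invented for illustration, NOT real data)
df = pd.DataFrame({
    "country": ["Country A", "Country A", "Country A", "Country B", "Country B"],
    "year": [2015, 2016, 2016, 2016, 2016],
    "sex": ["male", "male", "female", "male", "female"],
    "suicides_no": [10, 12, 4, 7, 3],
    "population": [1000, 1100, 1150, 500, 520],
})

# Keep the 2016 rows, then total the two numeric columns per (year, country) pair
df_2016 = (
    df[df["year"] == 2016]
    .groupby(["year", "country"])[["suicides_no", "population"]]
    .sum()
)
print(df_2016)
```

The result is indexed by both grouping columns, so a single row is looked up with a tuple such as df_2016.loc[(2016, "Country A")].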

The resulting scatter plot will now look a bit different:

For comparison purposes, I ran the same scatter plot on a few different years (below):
