In a previous post, Frequency Distribution tables were briefly discussed.
In this post, I’m going to quickly show how to build a cross table in Python Pandas.
Unlike a spreadsheet where each row is an observation, a cross tab (or cross table) gives a tally of multiple variables. For example, look at the data loaded in from a Kaggle Spreadsheet on Mall Customers.
Simple Cross Table
We can see in the first few rows, that this is a normal data table. But what if we wanted to know a tally of customers by age and gender?
In Python Pandas we can do that using the crosstab method:
Row and Column Tally
The simple cross table above shows us how many 18 year old customers are male and how many are female. We can scroll through the age groups and find the appropriate gender count.
We can also collect the total customers (female + male) by simply adding a parameter of margins set to True:
pd.crosstab(df['Age'], df['Gender'], margins=True)
The output of the above is quite long (as age is not grouped into intervals), so I’ll cut the top and bottom portions to show the tally’s:
Now we have added a column and row that shows the total (called “All”) results. Line by line we can now see how each age has X Female customers, Y Male customers and the total of customers for that age.
We also get a total of all Female and all Male customers by checking the row totals at the very bottom. In this case 112 Female customers and 88 Male, and a total of 200.
More Variables in the Crosstab
More variables can be added. There’s a column in this data called “Spending Score” and it ranges from 1 to 100. If we add a Row of entry of Spending Score and Age, we can also get more insights:
pd.crosstab([df['Spending Score (1-100)'],df['Age']], df['Gender'], margins=True).head()
In spending score group 1, we have two ages: 34 and 37 – which are made up of only men.