Descriptive Statistics: Frequency Distribution Table

A frequency distribution table is much like a spreadsheet: it represents one-way data in table form.

Figure 1

Figure 1 above shows a frequency distribution table. Specifically, it shows a tally of suicides per year by country. Each row is an observation and each column is a feature or variable.

In this case, the frequency of suicides is what’s being recorded. The other columns hold supporting data, such as the country, the gender, and the year of the suicides.

Intervals

Intervals are like bins; in fact, some software libraries call them bins. A bin, or interval, is a collection of similar data. Imagine the data above had discrete values in the age column, such as 24, 45, 62 and so on.

Intervals could be created to lump that data into groups or bins, for example: 1-20, 21-40, 41-60 and so on.

Manual Calculation of Intervals

The formula for the width of each interval is below:

(MAX value – MIN value) / Desired Intervals

For example, if we had ages ranging from 1 to 100 and wanted 5 intervals (or bins), we would calculate the width of each interval like so:

(100 – 1) / 5 = 19.8
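The same arithmetic can be sketched in Python. This is a minimal illustration, not part of the original walkthrough: the variable names are my own, and np.linspace is used here only as a convenient way to list the resulting bin edges.

```python
import numpy as np

# Hypothetical names for the example above: ages from 1 to 100, 5 intervals
min_value, max_value, n_intervals = 1, 100, 5

# (MAX value - MIN value) / Desired Intervals
width = (max_value - min_value) / n_intervals
print(width)

# Equal-width edges: n_intervals bins need n_intervals + 1 edges
edges = np.linspace(min_value, max_value, n_intervals + 1)
print(edges)
```

The width comes out at 19.8, so the edges fall at roughly 1, 20.8, 40.6, 60.4, 80.2 and 100.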

NumPy Intervals

We can also let software count the values in each interval for us. In the Python library NumPy, the histogram function takes our data and a list of bin edges, and returns how many values fall into each bin:

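import numpy as np

# Sample values to bucket
my_range = np.array([1,4,9,22,24,26,28,33,35,43,44,48,51,52,55,59,68,69,79,80,81,83,85,93,97,100])
# Six edges define five bins
bins = [1,20,40,60,80,100]
# np.histogram returns (counts, bin_edges); only the counts are needed here
hist, _ = np.histogram(my_range, bins=bins)
print(hist)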

>>> [3 6 7 3 7]

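In the above output, we get 5 bins. I manually set my bin edges here as 1, 20, 40, 60, 80 and 100. Although it’s 6 values, really I’m getting 5 buckets. NumPy includes the left edge of each bucket and excludes the right edge, except for the last bucket, which includes both. So the buckets are 1 to just under 20, 20 to just under 40, 40 to just under 60, 60 to just under 80, and 80 to 100. The array of values is then bucketed into the appropriate bins.

So the output of [3 6 7 3 7] explains that there are 3 values from 1 up to (but not including) 20, 6 values from 20 up to 40, and so on. Note that the value 80 lands in the last bucket, not the fourth.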
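If we would rather not pick the edges ourselves, np.histogram also accepts an integer for bins and then works out equal-width edges between the minimum and maximum of the data. A minimal sketch with the same values (the names counts and edges are my own):

```python
import numpy as np

my_range = np.array([1, 4, 9, 22, 24, 26, 28, 33, 35, 43, 44, 48, 51, 52,
                     55, 59, 68, 69, 79, 80, 81, 83, 85, 93, 97, 100])

# An integer asks NumPy for that many equal-width bins between min and max
counts, edges = np.histogram(my_range, bins=5)

print(counts)  # [3 6 7 4 6]
print(edges)   # six edges, roughly 1, 20.8, 40.6, 60.4, 80.2 and 100
```

The counts differ slightly from the hand-picked edges because the fourth bin now runs up to about 80.2, so the value 80 is counted there.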

Pandas Intervals

Similarly to the NumPy bins example, we can create intervals with Pandas, using the following syntax:

import pandas as pd
s = pd.Series(my_range)
pc = pd.cut(s, 5).value_counts().sort_index()
pc

>>> (0.901, 20.8]    3
>>> (20.8, 40.6]     6
>>> (40.6, 60.4]     7
>>> (60.4, 80.2]     4
>>> (80.2, 100.0]    6
>>>  dtype: int64

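The difference between the Pandas version and the NumPy one is that here we pass a number of bins (5) and let Pandas work out equal-width edges, rather than explicitly specifying the edges. Pandas intervals also include their right edge by default, and the lowest edge is nudged slightly below the minimum (0.901 rather than 1) so that the value 1 is still included. That right-edge rule is why 80 is counted in the fourth bin here (4 values) rather than the fifth.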
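pd.cut also accepts an explicit list of edges, much like the NumPy example. The sketch below assumes edges of 0, 20, 40, 60, 80 and 100, which are my own choice: because Pandas intervals include the right edge by default, these edges correspond to the groups 1-20, 21-40, 41-60 and so on for whole-number data.

```python
import pandas as pd

s = pd.Series([1, 4, 9, 22, 24, 26, 28, 33, 35, 43, 44, 48, 51, 52,
               55, 59, 68, 69, 79, 80, 81, 83, 85, 93, 97, 100])

# Assumed edges: each interval is open on the left and closed on the right,
# e.g. (0, 20] holds every value greater than 0 and up to and including 20
edges = [0, 20, 40, 60, 80, 100]

counts = pd.cut(s, bins=edges).value_counts().sort_index()
print(counts)
```

With these edges the counts are 3, 6, 7, 4 and 6, with the value 80 falling in the (60, 80] interval.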

Relative Frequency

So far I’ve shown the frequency of an event, but it’s often useful to get the relative frequency as well. This is the frequency of one outcome as a share of all occurrences. Together, the relative frequencies total 100%.

To calculate this manually, you use the formula:

Frequency / Total Frequency
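As a rough sketch of that formula, here it is applied to the counts from the NumPy example, [3 6 7 3 7]. The variable names are my own.

```python
import numpy as np

# Counts per bin, taken from the NumPy histogram output earlier
counts = np.array([3, 6, 7, 3, 7])

# Frequency / Total Frequency, applied to every bin at once
relative = counts / counts.sum()

print(relative.round(3))  # [0.115 0.231 0.269 0.115 0.269]
print(relative.sum())     # the shares add up to 1, i.e. 100%
```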

Pandas Relative Frequency

We can do the same thing automatically in Pandas, using the value_counts method:

pd.cut(s, 5).value_counts(normalize=True).sort_index()

Repeating the previous cut method, we chain the value_counts method and pass in the parameter normalize set to True. This will generate output like so:

>>> (0.901, 20.8]    0.115385
>>> (20.8, 40.6]     0.230769
>>> (40.6, 60.4]     0.269231
>>> (60.4, 80.2]     0.153846
>>> (80.2, 100.0]    0.230769
>>>  dtype: float64

At a glance we can see that the 40.6-60.4 interval has the largest collection of values (about 27%).
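Rather than reading the largest interval off by eye, we could ask Pandas for it. This is a small sketch using the idxmax and max methods of a Series; the names rel, top_interval and top_share are my own.

```python
import pandas as pd

s = pd.Series([1, 4, 9, 22, 24, 26, 28, 33, 35, 43, 44, 48, 51, 52,
               55, 59, 68, 69, 79, 80, 81, 83, 85, 93, 97, 100])

# Relative frequency per interval, as in the example above
rel = pd.cut(s, 5).value_counts(normalize=True).sort_index()

top_interval = rel.idxmax()  # the interval with the highest share
top_share = rel.max()        # that share, as a fraction of 1

print(top_interval, round(top_share * 100, 1))
```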
