A frequency distribution table is much like a spreadsheet. It represents one-way data in table form.

Figure 1 above shows a frequency distribution table. Specifically, it shows a tally of suicides per year by country. Each row is an observation and each column is a feature or variable.

In this case, the frequency of suicides is what's being recorded. Alongside it we have other variables, such as the country, gender, and year of the suicides.
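As a rough sketch of how such a tally is built from raw observations, the snippet below counts how often each value occurs. The values are made up purely for illustration (they are not the data in Figure 1), and `observations` and `freq_table` are names I've chosen for this example.

```python
import pandas as pd

# Made-up categorical observations, purely for illustration
# (these are NOT the values from Figure 1).
observations = pd.Series(["A", "B", "A", "C", "B", "A"])

# value_counts tallies how often each distinct value occurs;
# sort_index orders the table by category rather than by count.
freq_table = observations.value_counts().sort_index()
print(freq_table)
```

Each row of `freq_table` pairs a category with its frequency, which is the basic shape of a one-way frequency table.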

## Intervals

Intervals are like bins. In fact, they're called bins in some software libraries. A bin, or interval, is a range that groups similar values together. Imagine the data above had a discrete age column, with values like 24, 45, 62 and so on.

Intervals could be created to lump those values into groups or bins, for example: 1-20, 21-40, 41-60 and so on.
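To make the idea concrete, here is a minimal sketch that assigns the three example ages to labelled bins with Pandas' `cut` function (covered in more detail further down). The `edges` and `labels` lists are my own illustrative choices, including a fourth 61-80 bin that follows the same pattern.

```python
import pandas as pd

ages = pd.Series([24, 45, 62])  # the example ages mentioned above

# Pandas bins include their right edge by default: (0, 20], (20, 40], ...
# so for whole-number ages they behave like 1-20, 21-40, and so on.
edges = [0, 20, 40, 60, 80]
labels = ["1-20", "21-40", "41-60", "61-80"]

binned = pd.cut(ages, bins=edges, labels=labels)
print(binned.tolist())
```

Each age is replaced by the label of the interval it falls into, which is what "lumping data into bins" means in practice.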

### Manual Calculation of Intervals

The formula for the width of each interval is below:

Interval Width = (MAX value – MIN value) / Desired Intervals

For example, if we had an age range with values from 1 to 100, and we wanted 5 intervals (or bins), each interval would be (100 – 1) / 5 = 19.8 wide.
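The same arithmetic can be sketched in plain Python. The variable names here are my own, and the list of edges is simply the minimum plus successive multiples of the width.

```python
min_value = 1
max_value = 100
desired_intervals = 5

# (MAX value - MIN value) / Desired Intervals
width = (max_value - min_value) / desired_intervals
print(width)  # 19.8

# Edges start at the minimum and step up by one width at a time
edges = [min_value + i * width for i in range(desired_intervals + 1)]
print([round(e, 1) for e in edges])
```

Five intervals need six edges, which is why the list comprehension runs over `desired_intervals + 1` values.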

### NumPy Intervals

We can also let a library do the counting for us. In the Python library NumPy, the `histogram` function sorts our values into bins and counts them. Here the bin edges are passed in explicitly:

```python
import numpy as np

# Sample values to be counted into bins
my_range = np.array([1, 4, 9, 22, 24, 26, 28, 33, 35, 43, 44, 48, 51, 52, 55, 59,
                     68, 69, 79, 80, 81, 83, 85, 93, 97, 100])
# Six edges define five bins
bins = [1, 20, 40, 60, 80, 100]
hist, _ = np.histogram(my_range, bins=bins)
print(hist)

# Output: [3 6 7 3 7]
```

In the above output, we get 5 bins. I manually set my bin edges here as 1, 20, 40, 60, 80 and 100. Although that's 6 values, they define 5 buckets. Note that NumPy treats every bin except the last as half-open, so the buckets are really [1, 20), [20, 40), [40, 60), [60, 80) and [80, 100]. A value that sits exactly on an edge, such as 80, is counted in the bin that starts there.

So the output of [3 6 7 3 7] explains that there are 3 values from 1 up to (but not including) 20, 6 values from 20 up to 40, and so on.
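If you would rather have NumPy size the bins, `np.histogram` also accepts an integer for `bins`. The sketch below reuses the same data and asks for 5 equal-width bins; it also keeps the second return value (the computed edges) in a variable I've called `edges`.

```python
import numpy as np

my_range = np.array([1, 4, 9, 22, 24, 26, 28, 33, 35, 43, 44, 48, 51, 52, 55, 59,
                     68, 69, 79, 80, 81, 83, 85, 93, 97, 100])

# Passing an integer asks NumPy for that many equal-width bins
# spanning the minimum to the maximum of the data.
hist, edges = np.histogram(my_range, bins=5)
print(hist)
print(edges)
```

The edges now fall at multiples of 19.8 above the minimum rather than at round numbers, so the value 80 moves into the fourth bin and the counts differ slightly from the manual version.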

### Pandas Intervals

As in the NumPy bins example, we can create intervals with Pandas, using the following syntax:

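```python
import pandas as pd

# my_range is the NumPy array defined in the previous example
s = pd.Series(my_range)
# Split the values into 5 equal-width bins and count each bin
pc = pd.cut(s, 5).value_counts().sort_index()
print(pc)

# Output:
# (0.901, 20.8]    3
# (20.8, 40.6]     6
# (40.6, 60.4]     7
# (60.4, 80.2]     4
# (80.2, 100.0]    6
# dtype: int64
```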

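The difference between the Pandas version and the NumPy one is that here we pass the number of bins (5) and let Pandas work out equal-width edges, rather than specifying the edges explicitly. Pandas intervals are also closed on the right by default, and the lowest edge is nudged slightly below the minimum (0.901 rather than 1) so that the minimum value is included. Because the edges differ (80.2 rather than 80), the value 80 lands in the fourth bin here, giving counts of 4 and 6 where the NumPy example gave 3 and 7.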
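Pandas can also take explicit edges, much like the NumPy example. The following is a minimal sketch; the choice of 0 as the lowest edge is my own, made so that the right-inclusive intervals line up with the 1-20, 21-40, ... buckets for whole-number values.

```python
import pandas as pd

s = pd.Series([1, 4, 9, 22, 24, 26, 28, 33, 35, 43, 44, 48, 51, 52, 55, 59,
               68, 69, 79, 80, 81, 83, 85, 93, 97, 100])

# Explicit edges; Pandas bins include their right edge by default,
# so these are (0, 20], (20, 40], (40, 60], (60, 80], (80, 100].
counts = pd.cut(s, bins=[0, 20, 40, 60, 80, 100]).value_counts().sort_index()
print(counts)
```

With these edges the value 80 is counted in the (60, 80] interval, whereas NumPy's half-open bins place it in the bin starting at 80.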

## Relative Frequency

So far I've shown the frequency of an event, but it's often useful to get the relative frequency as well. This is each frequency expressed as a share of the total number of observations, so together the relative frequencies total 100%.

To calculate this manually, you use the formula:

Frequency / Total Frequency
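As a quick illustration of the formula, the sketch below applies it by hand to the counts from the NumPy example earlier. The variable names are my own.

```python
frequencies = [3, 6, 7, 3, 7]  # counts from the NumPy example above
total = sum(frequencies)       # 26 observations in all

# Frequency / Total Frequency, for each bin
relative = [f / total for f in frequencies]
print([round(r, 3) for r in relative])
```

Summing the values in `relative` gives 1 (allowing for floating-point rounding), which is the "total 100%" property described above.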

### Pandas Relative Frequency

We can do the same thing automatically in Pandas, using the value_counts method:

``pd.cut(s, 5).value_counts(normalize=True).sort_index()``

Repeating the previous cut method, we chain the value_counts method and pass in the parameter normalize set to True. This will generate output like so:

```
(0.901, 20.8]    0.115385
(20.8, 40.6]     0.230769
(40.6, 60.4]     0.269231
(60.4, 80.2]     0.153846
(80.2, 100.0]    0.230769
dtype: float64
```

At a glance we can see that the 40.6-60.4 interval holds the largest share of values (about 27%).
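For longer tables, the same reading can be done in code rather than by eye. This is a small sketch using the standard `idxmax`, `max` and `sum` methods on a Pandas Series; `rel` and `top_bin` are names I've chosen for the example.

```python
import pandas as pd

s = pd.Series([1, 4, 9, 22, 24, 26, 28, 33, 35, 43, 44, 48, 51, 52, 55, 59,
               68, 69, 79, 80, 81, 83, 85, 93, 97, 100])
rel = pd.cut(s, 5).value_counts(normalize=True).sort_index()

# idxmax returns the interval with the highest relative frequency
top_bin = rel.idxmax()
print(top_bin, round(rel.max(), 3))

# The relative frequencies should add up to 1 (i.e. 100%)
print(rel.sum())
```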
