In taking Sandeep Kumar's Udemy course on Data Analysis with R, I took one of his quizzes. It's a simple question, but it gets a person thinking and running various tests in RStudio with R.
Explaining R is beyond the scope of this short post. At a high level, R is a language designed for statistics and data analysis. Unlike Python, R is not a general-purpose language; it is a niche tool, though it does have libraries that can export data into dashboards and web applications.
The question asked by the teacher was the following:
Question 1: For your perfume bottle filling company, you are buying a machine from another manufacturer. You are offered two machines and must select one.
Before finally deciding on a machine, you pick 10 samples from each machine and note down the volume.
If this were the only consideration in selecting a machine, which of the two machines would you select?
Machine 1: 151.2, 150.5, 149.2, 147.5, 152.9, 152.0, 151.3, 149.7, 149.4, 150.7
Machine 2: 151.9, 151.4, 150.3, 151.2, 151.0, 150.2, 151.2, 151.4, 150.4, 151.7
https://www.udemy.com/statistics-using-r/learn/lecture/8013602#questions
While this isn’t “real world” in scope it does give a student some exposure to thinking about how to approach solving a problem with statistics. As a side note, you can solve this in any statistical package/library/software of your choice. I’ll be using R.
Thinking About the Data
Looking at the question, we have two samples from two different machines. We want to make an educated guess about which machine might be the best, based on how much volume it dispenses (higher values being better).
The first idea that comes to mind is to check the Mean. The Mean gives us an indicator of the average volume each machine produces. But we must keep in mind that the Mean alone will not suffice: outliers can skew it, making it a misleading summary on its own.
> machine1 = c(151.2, 150.5, 149.2, 147.5, 152.9, 152.0, 151.3, 149.7, 149.4, 150.7)
> machine2 = c(151.9, 151.4, 150.3, 151.2, 151.0, 150.2, 151.2, 151.4, 150.4, 151.7)
> mean(machine1)
[1] 150.44
> mean(machine2)
[1] 151.07
Right off the top, R tells us that the mean of these two small datasets is 150.44 volume for Machine #1 and 151.07 for Machine #2.
Machine #2 appears to have a higher average (making it appear better so far.)
The Median will show the middle point of the data:
> machine1_sorted = sort(machine1)
> machine2_sorted = sort(machine2)
> median(machine1_sorted)
[1] 150.6
> median(machine2_sorted)
[1] 151.2
> median(machine1)
[1] 150.6
> median(machine2)
[1] 151.2
I wasn't sure whether R requires the data sets to be sorted first, so I tried it both ways. The result was the same regardless. Conceptually, finding the Median requires a sorted list, but the median() function handles the sorting internally, so there is no need to call sort() yourself.
Machine #2 appears as the winner yet again.
A useful tool is range(), which quickly returns the minimum and maximum values of the data:
> range(machine1)
[1] 147.5 152.9
> range(machine2)
[1] 150.2 151.9
Range simply gives us the extremes of the values. Machine #1 has both a higher maximum and a lower minimum, meaning its values are more spread out.
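To turn those two endpoints into a single spread number, you can subtract the minimum from the maximum; diff() does that in one call. A small sketch, reusing the same data defined earlier:

```r
# Same sample data as above
machine1 = c(151.2, 150.5, 149.2, 147.5, 152.9, 152.0, 151.3, 149.7, 149.4, 150.7)
machine2 = c(151.9, 151.4, 150.3, 151.2, 151.0, 150.2, 151.2, 151.4, 150.4, 151.7)

# diff() on the two-element result of range() gives max minus min
diff(range(machine1))  # 152.9 - 147.5 = 5.4
diff(range(machine2))  # 151.9 - 150.2 = 1.7
```

Machine #1's values span a window more than three times wider than Machine #2's.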
R provides a way to get quantile values from a data set. Quantiles are a measure of where the data congregates: like a histogram, they show us where the data sits. This is more informative than the range function.
Where range just shows the lowest and highest values in the data, quantile shows those same extremes (0% and 100%) but also the values at the 25%, 50%, and 75% quantiles.
> quantile(machine1)
     0%     25%     50%     75%    100%
147.500 149.475 150.600 151.275 152.900
> quantile(machine2)
    0%    25%    50%    75%   100%
150.20 150.55 151.20 151.40 151.90
Interquartile Range (IQR)
The IQR (interquartile range) gives us the range where the middle 50% of the data sits. The reason this is important is outliers. Imagine you have data on the price of hotdogs in town. 50% of the prices might range from $5.00 to $8.00, but perhaps one guy is selling hotdogs at $0.50, and some fancy restaurant sells them at $18.50. Those outliers are beyond the normal and skew the data considerably. They might be important to consider, but they might also be out of scope for a given analysis.
The interquartile range helps define where the bulk of the data (not the outliers) is sitting: the middle 50% of the data set, centered on the median. To calculate it manually, subtract the 25% quantile from the 75% quantile.
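That manual calculation can be done directly from quantile()'s output before reaching for the built-in function; a sketch using the machine1 data from above:

```r
machine1 = c(151.2, 150.5, 149.2, 147.5, 152.9, 152.0, 151.3, 149.7, 149.4, 150.7)

# quantile() returns a named vector: 0%, 25%, 50%, 75%, 100%
q = quantile(machine1)

# IQR = 75th percentile minus 25th percentile
iqr_manual = unname(q["75%"] - q["25%"])
iqr_manual  # 151.275 - 149.475 = 1.8
```

This matches the value that IQR(machine1) returns below.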
In R (and Python) there are methods to calculate this automatically:
> IQR(machine1)
[1] 1.8
> IQR(machine2)
[1] 0.85
Machine #2's IQR of 0.85 means the middle 50% of its values sit in a much tighter band around the center. Machine #1's IQR of 1.8 indicates a larger spread, which matches what we saw with range(): a lower minimum and a higher maximum.
More precise than the IQR is the standard deviation, which is discussed elsewhere on this blog. It is a formula that measures the spread of the data around the mean: the lower the value, the more tightly the data clusters around the mean.
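As a sanity check, the sample standard deviation can be computed by hand: take each value's deviation from the mean, square it, average the squares over n − 1, and take the square root. This sketch should reproduce what R's built-in sd() reports:

```r
machine1 = c(151.2, 150.5, 149.2, 147.5, 152.9, 152.0, 151.3, 149.7, 149.4, 150.7)

n = length(machine1)
deviations = machine1 - mean(machine1)      # distance of each value from the mean

# Sample standard deviation: sqrt of the sum of squared deviations over n - 1
sd_manual = sqrt(sum(deviations^2) / (n - 1))
sd_manual  # 1.552203, the same value sd(machine1) returns
```

R's sd() uses this same n − 1 (sample) denominator, which is appropriate here since the 10 measurements are a sample, not the machine's full output.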
> sd(machine1)
[1] 1.552203
> sd(machine2)
[1] 0.5907622
Again, we see that Machine #2's sample data sits closer to its mean. Since its mean is also higher than Machine #1's, all indicators point to Machine #2 as the winner.
I actually didn't construct a histogram to answer the question, but the instructor did, using the following R commands. The histograms visually show the data dispersion we're seeing in the IQR and standard deviation:
> hist(machine1, main="Machine 1", xlim = c(145,155))
> hist(machine2, main="Machine 2", xlim = c(145,155))
Which produce the following graphs: