I know you’ve heard the phrase “less is more”. Although that is true in many aspects of life, it isn’t true in data analytics. When it comes to data and analytics, more really is more. The more data you have, the better the insights you can derive, the more reliable the outcomes, and the more accurate your predictions and forecasts.
This might seem intuitive given the world we live in, but it is important to understand why. It comes down to a few key concepts:
- Proportions and distributions
- Outliers and variability
- Variety and sampling
Proportions and Distributions
To understand this concept, imagine you’re administering a survey. You want to know people’s favourite ice cream flavour. You ask 4 people and find that 1 person liked chocolate, 1 liked vanilla, 1 liked strawberry, and the last liked mint chip. According to this survey, each flavour is the favourite of 25% of people. You might conclude that these 4 flavours are the only ones a company should produce, and that there is no preference for one flavour over another. We know this is simply not the case, but the data would indicate otherwise. The technical term for each of these responses is an “observation”.
If you expand the survey from 4 people to 400, you might discover that 200 people (50%) say chocolate is their favourite. You might also discover that there are more than 4 favourite flavours, with the remaining 50% divided among 10 different flavours. The insights you can derive from this larger base are richer and more robust than those from the 4 people initially surveyed. You may discover that the ice cream company should produce more than twice as much chocolate as any other flavour. You may also discover that it should produce more than the 4 flavours initially found, and turn a higher profit in more niche markets. The problem is that with too few observations, the results are skewed because the sample’s distribution misrepresents the larger population.
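If you want to see this effect for yourself, here is a minimal sketch in Python. It assumes a made-up population in which chocolate is the favourite of 50% of people and ten other flavours share the rest equally. The flavour list, the `POPULATION` table, and the `run_survey` function are all hypothetical names invented for this illustration, not real survey data.

```python
import random
from collections import Counter

# Hypothetical "true" preferences for the whole population (an assumption
# for illustration only): chocolate at 50%, ten other flavours at 5% each.
OTHER_FLAVOURS = [
    "vanilla", "strawberry", "mint chip", "cookie dough", "coffee",
    "pistachio", "caramel", "mango", "rocky road", "butter pecan",
]
POPULATION = {"chocolate": 0.50, **{flavour: 0.05 for flavour in OTHER_FLAVOURS}}


def run_survey(n_people, seed=0):
    """Ask n_people for their favourite flavour; return each flavour's share."""
    rng = random.Random(seed)  # seeded so repeated runs give the same answers
    answers = rng.choices(list(POPULATION), weights=list(POPULATION.values()), k=n_people)
    counts = Counter(answers)
    return {flavour: count / n_people for flavour, count in counts.items()}


for n in (4, 400):
    shares = run_survey(n)
    print(f"{n} people surveyed, {len(shares)} flavours mentioned:")
    for flavour, share in sorted(shares.items(), key=lambda item: -item[1]):
        print(f"  {flavour}: {share:.1%}")
```

With 4 people, every share is forced to be a multiple of 25% and at most 4 of the 11 flavours can appear at all. With 400 people, the shares have room to settle much closer to the assumed population values.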
Outliers and Variability
The second concept is outliers and variability, which are also hard to identify in such small numbers. In the original example, the flavours were chocolate, strawberry, vanilla, and mint chip. I can imagine that if you asked everyone you knew, you might not find anyone whose favourite flavour is mint chip. This is an “outlier”. In the 400-person example you might find only 1, 2, or at most 10 people who say this flavour is their favourite. In the 4-person example 25% of respondents said their favourite was mint chip, but in the 400-person example it is at most 2.5%. When you have too few observations, outliers appear to represent a larger share of the population than they really do. If an ice cream manufacturer based its production and run rates on the 4-person study, it would produce far too much mint chip and be unable to sell it, wasting both money and product. The 400-person study would show that much less mint chip should be produced.
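The arithmetic behind the mint chip problem can be sketched in a few lines of Python. The function names and the production run of 10,000 tubs are assumptions made up for this example. Only the 1-in-4 and 10-in-400 figures come from the surveys described above.

```python
def share_of_sample(respondents_choosing, sample_size):
    """Percentage of the sample that picked a given flavour."""
    return 100 * respondents_choosing / sample_size


def tubs_to_produce(respondents_choosing, sample_size, total_tubs):
    """Split a production run in proportion to the survey result."""
    return total_tubs * respondents_choosing // sample_size


TOTAL_TUBS = 10_000  # hypothetical production run, for illustration only

# 4-person survey: 1 person chose mint chip
print(share_of_sample(1, 4))                # 25.0
print(tubs_to_produce(1, 4, TOTAL_TUBS))    # 2500

# 400-person survey: at most 10 people chose mint chip
print(share_of_sample(10, 400))               # 2.5
print(tubs_to_produce(10, 400, TOTAL_TUBS))   # 250
```

Under these assumptions, planning from the 4-person survey would mean making ten times as much mint chip as the 400-person survey supports.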
Variety and Sampling
This concept requires a perspective that goes beyond the individual. When it comes to variety and sampling, one must consider the largest pool of people being studied. Keeping with the ice cream example, one can assume that the manufacturer is looking to produce ice cream for the whole nation. This would factor in the preferences of tens of millions of people. Beyond the question of “what is your favourite ice cream?”, the manufacturer is going to want to know more about the people being studied. The idea of sampling comes in when you consider how well the sample represents the whole population. If the country is 50% men and 50% women, then the people surveyed need to reflect that split. The same is true for age, region, and other demographic information. Contrast that with the survey of 4 people: 2 of them would have to be women, each of them would have to be in a different age bracket, and they would have to come from various parts of the country. Do you see the problem? How can 4 people accurately represent several different age brackets, both sexes, various regions, and other demographic details? If we consider only four age brackets and two sexes, we would need a minimum of 8 people to cover every combination! To take this one step further, factoring in only 3 regions (west, central, and east), we would need 24 people! This subject on its own has a lot to unpack, but I’m sure you get the point.
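To make the counting concrete, here is a small Python sketch. It assumes that "representing" the population means having at least one respondent for every combination of age bracket, sex, and region, so the minimum is the product of the category counts. The age bracket labels and the `min_respondents` helper are hypothetical and chosen only for illustration.

```python
from itertools import product
from math import prod

# Hypothetical demographic categories; the age bracket labels are assumptions.
age_brackets = ["18-29", "30-44", "45-59", "60+"]
sexes = ["female", "male"]
regions = ["west", "central", "east"]


def min_respondents(*dimensions):
    """Fewest people needed to put one respondent in every combination."""
    return prod(len(dimension) for dimension in dimensions)


print(min_respondents(age_brackets, sexes))           # 8
print(min_respondents(age_brackets, sexes, regions))  # 24

# List the individual groups that each need at least one respondent.
for age, sex, region in product(age_brackets, sexes, regions):
    print(age, sex, region)
```

Every extra demographic detail multiplies the number of groups, and one person per group is only the bare minimum. A single respondent is hardly enough to speak for everyone in their group.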
For these reasons (and many more), the data tells us that more really is more. The more data available, the deeper and richer the insights we can draw from it.