OK, so it's a provocative title, but I'm trying to draw people in who have a warm, fuzzy and positive feeling about the term "big data", and point out some cautions using language that is, I hope, broadly accessible.
As Terry Speed points out in Data Science, Big Data and Statistics – can we all live together?, the term "big data" is gaining in popularity. Here's what Google Trends has to say about the term:
...and here's Terry at full-strength on this important topic
But how can "big data" (or perhaps more accurately "more data") be a bad thing? Surely it tells us more about the world and allows us to make more confident predictions about the future?
Risks with "big data"
One of the risks with "big data" is that we forget about how that data was gathered, and how that shapes the conclusions we draw. For example, if we study the health of people visiting their GP, these observations alone do not allow us to draw conclusions about the health of the general population, no matter how much data we collect.
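To make the GP example concrete, here is a minimal simulation sketch in Python. Everything in it is made up for illustration: the "health score", the population average of 70, and the rule that less healthy people are more likely to visit their GP are all assumptions, not real figures.

```python
import random

random.seed(0)

# Hypothetical population: each person has a "health score" (higher is healthier).
population = [random.gauss(70, 10) for _ in range(200_000)]

def visits_gp(score):
    # Assumed behaviour: the less healthy someone is, the more likely a GP visit.
    p = min(1.0, max(0.0, (100 - score) / 100))
    return random.random() < p

# The only people we get to observe are those who visited their GP.
gp_visitors = [s for s in population if visits_gp(s)]

def mean(xs):
    return sum(xs) / len(xs)

population_mean = mean(population)

# Estimate average health from ever larger samples of GP visitors.
for n in (100, 10_000, len(gp_visitors)):
    print(f"{n:>6} GP visitors, average health score: {mean(gp_visitors[:n]):.1f}")
print(f"whole population, average health score: {population_mean:.1f}")
```

Under these assumptions, the average among GP visitors settles below the average for the whole population, and collecting more GP visitors only makes us more certain of the wrong number.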
A second risk with "big data" lies in the "shape" of the data.
- Some data sets are big because there are lots of independent observations, each of which measures a relatively small number of things. A population census is a good example.
- Some data sets are big because each observation generates a lot of measurements. The sequence of a person's genome, the DNA inside their cells, is a good example. One person's genome will have around 6.4 billion "letters" of DNA in it (3.2 billion from mum, 3.2 billion from dad).
The risk when the number of measurements exceeds the number of observations is that there are potentially an infinite number of ways to fit that data perfectly, and most of them will reflect chance patterns rather than real relationships.
For instance, if we measure the DNA of 50 people who have a disease, and 50 people who don't, then there are an infinite number of ways to relate those 100 genomes to whether the people studied had the disease, each of which "explains" the 100 people perfectly.
The challenge (and this is an area of very active research) is to create predictive models of this kind of "big data" that work well (i.e., make accurate predictions) on new observations.
With our disease/non-disease example, we would like to create models of the relationship between a genome and the disease status that we could use confidently outside the 100 people who were studied.
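Here is a rough sketch of the problem in Python, scaled down so it runs in a moment: 10 people instead of 100, and 20,000 yes/no "markers" standing in for a genome. Those sizes, and the idea of predicting disease from a single marker, are simplifying assumptions of mine. Every marker is a coin flip, so by construction none of them has anything to do with the disease.

```python
import random

random.seed(1)

N_TRAIN = 10         # scaled down from the 100 people in the text
N_MARKERS = 20_000   # stand-in for the many positions in a genome
N_NEW = 1_000        # new people, not used when choosing markers

# 5 people with the disease (1) and 5 without (0).
train_status = [1] * 5 + [0] * 5

# Each marker is a list of coin flips, one per person studied.
train_markers = [[random.randint(0, 1) for _ in range(N_TRAIN)]
                 for _ in range(N_MARKERS)]

# Markers that match disease status for every single person studied.
perfect = [m for m, values in enumerate(train_markers) if values == train_status]

# New people: disease status and the chosen markers are again independent
# coin flips (only the chosen markers are simulated, as the rest are unused).
correct = total = 0
for _ in range(N_NEW):
    status = random.randint(0, 1)
    for _marker in perfect:
        marker_value = random.randint(0, 1)
        correct += (marker_value == status)
        total += 1

new_accuracy = correct / total if total else float("nan")
print("markers that fit the studied people perfectly:", len(perfect))
print(f"accuracy of those markers on new people: {new_accuracy:.2f}")
```

With this many markers and so few people, some markers match the disease status of everyone studied purely by chance. On new people they do no better than a coin toss, which is exactly why a model has to be judged on observations it has not seen.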
There are other risks posed by big data, including the temptation to misinterpret statistical significance: as we gather more data, even predictor variables with tiny, practically unimportant effects become statistically significant.
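A small Python sketch shows how sample size alone can drive statistical significance. The effect size (5% of one standard deviation), the group sizes and the use of a simple two-sample z-test are all assumptions chosen for illustration.

```python
import math
import random

random.seed(2)

TRUE_DIFFERENCE = 0.05  # assumed: a tiny real effect, 5% of one standard deviation

def p_value_for_difference(n):
    # Two groups of n people; group B's average is shifted by TRUE_DIFFERENCE.
    a = [random.gauss(0.0, 1.0) for _ in range(n)]
    b = [random.gauss(TRUE_DIFFERENCE, 1.0) for _ in range(n)]
    mean_a, mean_b = sum(a) / n, sum(b) / n
    var_a = sum((x - mean_a) ** 2 for x in a) / (n - 1)
    var_b = sum((x - mean_b) ** 2 for x in b) / (n - 1)
    z = (mean_b - mean_a) / math.sqrt(var_a / n + var_b / n)
    # Two-sided p-value from the normal approximation.
    return math.erfc(abs(z) / math.sqrt(2))

# Same tiny effect each time; only the amount of data changes.
results = {n: p_value_for_difference(n) for n in (100, 10_000, 200_000)}
for n, p in results.items():
    print(f"n = {n:>7} per group, p-value = {p:.3g}")
```

The underlying difference between the groups is identical in every row. With the largest sample the p-value becomes vanishingly small, yet the effect is no more important in practice than it was with 100 people.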
Fortunately, there are voices sounding notes of caution about this, a sample of which appears below.
If you have found a useful article that puts "big data" into perspective, then please let us know so we can add to this list.
- Big Data: are we making a big mistake? Tim Harford, Financial Times