Data Science and Big Data are a growing part of the business world. With data analysis often lauded as a crystal ball that can predict the future or cure cancer, it's easy to implicitly trust the data wizard and their expensive cluster of servers. However, doing meaningful data analysis is difficult and fraught with potential errors and incorrect conclusions.
In this article I will outline some of the challenges of data analysis and provide you with a series of questions to ask of any analysis to keep the Data Scientists honest, and to help you look like you know what you're talking about!
We generally can't observe the thing we want to measure
Data doesn't lie, but it is often misrepresented. In data analysis, the question we want answered is rarely directly observable. Rather, we look to phenomena that are related to the question we want answered and that we can measure. Hopefully the thing we can measure is at least correlated with what we want to measure.
An example of this is in counting unique people who view an ad. We cannot really measure this directly, as doing so would require being able to tell when the same person has used a different device, when someone passes an iPad to a friend, or when someone has the ad blocked.
Rather than saying it's unknowable, we use the data that we have. We count up unique IP addresses or cookies, or use browser profiling. The point here is that there are a number of assumptions underlying the analysis (e.g. 1 IP address == 1 person) that may not be correct.
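As a minimal sketch of the proxy problem, consider counting "unique visitors" from web logs. The log format and the 1-IP == 1-person assumption below are illustrative only; NAT, shared devices and blocked ads all break that assumption.

```python
# Sketch: estimating "unique visitors" by counting distinct IP addresses.
# This counts IPs, not people -- the 1 IP == 1 person assumption is exactly
# the kind of hidden assumption the analysis rests on.

def unique_visitors(log_lines):
    """Count distinct IPs: a proxy for, not a measurement of, unique people."""
    ips = set()
    for line in log_lines:
        ip = line.split()[0]  # assume the IP is the first field of each log line
        ips.add(ip)
    return len(ips)

logs = [
    "10.0.0.1 GET /ad.png",
    "10.0.0.2 GET /ad.png",
    "10.0.0.1 GET /ad.png",  # same IP: an office NAT? the same person twice?
]
print(unique_visitors(logs))  # 2 distinct IPs -- but how many people?
```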
The strength of the relationship between what we can measure and what we want to measure is critical.
Errors often look like interesting things and vice versa
A lot of data analysis revolves around discovery - finding interesting bits of data that tell an interesting story, or in finding relationships in the data.
There is often a fine line between something interesting and an error. Interesting things are often interesting because they deviate from the norm. These are also the things most likely to be errors. Thinking long and hard about what an error may look like, and about how something that looks like an error might be real, is critical to making decisions about how to evaluate outliers.
Evaluating data analysis
From my time as a cancer researcher, I learnt to be very sceptical about data and how it is processed. In any analysis, there are many steps where data can be massaged in ways that (unintentionally) create biases that may influence the outcome of the analysis.
It is important to note that errors and biases accumulate with each step of analysis. An error in the source data will often make it through the filtering and analysis stages, potentially leading to an incorrect conclusion.
The source data is the foundation of any analysis. Is the source data a direct observation of the phenomenon that you are interested in, or is it a proxy? If it's a proxy (and it probably is), how good of a proxy is it? How honestly is the source data portrayed in the analysis? What are the known biases in the source data?
Filtering of data
Most data sets are dirty, in that they contain systematic biases or data known to be irrelevant to the analysis. For example, when counting unique visitors to a website from log information we almost certainly want to exclude traffic from the Googlebot and other search indexers, since they don't represent real people.
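A crude version of that filtering step might look like the following sketch. The bot-marker list and the request record format are illustrative, not exhaustive; real crawler detection is considerably messier.

```python
# Sketch: excluding known crawlers by user-agent substring before counting
# visitors. The marker list here is a tiny, illustrative sample.

BOT_MARKERS = ("googlebot", "bingbot", "crawler", "spider")

def is_bot(user_agent):
    """Very rough heuristic: does the user-agent look like a known crawler?"""
    ua = user_agent.lower()
    return any(marker in ua for marker in BOT_MARKERS)

def filter_human_requests(requests):
    """Keep only requests that do not look like crawler traffic."""
    return [r for r in requests if not is_bot(r["user_agent"])]

requests = [
    {"ip": "10.0.0.1", "user_agent": "Mozilla/5.0 (Windows NT 10.0)"},
    {"ip": "66.249.66.1", "user_agent": "Mozilla/5.0 (compatible; Googlebot/2.1)"},
]
print(len(filter_human_requests(requests)))  # 1
```

Note that this very choice of markers is itself a filtering decision that colors the analysis: anything the list misses gets counted as a person.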
Filtering has the capacity to color any data analysis in very real ways. For the bad actor, it's simply a case of filtering out any data points which run contrary to the conclusion they wish to draw. For the inexperienced but honest data scientist, the resulting biases are likely to be subtler, but still dangerous.
There are literally hundreds of statistical methods that can be applied to a data set (though not all should be). This creates a situation similar to the problem of multiple comparisons, whereby a bad actor can keep trying different statistical methods until they land on one that delivers the outcome they want.
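The arithmetic behind the multiple comparisons problem is worth seeing directly. If each individual test has a 5% chance of a false positive on pure noise, the chance that at least one of k independent tests "succeeds" grows quickly:

```python
# Sketch: probability of at least one false positive when running k
# independent tests at significance level alpha on data that is pure noise.

def p_any_false_positive(k, alpha=0.05):
    """P(at least one false positive) across k independent tests."""
    return 1 - (1 - alpha) ** k

for k in (1, 5, 20):
    print(k, round(p_any_false_positive(k), 3))
# 1 0.05
# 5 0.226
# 20 0.642
```

So an analyst who quietly tries twenty methods has roughly a two-in-three chance of "discovering" something in noise.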
Machine Learning, when placed in the hands of an inexperienced analyst or data scientist, is particularly dangerous. Many tools, especially the venerable R, make it easy to run data through Support Vector Machines, Decision Trees, GLMs and Neural Networks without much work, so it's tempting for the time-constrained data scientist to develop a script that runs the data through all the models and then choose the one with the best result, without any thought to the underlying statistics or the models' assumptions.
Cross validation goes a long way towards validating a model, but it starts to fall apart if you are continually cross validating, because you are in effect using the entire dataset to choose your model (i.e. there is information leakage with each test run).
Beware of complexity
Very few non-technical people understand the inner workings of Support Vector Machines, Neural Networks or other statistical methods beyond simple averages, so they are often treated as black boxes. Because these methods seem so complicated, they either create an implicit trust ("oooh, that's advanced / cutting edge, it must be good!") or an implicit distrust. I definitely fall into the distrust camp. If a model is too complicated to explain to me with the aid of a whiteboard, the data scientist had better have a lot of supporting evidence of the model working on unseen data.
Where big data helps
More data gives us more confidence in rare events: with 100 data points, something that occurs at a rate of 0.1% is unlikely to show up in our analysis at all, but with 1,000,000 data points we should see it around 1,000 times. This makes rare events far more visible.
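That arithmetic can be checked directly. Assuming independent observations, the expected number of sightings is n times the rate, and the chance of seeing the event zero times follows from the complement:

```python
# Sketch: expected sightings of a 0.1% event at different sample sizes,
# and the chance of never seeing it at all in a small sample.

rate = 0.001

for n in (100, 1_000_000):
    expected = n * rate
    p_none = (1 - rate) ** n  # probability the event never appears
    print(f"n={n}: expect {expected:g} occurrences, P(none)={p_none:.3f}")
# n=100: expect 0.1 occurrences, P(none)=0.905
# n=1000000: expect 1000 occurrences, P(none)=0.000
```

With only 100 points there is roughly a 90% chance the rare event never appears in the data at all.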
More data also gives us the ability to test models more thoroughly. Where data sets are small, it can be challenging to test a model accurately, as there is simply not enough data left over for testing. With more data, the model can be built with a smaller percentage of the data, leaving more data for testing, or allowing the model to be tested repeatedly. This can help lend confidence to the efficacy of a model.
Larger data sets can give us more confidence that what we are seeing is real, but this effect is generally marginal. The law of large numbers tells us that as the sample size tends to infinity, the sample mean tends to be close to the population mean, and the central limit theorem tells us that as the sample size tends to infinity, the distribution of the sample mean tends towards normal. Thus a bigger sample is nice to have, but it makes a smaller and smaller difference as the data set gets bigger and bigger.
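The diminishing returns of the law of large numbers are easy to see empirically; a sketch, using a simulated fair die whose population mean is 3.5:

```python
# Sketch: the sample mean of a fair die drifts toward the population
# mean (3.5) as n grows, but each extra data point buys less precision.
import random

random.seed(1)

def sample_mean(n):
    """Mean of n simulated fair die rolls."""
    return sum(random.randint(1, 6) for _ in range(n)) / n

for n in (10, 1_000, 100_000):
    print(n, round(sample_mean(n), 3))
```

The jump from 10 to 1,000 points typically tightens the estimate far more than the jump from 1,000 to 100,000, even though the latter adds a hundred times more data.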
Where big data doesn't help
More garbage isn't useful.
Big data generally refers to having more data, not better data. Large data sets are still proxies for real-world phenomena; they generally just involve counting more items.
Collecting data at scale is more difficult and error prone
Collecting billions of events is still challenging to do reliably. This means that a really large data set is more likely to have holes in it (particularly during peak periods, when systems become flooded and services begin switching features off to maintain functionality in more important areas).
Larger data storage systems also tend to use "eventual consistency" across clusters of machines rather than the ACID-style guarantees more common in traditional relational databases. This makes these systems more likely to not only miss data, but mis-record data (e.g. record one event twice).
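If each event carries a unique identifier, double-recorded events can at least be deduplicated downstream. A minimal sketch, assuming an `event_id` field (the field name and record format are illustrative):

```python
# Sketch: deduplicating events that an eventually consistent pipeline may
# have recorded twice, e.g. because a client retried a failed write.

def dedupe(events):
    """Keep the first occurrence of each event_id, preserving order."""
    seen = set()
    unique = []
    for event in events:
        if event["event_id"] not in seen:
            seen.add(event["event_id"])
            unique.append(event)
    return unique

events = [
    {"event_id": "a1", "type": "click"},
    {"event_id": "a1", "type": "click"},  # recorded twice by a retry
    {"event_id": "b2", "type": "view"},
]
print(len(dedupe(events)))  # 2
```

Of course, this only helps with duplicates; events the pipeline dropped entirely leave no trace to deduplicate.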
Bigger data sets inspire more confidence unduly
A misinterpretation of high-school statistics may lead people to trust larger data sets more than they should. An ever-shrinking p-value is of little relevance if the raw data is wrong, or is measuring the wrong thing.
Big data makes analysis more difficult
> Big data is a blanket term for any collection of data sets so large and complex that it becomes difficult to process using on-hand data management tools or traditional data processing applications. [http://en.wikipedia.org/wiki/Big_data]
By definition big data is more difficult to analyze, which in turn makes it more difficult to assess the quality of the analysis. Big data is often analyzed through clusters of computers, which means more moving parts, more things that can potentially go wrong, and therefore more opportunities for errors to creep in.
Key questions to ask of any data analysis
What is actually being measured in the source data?
How closely related is what is being measured to the actual phenomena we care about? What work has been done in proving that relationship?
How is the data being collected? What technical problems may exist in the collection?
The processes used to collect data are rarely perfect. Many applications, when highly loaded, may turn off logging to keep the application from failing, leaving holes in your data. For sensor-based data collecting real-world samples, things are even more complicated.
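One cheap sanity check for such holes is to look for time buckets with no events at all. A crude sketch, using one-minute buckets over epoch-second timestamps (the granularity is an illustrative choice):

```python
# Sketch: a crude check for holes in time-series log data. If an
# application stops logging under load, whole minutes go missing.

def missing_minutes(timestamps):
    """Return minute buckets with no events between the first and last event."""
    buckets = {t // 60 for t in timestamps}  # epoch seconds -> minute index
    lo, hi = min(buckets), max(buckets)
    return [m for m in range(lo, hi + 1) if m not in buckets]

# Events at minutes 0, 1 and 4 (epoch seconds): minutes 2 and 3 are holes.
ts = [5, 70, 250]
print(missing_minutes(ts))  # [2, 3]
```

An empty result doesn't prove the data is complete (traffic may legitimately pause), but gaps that line up with known peak periods are a red flag.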
How is the data filtered / processed? What rules are used to exclude data?
Make sure that the Data Scientist / Analyst actually knows how they have filtered the data. If they haven't used any filtering, ask why: it's critical to see that they have thought about data quality and done some basic analysis on how good the data is.
What type of model is being used? How was it selected? What other models were tried and why did they fail?
Ensure that the Analyst / Data Scientist can explain why one model worked better than another.
How does the model perform on new data?
Try to keep a separate data set that the Data Scientist / Analyst does not have access to, which can be used to verify their model after they have delivered their results. This is really the only way to get an honest indication of how well the model performs.
Analyzing data is hard to do well. The idea behind this article is not to say that all data analysis is wrong or that all Data Scientists are charlatans. Rather it is to acknowledge how difficult the work can be, and to give you the tools to question the analysis rather than just follow it blindly.