When performing data analysis, it’s easy to fall into a few traps and end up making mistakes. Diligence is essential, and it’s wise to watch for the following seven common mistakes:
- Sampling bias
- Cherry-picking
- Disclosing metrics
- Overfitting
- Focusing only on the numbers
- Solution bias
- Communicating poorly
Let’s take a look at why each one can be problematic and how you might be able to avoid these issues.
The Why
Sampling bias occurs when a non-representative sample is used. For example, a political campaign might sample 1,300 voters only to find out that one political party’s members are dramatically overrepresented in the pool. Sampling bias skews the analysis in one direction, so conclusions drawn from the sample may not hold for the wider population.
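One simple way to spot this kind of overrepresentation is to compare each group's share of the sample with its share of the population. The sketch below is only an illustration: the group names, counts, population shares, and the 5-point threshold are all hypothetical, and the function name `representation_gaps` is made up for this example.

```python
def representation_gaps(sample_counts, population_shares):
    """Return sample share minus population share for each group."""
    total = sum(sample_counts.values())
    return {
        group: sample_counts.get(group, 0) / total - share
        for group, share in population_shares.items()
    }

# Hypothetical numbers for a 1,300-voter sample; not real polling data.
sample_counts = {"party_a": 780, "party_b": 455, "other": 65}
population_shares = {"party_a": 0.40, "party_b": 0.35, "other": 0.25}

gaps = representation_gaps(sample_counts, population_shares)

# Flag groups whose share is off by more than 5 percentage points
# (an arbitrary threshold chosen for illustration).
flagged = {group for group, gap in gaps.items() if abs(gap) > 0.05}

for group, gap in gaps.items():
    print(f"{group}: {gap:+.1%}")
print("Needs attention:", sorted(flagged))
```

A check like this only works when trustworthy population shares are available, for example from a census or voter registration records.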
Cherry-picking happens when data is selectively chosen to support a particular hypothesis. It’s one of the more intentional problems on this list because there’s always a temptation to give the analysis a nudge in the “right” direction. Cherry-picking is not only unethical; it can also have serious consequences in fields like public policy, engineering, and health.
Disclosing metrics is a problem because a metric tends to lose its usefulness once the people being measured know what it is and start optimizing for it, an effect often called Goodhart’s law. This creates problems like the habit in education of teaching to what’s on standardized tests. A similar problem occurred in the early days of internet search, when websites started stuffing their content with keywords to game the way pages were ranked.
Overfitting tends to happen during the analysis process. Someone might have a model, for example, whose curve closely tracks the existing data and therefore seems predictive. Unfortunately, the curve may fit so well only because the model was tuned to that particular data, noise included. The failure may only become apparent when the model is compared to future observations that it fits far less well.
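The gap between performance on the data a model was built from and performance on fresh data is the usual warning sign. The following sketch uses made-up data (a straight line plus random noise) and a deliberately extreme "model" that simply memorizes the training points. All names and numbers are assumptions chosen for illustration.

```python
import random

def make_data(n, rng, noise=1.0):
    """Hypothetical data: y = 2x plus Gaussian noise."""
    xs = [rng.uniform(0, 10) for _ in range(n)]
    ys = [2 * x + rng.gauss(0, noise) for x in xs]
    return xs, ys

def nearest_neighbor_predict(train_x, train_y, x):
    """Memorizing model: return the y of the closest training x."""
    idx = min(range(len(train_x)), key=lambda i: abs(train_x[i] - x))
    return train_y[idx]

def mean_squared_error(predict, xs, ys):
    """Average squared difference between predictions and observations."""
    return sum((predict(x) - y) ** 2 for x, y in zip(xs, ys)) / len(xs)

rng = random.Random(0)  # fixed seed so the run is repeatable
train_x, train_y = make_data(50, rng)
test_x, test_y = make_data(50, rng)  # stands in for "future observations"

def model(x):
    return nearest_neighbor_predict(train_x, train_y, x)

train_error = mean_squared_error(model, train_x, train_y)
test_error = mean_squared_error(model, test_x, test_y)

print(f"error on training data: {train_error:.3f}")
print(f"error on held-out data: {test_error:.3f}")
```

The memorizing model reproduces its training data perfectly, because it has learned the noise along with the trend, and does worse on the held-out points. Holding back part of the data for this kind of comparison is one common way to catch overfitting before future observations do.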
Focusing only on the numbers is worrisome because it can have adverse real-world consequences. For example, existing social biases can be fed into models. A company handling lending might produce a model that reproduces geographic bias because it was built on data derived from biased sources. The numbers may look clean and neat, but the underlying biases can be socially and economically harmful.
Solution bias can be thought of as the gentler cousin of cherry-picking. With solution bias, a solution might be so cool, interesting or elegant that it’s hard not to fall in love with it. Unfortunately, the solution might be wrong, and appropriate levels of scientific and mathematical rigor might not be applied because refuting the solution would be disheartening.
Communicating poorly is more problematic than you might expect. Producing analysis is one thing, but conveying findings in an accessible manner to people who didn’t participate in the project is critical. Data scientists need to be comfortable with producing elegant and engaging dashboards, charts and other work products to ensure their findings are well-communicated.
How to Avoid These Problems
Process and diligence are your primary weapons in combating mistakes in data analysis. First, you must have a process in place that emphasizes the importance of getting things right. When you’re creating a data science experiment, there need to be checks in place that will force you to stop and consider things like:
- Where is the data coming from?
- Are there known biases in the data?
- Can you screen the data for problems?
- Who is checking everybody’s work?
- When will results be re-analyzed to verify integrity?
- Are there ethical, social, economic or moral implications that need to be examined more closely before starting?
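Some of these checks, particularly screening the data for problems, can be partly automated. Below is a minimal sketch that counts missing values, out-of-range values, and exact duplicate rows in a small list of records. The field names, the valid age range, and the function name `screen_records` are hypothetical; a real project would define its own rules.

```python
from collections import Counter

def screen_records(records, required_fields, valid_ranges):
    """Return simple data-quality counts for a list of dict records."""
    missing = Counter()
    out_of_range = Counter()
    for row in records:
        for field in required_fields:
            if row.get(field) is None:
                missing[field] += 1
        for field, (low, high) in valid_ranges.items():
            value = row.get(field)
            if value is not None and not (low <= value <= high):
                out_of_range[field] += 1
    # Count extra copies of rows that appear more than once.
    row_counts = Counter(tuple(sorted(row.items())) for row in records)
    duplicates = sum(count - 1 for count in row_counts.values())
    return {
        "missing": dict(missing),
        "out_of_range": dict(out_of_range),
        "duplicates": duplicates,
    }

# Hypothetical records with one missing age, one impossible age,
# and one exact duplicate row.
records = [
    {"id": 1, "age": 34, "region": "north"},
    {"id": 2, "age": None, "region": "south"},
    {"id": 3, "age": 212, "region": "north"},
    {"id": 3, "age": 212, "region": "north"},
]

report = screen_records(
    records,
    required_fields=["id", "age", "region"],
    valid_ranges={"age": (0, 120)},
)
print(report)
```

Automated screening like this catches mechanical problems only. Questions about where the data came from and what biases it carries still need a person to answer them.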
Diligence is also essential. You should be asking whether:
- You have a large and representative enough sample to work with
- There are more rigorous ways to conduct the analysis
- Analysts are following properly outlined procedures
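For the question of whether a sample is large enough, a rough starting point when estimating a proportion (such as the share of voters supporting a candidate) is the standard formula n = z² · p · (1 − p) / e², where e is the desired margin of error. The sketch below assumes a simple random sample from a large population, a 95% confidence level (z = 1.96), and the most conservative guess p = 0.5. The function name is made up for this example.

```python
import math

def sample_size_for_proportion(margin_of_error, z=1.96, p=0.5):
    """Minimum n for estimating a proportion: n = z^2 * p * (1 - p) / e^2.

    Assumes a simple random sample from a large population.
    z = 1.96 corresponds to 95% confidence; p = 0.5 is the worst case.
    """
    return math.ceil(z ** 2 * p * (1 - p) / margin_of_error ** 2)

n_required = sample_size_for_proportion(0.03)  # 3-point margin of error
print(n_required)  # 1068
```

Under these assumptions, a 3-point margin of error calls for about 1,068 respondents. Note that this addresses size only: a large sample that is not representative is still biased.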
Tackling a data science project requires ample planning. You also have to consider ways to refine your work and keep improving your processes over time. It takes commitment, but a group with the right culture can do a better job of steering clear of avoidable mistakes.