Information is everywhere today, and the biggest challenge is that we have few reliable ways to verify whether it is credible. More often than not, we simply assume what we hear, read, or see is factual, although that has slowly started to change in the current media climate.
While misinformation at large is a major issue, it’s even more dangerous in the world of statistics. This is a field built on numbers, one you would hope you could rely on, and yet it can be a major culprit for this exact problem.
Mark Twain popularized the saying, attributing it to Benjamin Disraeli: “There are three kinds of lies: lies, damned lies, and statistics.” The problem with statistics is that with enough wizardry, you can get the data to say pretty much anything you want.
Strategically adding or removing data can dramatically change the outcome of any analysis. While such instances are more nefarious, we can all fall prey to a simpler challenge when it comes to data: misunderstanding correlation and causation. Let’s dig a little deeper.
Correlation vs. Causation
The general premise is that when a relationship is discovered between two variables, the logical next leap is to assume that one variable causes the other. This could be true, but more often than not, it is false.
This is what statisticians call a logical fallacy, specifically cum hoc ergo propter hoc: assuming that because two things happen together, one must cause the other. It’s an easy trap to fall into. When some major change happens, we try to pinpoint the cause, often by linking the outcome to a specific event or change that happened around the same time. At best, we have uncovered a correlation, but NOT causation.
What’s the difference? Correlation simply shows that two variables have a visible relationship, and that’s all it means. It does not mean that variable X directly impacts variable Y, or vice versa. We have a hypothesis, but far too often we treat that correlation as established truth.
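To make this concrete, here’s a minimal sketch in Python (the variables are made up for illustration) of how a strong correlation can appear with no causal link at all: a hidden third factor drives two otherwise unrelated variables, and they end up almost perfectly correlated anyway.

```python
import numpy as np

rng = np.random.default_rng(42)

# A hidden confounder Z drives both X and Y; X and Y never touch each other.
z = rng.normal(size=1_000)
x = 2 * z + rng.normal(scale=0.5, size=1_000)
y = -3 * z + rng.normal(scale=0.5, size=1_000)

# The Pearson correlation between X and Y is strong regardless.
r = np.corrcoef(x, y)[0, 1]
print(f"correlation between X and Y: {r:.2f}")  # close to -1
```

Seeing an r near -1 here tells us nothing about X causing Y; the entire relationship lives in the confounder Z.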
Entire books have been written about the nonsense that ensues when we rely on such a simplified view of causation. Here are some examples to get the point across.
How about linking US spending on science, space, and technology to the suicide rate? Or the number of people who drown in swimming pools to the number of films Nicolas Cage appears in each year? Both are equally ridiculous, yet the data makes a case for each.
These are just two of many examples from the insightful book Spurious Correlations, which makes a compelling case against leaping to conclusions about correlations.
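This sort of spurious correlation is surprisingly easy to reproduce. As a rough sketch, two completely independent random walks, standing in for the kind of slowly trending yearly series the book charts, will frequently correlate strongly by chance alone:

```python
import numpy as np

rng = np.random.default_rng(7)

# Two independent random walks, e.g. two unrelated yearly totals
# that each happen to drift in some direction over time.
years = 30
a = rng.normal(size=years).cumsum()
b = rng.normal(size=years).cumsum()

print(f"r = {np.corrcoef(a, b)[0, 1]:.2f}")
# Rerun with different seeds: large values of |r| show up often,
# even though the two series share nothing at all.
```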
One of my favorite examples is a newspaper editorial published some years ago that argued that speaking English kills you. You can’t argue with this logic:
It’s a relief to know the truth after all those conflicting medical studies. The Japanese eat very little fat and suffer fewer heart attacks than the British or Americans.
The French eat a lot of fat and also suffer fewer heart attacks than the British or Americans.
The Japanese drink very little red wine and suffer fewer heart attacks than the British or Americans.
The Italians drink excessive amounts of red wine and also suffer fewer heart attacks than the British or Americans.
The Germans drink a lot of beer and eat lots of sausages and fats and suffer fewer heart attacks than the British or Americans.
Conclusion: Eat and drink what you like. Speaking English is apparently what kills you.
While most of the time this sort of thinking is harmless, it can also be very dangerous, especially when misconstrued data is used to make important decisions. A correlation can easily be a one-off that won’t stand the test of time. It doesn’t mean that changing X will directly impact Y. We have to keep that distinction clear when looking at data.
All this is to say that correlation is not causation.
Steps For Evaluating Data
With this in mind, there are a few things we would all be better off doing when looking at data.
1. Look for correlation
This is still the natural first step. We’re trying to find causation, and a good place to start is with variables that show correlation. What changed? What impacts the variable you’re tracking? What explains this behavior?
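A correlation scan is a one-liner in most data tools. Here’s a minimal pandas sketch; the metrics and numbers are invented purely for illustration:

```python
import pandas as pd

# Hypothetical weekly product metrics.
df = pd.DataFrame({
    "signups":    [120, 135, 150, 160, 180, 210],
    "ad_spend":   [1000, 1100, 1250, 1300, 1500, 1700],
    "blog_posts": [2, 3, 2, 4, 3, 5],
})

# Rank every variable by its correlation with the metric we care about.
print(df.corr()["signups"].sort_values(ascending=False))
```

Anything that scores high here is a candidate explanation, nothing more.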
The important thing is not to stop here.
2. Test your hypothesis
Here is where we often go wrong. We analyze data, we find a correlation, we assume it is the reason behind what we are seeing, and we take action on that assumption.
The problem is that the correlated variable is often not the cause at all. But how do we know? We need to test our hypothesis, whether through A/B testing or some other methodology. Until we can isolate these variables and track their impact directly, we can’t truly conclude any sort of causal relationship.
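The reason a randomized test earns us causal language is that random assignment severs the link to any hidden confounder. A minimal sketch with SciPy, using made-up conversion numbers, might look like this:

```python
from scipy import stats

# Hypothetical A/B test: visitors are randomly split between the old
# page (A) and the new page (B); we count conversions in each arm.
conversions = {"A": 48, "B": 73}
visitors = {"A": 1000, "B": 1000}

# 2x2 contingency table: [converted, did not convert] per variant.
table = [
    [conversions["A"], visitors["A"] - conversions["A"]],
    [conversions["B"], visitors["B"] - conversions["B"]],
]

_, p_value = stats.fisher_exact(table)
print(f"p-value: {p_value:.4f}")
# Because assignment was random, a small p-value points at the page
# change itself rather than some lurking third variable.
```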
We must test correlations to truly identify causation.
Dealing with correlation and causation isn’t simple by any means, but it is important.
Simply being mindful of this distinction goes a long way. And when you have the opportunity to prove causation exists, you should take it.