Here’s how media make data lie (and how to spot it)

Brice Vergnou
5 min read · Mar 20, 2022
By Lukas on Pexels

You can make the data say whatever you want it to.

You may have heard this phrase before. You may even have used it yourself. But is it really something we can agree on?

Most of the time (if not always), the answer is no. Data is neutral: it records what was measured at a given moment, so in theory it cannot lie. How, then, do we explain all the practices that let people present flattering figures to their audience?

The problem isn’t the data itself, but the way we interpret it. To make this clearer, I’ll walk you through an example before giving you some general tips.

A case study: Fatal Police Shootings in the US

This example uses a freely available dataset on police shootings in the US, along with my data analysis project based on it.

Making the data lie

Let’s imagine I ask you to make a chart of fatalities by race. Naturally, we would make a chart (a pie chart in this example) showing the number of fatalities for each race divided by the total number of fatalities. You would get something like this:

Image by author
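The naive calculation behind such a pie chart can be sketched in a few lines. This is only an illustration, not the code from my project: the `records` list is a made-up toy sample, and `share_by_group` is a hypothetical helper name.

```python
from collections import Counter

# Hypothetical toy records, NOT taken from the real dataset.
records = ["White", "White", "Black", "Hispanic",
           "White", "Black", "White", "Other"]


def share_by_group(labels):
    """Return each group's share of the total number of records."""
    counts = Counter(labels)
    total = sum(counts.values())
    return {group: count / total for group, count in counts.items()}


shares = share_by_group(records)

# Print the shares from largest to smallest; these are the pie slices.
for group, share in sorted(shares.items(), key=lambda kv: -kv[1]):
    print(f"{group}: {share:.0%}")
```

Note that nothing in this calculation knows how large each group is in the general population.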

And then we would tell everyone that half of the police victims are white. Seems a bit off, right? The figures are real, and I didn’t add or remove any record from the database, so the problem must come from our plot… but how?

We should have asked ourselves what we were looking at before announcing the results. Our goal was to see whether race is a significant factor, but this plot only shows the distribution of victims. White people take up a large part of it simply because they make up a much larger share of the population, so this plot can’t answer our question.

We need to see how likely you are to be a victim of a police shooting depending on your race. Conditional probability sounds like the right tool.

Solving the problem

You may remember conditional probability from high school. It expresses the probability of an event occurring given that another event has already occurred.

From onlinemathlearning
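For reference, the standard textbook definition of conditional probability, for two events A and B with P(B) > 0, is:

```latex
P(A \mid B) = \frac{P(A \cap B)}{P(B)}, \qquad P(B) > 0
```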

Here, for a given ethnicity X, we want the chance of being killed by the police. This way, the population share no longer matters. We can’t apply the formula above directly because we don’t have the right figures, but there is another option: divide the number of deaths for ethnicity X by the population of that ethnicity.

First, we create a temporary dataset and note the American population at that time (2016)

Then we calculate the population per ethnicity

We get the conditional probability for each ethnicity

And plot the graph!
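The steps above can be sketched roughly as follows. This is a minimal stdlib-only illustration, not the code from my project: the group names, population shares, and death counts are invented placeholders, and `population_by_group` and `rate_per_million` are hypothetical helper names.

```python
# Hypothetical placeholder numbers for illustration only.
# They are NOT the real 2016 population or the real dataset counts.
TOTAL_POPULATION = 1_000_000
population_share = {"Group A": 0.60, "Group B": 0.15, "Group C": 0.25}
deaths = {"Group A": 50, "Group B": 30, "Group C": 20}


def population_by_group(total, shares):
    """Step 2: estimate each group's population from its share."""
    return {group: total * share for group, share in shares.items()}


def rate_per_million(deaths_by_group, population):
    """Step 3: deaths in a group divided by that group's population."""
    return {
        group: deaths_by_group[group] / population[group] * 1_000_000
        for group in deaths_by_group
    }


population = population_by_group(TOTAL_POPULATION, population_share)
rates = rate_per_million(deaths, population)

# The naive view, for comparison: each group's share of all victims.
total_deaths = sum(deaths.values())
victim_share = {group: count / total_deaths for group, count in deaths.items()}

# Step 4: a crude text bar chart of the per-million rates.
for group, rate in rates.items():
    bar = "#" * round(rate / 10)
    print(f"{group:8s} {victim_share[group]:5.0%} of victims  {bar}")
```

In this toy setup Group A accounts for half of the victims, yet Group B has the highest rate once group size is taken into account.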

Giving this plot:

Image by author

And NOW you can see what is really going on. You went from showing that half of the victims were white to showing that a Black person is about 5x more likely to be killed by the police, all with the same data. We just changed the way we look at the problem.

General mindset

So… we’ve seen a concrete example of getting “fake” results from your data, but what if you find some in the wild?

In general, ask yourself what insight these people are trying to give you and whether that insight is relevant. Here are a few of the main ways to make “fake data”:

  • Biased sampling: the people surveyed form a specific group within the population, selected on some criterion (e.g. asking only people from the lower class whether their work and the recognition that comes with it satisfy them)
  • Choosing the average instead of the median: the average can be a great tool to show trends, but outliers pull it away from the typical value (e.g. the average salary in a country is usually higher than the median salary because millionaires and billionaires pull the average up)
  • Cumulating figures: reporting the total of something in an area instead of a ratio that can be compared with other areas (e.g. in 2018, China had the largest GDP, at about $25 trillion when measured at purchasing power parity. But in GDP per capita, China ranked only around 79th, with about $14,000 per capita. Because China was the most populous country, the cumulative figure flatters it).
  • A misleading visualization: truncating or stretching the y-axis to exaggerate or downplay differences in the data. Look at this graph for example:
Source
  • Correlation mistaken for causation: a few years ago, my economics teacher showed us a plot “proving” the correlation between the amount of chocolate eaten and the number of Nobel Prizes by country, to conclude that chocolate makes people smart:
Taken from Twitter; I couldn’t find the original paper

The problem with this graph is that it hides a third feature: wealth. Nobel laureates come from education, which comes from wealth, and the same goes for chocolate consumption. Watch out for this kind of situation, where a hidden third feature drives both variables or where the features in the plot are not actually related.
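Two of the pitfalls above are easy to reproduce with made-up numbers. The sketch below is only an illustration under stated assumptions: the salaries are invented, the "wealth", "chocolate" and "nobel" values are simulated random numbers rather than real country data, and `pearson` is a small hypothetical helper written here for the example.

```python
import random
import statistics

# --- Mean vs. median: invented salaries with one extreme outlier ---
salaries = [30_000, 32_000, 35_000, 38_000, 40_000, 5_000_000]
mean_salary = statistics.mean(salaries)
median_salary = statistics.median(salaries)
print(f"mean: {mean_salary:,.0f}  median: {median_salary:,.0f}")


# --- Hidden third feature: simulated data, not real countries ---
def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    mx, my = statistics.fmean(xs), statistics.fmean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5


random.seed(0)  # fixed seed so the run is reproducible
wealth = [random.gauss(0, 1) for _ in range(2000)]
# Both variables depend on wealth plus independent noise,
# and neither one influences the other.
chocolate = [w + random.gauss(0, 1) for w in wealth]
nobel = [w + random.gauss(0, 1) for w in wealth]

raw_corr = pearson(chocolate, nobel)

# Remove the contribution of wealth and correlate what is left.
chocolate_rest = [c - w for c, w in zip(chocolate, wealth)]
nobel_rest = [n - w for n, w in zip(nobel, wealth)]
adjusted_corr = pearson(chocolate_rest, nobel_rest)

print(f"raw correlation: {raw_corr:.2f}")
print(f"after removing wealth: {adjusted_corr:.2f}")
```

The single large salary drags the mean far above the median, and the simulated chocolate and Nobel values are clearly correlated until the shared wealth component is taken out.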

Thanks for reading my article. I hope it gave you some intuition for spotting abuses of data analysis and for avoiding them yourself.

If you like my content, please consider following me on Medium or Twitter :)

Happy learning
