10-Minute Talks: Dark data

by Professor David Hand FBA

18 Nov 2020



Professor David Hand FBA explores dark data in the context of COVID-19, the many ways in which we can be blind to missing data, and how that can lead us to conclusions and actions that are mistaken, dangerous, or even disastrous.


Hello, my name's David Hand and I'm a Fellow of the British Academy and Emeritus Professor of Mathematics at Imperial College, London, where I previously held the chair of statistics. Today, I'm going to say a little bit about data science.

Data science has become a hot topic in recent years. It promises to revolutionise the world we live in, bringing economic, social and health benefits. But in the past few months, nowhere has the central role of data become more apparent than in the COVID-19 pandemic. Policies, decisions and planning, all balancing health impact against economic impact, must be based on comparing what we think will happen if we take no action, with what we think will happen if we do take action and indeed, on comparing expected outcomes under different actions. All of this is predicated on having a good understanding of the disease and that understanding is in turn based on high-quality data. Good, accurate and timely data are critical for reacting to and managing the pandemic.

Unfortunately, we live in the real world. The real world is beset by dark data. Dark data are data you don't have. This might be because you want today's data, but all you have is yesterday's. It might be because your sample is distorted, perhaps certain types of cases are missing. It might be because the recorded values are inaccurate – after all, no measurement instrument is perfect. It might even be that the data are available, but unexamined, gently decaying in a giant data warehouse, unlooked at because they were collected purely for compliance reasons, for example. In my book, Dark Data: Why What You Don't Know Matters, I give a taxonomy of 15 types of dark data and I'm sure this is not a comprehensive list. New data sources, new ways of collecting data and new types of data are arising all the time and all of these bring with them new opportunities for dark data of new kinds.

Dark data are problematic, because they mean we can draw wrong conclusions. These conclusions in turn lead to incorrect decisions and inappropriate actions. While these might have minor consequences, they can also have major ones.

Fortunes and lives have been lost because of dark data.

Fortunately, statisticians have developed tools to tackle dark data issues and indeed, have gone even further and devised strategies to use dark data to advantage. A simple example of that is a double-blind randomised controlled clinical trial, where the identity of the treatment being administered is concealed from the patient and from the clinician.

In this brief talk, I want to focus on the dangers and in particular, how dark data has manifested itself in the COVID-19 pandemic.

Scientific understanding and political management of the COVID-19 crisis have been riddled with dark data issues. Indeed, Stanford scientist John Ioannidis has called the pandemic "a once-in-a-century evidence fiasco". Unfortunately, as the many examples in my book show, such evidence fiascos are far from uncommon, but they may not always be as overt as in the pandemic.

One of the particular challenges with an epidemic is that at the start you don't have much data to go on. You're largely in the dark, almost by definition. This means that projections must inevitably come with large uncertainty bands. Even an accurate death count can mean that the death rate is lower than you think, if you had an under-count of the number of infected. In the case of COVID, when many infected people are symptomless, this is likely to have been the case.
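The arithmetic behind the under-count point can be sketched in a few lines of Python. This is an illustration only: the function name `fatality_rate` and all of the numbers are hypothetical and are not taken from the talk or from any real dataset. It shows that the same death count gives a much lower rate once undetected infections are added to the denominator.

```python
def fatality_rate(deaths: int, infections: int) -> float:
    """Return deaths divided by infections, as a fraction."""
    if infections <= 0:
        raise ValueError("infections must be positive")
    return deaths / infections


# Hypothetical figures, chosen only to illustrate the effect.
deaths = 100               # assumed to be counted accurately
confirmed_cases = 2_000    # infections that were detected
undetected_cases = 8_000   # symptomless infections: the dark data

# Rate computed from the data we have.
naive_rate = fatality_rate(deaths, confirmed_cases)

# Rate computed if the undetected infections were known.
adjusted_rate = fatality_rate(deaths, confirmed_cases + undetected_cases)

print(f"rate from confirmed cases only:  {naive_rate:.1%}")
print(f"rate including undetected cases: {adjusted_rate:.1%}")
```

In practice the number of undetected infections is itself unknown and would have to be estimated, for example from a survey of a random sample of the population, which brings its own uncertainty.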

Gradually, over time, light will be shone onto the dark data, improving understanding. Unfortunately, management decisions typically cannot wait. If initial projections suggest that there's a high probability that the illness will overwhelm the health services, leading to bodies piling up unburied, then action has to be taken now to restrict transmission, even if you don't have all the data. Even if retrospective analysis based on more complete data shows that such extreme actions were unnecessary and that the economic costs outweigh the social costs, that doesn't mean it was a bad decision at the time.

With hindsight, it's easy to say we should have done such-and-such. Moreover, working out what would have happened if other action had been taken is a type of dark data of its own. We can never be sure of counterfactuals. Donald Rumsfeld famously referred to "known unknowns and unknown unknowns". These are two important types of dark data. Data we know are missing and data we don't know are missing.

If we know data are missing, we can try to do something about it, but if we don't know data are missing, we can blunder into terrible errors. While it seems that youngsters suffer only mildly from the illness itself, it certainly has after-effects in adults: so-called long COVID, including lung damage, neurological damage and heart damage. So what about in children? Not to mention possible long-term psychological and sociological consequences arising from the lockdown and the loss of normal social interactions.

The key to coping with an epidemic is detecting as quickly as possible who has the disease, so that they can be prevented from infecting others. The front line of detection is spotting the symptoms, though biological tests are better. The problem, however, in both cases is that they're not 100 per cent reliable. They give false positives and false negatives. Even a totally reliable biological test relies on human administration, so at the bottom line, it's not totally reliable. While symptoms of COVID-19 include fever, headache, cough, breathing difficulties, high temperature and loss of smell and taste, there's a good chance that any one of us has recently experienced at least one of those symptoms without having COVID-19. Having the symptoms doesn't mean you have COVID-19 and perhaps worse, not having them does not mean you don't have the disease.
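The effect of false positives and false negatives can be made concrete with Bayes' rule. The following is a minimal sketch, not a description of any real COVID-19 test: the prevalence, sensitivity and specificity values are hypothetical, and the function names are introduced here purely for illustration.

```python
def positive_predictive_value(prevalence: float,
                              sensitivity: float,
                              specificity: float) -> float:
    """Probability of having the disease given a positive result."""
    true_pos = prevalence * sensitivity
    false_pos = (1 - prevalence) * (1 - specificity)
    return true_pos / (true_pos + false_pos)


def negative_predictive_value(prevalence: float,
                              sensitivity: float,
                              specificity: float) -> float:
    """Probability of not having the disease given a negative result."""
    true_neg = (1 - prevalence) * specificity
    false_neg = prevalence * (1 - sensitivity)
    return true_neg / (true_neg + false_neg)


# Hypothetical values, chosen only for illustration.
prevalence = 0.01    # 1 person in 100 is infected
sensitivity = 0.80   # the test detects 80% of infected people
specificity = 0.95   # the test clears 95% of uninfected people

ppv = positive_predictive_value(prevalence, sensitivity, specificity)
npv = negative_predictive_value(prevalence, sensitivity, specificity)

print(f"P(infected | positive result):     {ppv:.1%}")
print(f"P(not infected | negative result): {npv:.1%}")
```

Under these assumed values most positive results come from uninfected people, because the uninfected group is so much larger than the infected group. A negative result is more reassuring but still not a guarantee.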

Dark data problems have arisen in many other ways in the pandemic. People themselves may decide whether to contact health services, so that a distorted sample appears, including only those motivated to take action. Likewise, people choose, or do not choose, to download a contact tracing app. It's gradually become apparent that certain subgroups of the population are more vulnerable than others: the aged and males, for example, as well as BAME groups. We can only recognise this if we happen to record the distinguishing characteristics, the age, for example. What about all those characteristics we didn't think to take note of?

Modelling and understanding the evolution of an epidemic is complicated by the actions people take. There's a feedback process, where increased awareness leads to changes in the phenomenon being studied. The impact of this can be dramatic. For example, if restaurant meals and haircuts figure in the basket used to calculate an inflation index, how can we estimate inflation when we cannot go to a restaurant or hairdresser? Imputing values for the missing prices seems dubious, since it is not simply a question of them existing but not being recorded. They just don't exist.

As I record this talk, we're still in a state of uncertainty about many aspects of the pandemic. What factors are influential in predicting how severely people suffer from the illness? How serious are the after-effects of being ill? When will an effective vaccine be developed, indeed, will one be developed? How long will the economy take to recover, and so on and on? One of the striking and, I think, encouraging things about the pandemic is how rapidly we're learning about it, as well as how rapidly we're putting in place systems to learn, to shine a light on dark data.

More generally, the pandemic has also raised awareness of the critical importance of the social sciences. After all, managing the pandemic requires an understanding of how people behave. Perhaps above all, however, the pandemic has raised awareness of the importance of valid, relevant, timely and accurate data and of the risks of dark data. Thank you very much.

This talk originally took place on 18 November 2020, part of the series The British Academy 10-Minute Talks, where the world’s leading professors explain the latest thinking in the humanities and social sciences in just 10 minutes. 10-Minute Talks are screened each Wednesday, 13:00-13:10, on YouTube and available on Apple Podcasts.

Further reading

Dark Data website.

"Dark Data: Why What You Don't Know Matters" podcast featuring David Hand.

The Improbability Principle website.

Statistics: A Very Short Introduction, David Hand.

Measurement: A Very Short Introduction, David Hand.

Lead image: Woman looking at wall with code. © Stanislaw Pytel via Getty Images.
