What is statistics?
23 Oct 2020
Statistics is the technology of extracting information, illumination and understanding from data, often in the face of uncertainty. This information can then be used to inform decisions and actions, ranging from medical diagnosis, to corporate or government planning, to finance and elsewhere. Statistics is ubiquitous, impacting all walks of life, although often much of it is behind the scenes.
I like to think of statisticians as the modern equivalent of explorers of old, but instead of being limited to geographical vistas they explore many different domains, as revealed by data. You can see from this that statistics is a very exciting discipline, quite the opposite of the common perception that it is dry and dusty. That image is a legacy of the past, when arithmetic facility was necessary, but the computer has removed that need. Modern statistics uses advanced software tools and probes data in ways that would have been impossible to imagine not so long ago.
Statistics is sometimes seen as a part of mathematics, but in a real sense the two disciplines have diametrically opposed aims. In caricature, mathematics takes an artificial world (a set of axioms) and aims to deduce the consequences – what the world would look like. In contrast, statistics takes the consequences (the data) and aims to deduce what kind of world could have produced those data. Of course, statistical tools are described in mathematical terms, but so are the tools of surveying, accountancy, physics, economics and so on, and they are not seen as part of mathematics.
The core of data science is statistics, supplemented with some computer science for manipulating data along with domain knowledge about the problems and properties of the data being dealt with. Likewise, statistical concepts and methods lie at the heart of machine learning and artificial intelligence, which can be thought of as statistical systems which adapt to incoming data.
Probability plays a central role in modern statistics. This is because data are seldom perfect, having associated uncertainty, and also because the aim is often to make an inference to a population from a sample of values – for example, to estimate the average income within a country based on observing only some incomes, or to see if treatment A is better than treatment B in a clinical trial based on only a few hundred people. It will be obvious that there are dangers in this: if you collect income data solely from people who work in the City of London you are likely to obtain a biased result, and likewise if you give treatment A preferentially to the sicker people. Separate branches of statistics, notably survey sampling and experimental design, are concerned with how best to collect data to avoid such problems – who you should approach in a social survey, who should get which treatment in a clinical trial and so on.
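The income example can be sketched in a few lines of Python. The population and income figures below are invented purely for illustration: a large group on modest incomes plus a small group of high earners, standing in for the City of London workers mentioned above.

```python
import random

random.seed(0)

# Hypothetical population: 9,000 people on modest incomes plus
# 1,000 high earners (figures are illustrative, not real data).
population = [30_000 + random.gauss(0, 5_000) for _ in range(9_000)]
population += [120_000 + random.gauss(0, 20_000) for _ in range(1_000)]

true_mean = sum(population) / len(population)

# A proper random sample approximates the population mean...
random_sample = random.sample(population, 500)
random_mean = sum(random_sample) / len(random_sample)

# ...whereas sampling only the high-earning group, as in the
# City of London example, gives a badly biased estimate.
biased_sample = random.sample(population[9_000:], 500)
biased_mean = sum(biased_sample) / len(biased_sample)
```

Running this, the random sample lands close to the true average income, while the sample drawn only from high earners overestimates it by a wide margin, which is exactly the danger survey sampling is designed to guard against.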
Statistics is often seen as merely concerned with aggregate phenomena, summarising masses of data, but many applications are very much about the individual: data collected from the many are certainly summarised, but then the summary is combined with data about the individual to inform decisions about that individual. For example, to determine which treatment is most likely to benefit someone, or to determine whether someone is in a high-risk category for insurance.
Statistics has sometimes suffered from a bad press in the past – you will have heard the old remark about “lies, damned lies, and statistics”. The truth, however, is that, while yes, it is possible to lie with statistics, it is a damned sight easier to lie without them.
David Hand FBA is Emeritus Professor of Mathematics at Imperial College London. He was elected a Fellow of the British Academy in 2003. His books include Statistics: A Very Short Introduction, The Improbability Principle and Dark Data.