Big data has captured the world’s attention, with talk of a new Industrial Revolution based on information, and of data being one of the 21st century’s most valuable commodities. Today, we commence a month-long focus on research that uses, produces and interrogates huge datasets.

Paul Alexander and Clare Dyer-Smith, Cambridge Big Data

Our unprecedented ability to collect, store and analyse data is opening up new frontiers in science and the humanities, from extending our knowledge of how the universe is built, to creating new understanding of the genetic basis of disease, to discovering the impact of schools on pupil achievement.

It’s causing us to challenge not only long-held ideas about what is possible in research, but also to reflect on the value that we place upon ever-increasing quantification and the effect of pervasive data collection on our role as citizens.

‘Big Data’ has also been highlighted by the UK government as among the country’s ‘Eight Great Technologies’ that will help drive economic growth.

But what actually is big data? Collecting and analysing data on individuals, societies and all aspects of the natural world is routine in research. So why the current preoccupation with the ‘bigness’ of data?

Part of this is the sheer deluge of data that we are now able to collect and store. Back in 2010, Google CEO Eric Schmidt declared that every two days we create as much information as we did from the dawn of civilisation up until 2003. And that was five years ago.

Of course, size doesn’t always matter – sometimes big data can mean large datasets that are incredibly messy, with missing or corrupt information, and that require complex mathematical algorithms to make sense of them.

Another aspect of the current interest in big data is the realisation that just because we can collect data doesn’t mean we are making the best use of it. In fact, big data is often described as data exceeding our ability to handle it, and for which new analytical methods are required to extract useful information. But this is clearly a moving target, and research is urgently needed to keep up.

Recognising this, the government announced earlier this year that the new £42 million Alan Turing Institute, to be based in London and headed by the universities of Cambridge, Edinburgh, Oxford, Warwick and UCL, will carry out research in organising, storing and interrogating big data.

A more subtle distinction that sets ‘big data’ apart from just ‘data’ is in the combination of factors that describe it, the so-called ‘Big Vs of Big Data’, including:

  • its Volume, in terms of size or the number of variables needed to describe it,
  • its Velocity, in how it’s collected and processed in real time,
  • the Variety of its diverse, linked and unstructured datasets,
  • its Veracity, in terms of where it originates and how complete it is,
  • and finally its Value, and especially how the use, linkage and re-use of data can provide crucial new insights that may have been unforeseen at the time the data was collected.

Cambridge researchers are at the forefront of solving many of the challenges that big data presents. New machine learning algorithms are being developed, to the point that machines can now carry out some very human tasks, such as image recognition, reading and annotating text, or even writing documents in plain English.

For the very biggest big data projects such as the Large Hadron Collider and the Square Kilometre Array, Cambridge researchers are designing new methods to deal with huge data volumes and clever analytics to understand the fundamental makeup of matter and the origins of the universe.

Developing new mathematics and the algorithms to handle this data explosion goes hand in hand with improving the storage and computing infrastructure underlying the big data revolution. Cambridge is home to Wilkes, one of the world’s most energy-efficient supercomputers, while scientists in the Computer Laboratory are building a test-bed data centre, where large-scale ‘experiments’ can be done in a safe environment to optimise the data centres of the future.

And what of the data that we personally generate in our everyday lives? Our researchers are asking what the things we share on social media could tell us about ourselves, how best to use the data captured in hospital records to improve patient treatment, and whether newly available datasets can be used to uncover corruption and misuse of public funds.

Questions of privacy, anonymity and consent are a crucial component of our research, and require engagement from governments, lawmakers, ethicists and, by consultation, the public. In a world where datasets can be linked together, where the breadcrumb trail that we leave as we go about our online and offline lives can be recorded, analysed and converted into a detailed picture of our behaviour, movements, actions and thoughts, a radical change to the way we conceive of these notions is taking place.

To make practical use of what we learn from these data, distilling and visualising the central information is key. Doing so in a way that is robust, comprehensible and actionable relies not only on advances in science but also on our statistical literacy and a critical understanding of what the data can tell us, and what is left out.

And in the rush to convert our world into pixels and numbers, perspectives from the humanities and social sciences are needed more than ever before. What are the implications of data-driven hypothesis development for the practice of science, and for how evidence is used to make policy? What room does this data-driven society leave for aspects of the world that cannot be captured by computers? How does big data supplant or shore up existing power structures?

Answering these questions requires a meeting of minds between those working with the statistics, mathematics and algorithms underlying the new field of ‘data science’, the computer scientists and engineers who are building systems to store and manage data, and those who can offer perspectives from the social sciences and humanities to ensure that big data delivers benefits in a sustainable and appropriate way. To this end, in 2013, the University created Cambridge Big Data, a Strategic Research Initiative that brings together a diverse research community to allow Cambridge to respond to the ever-increasing challenges of big data.

Professor Paul Alexander (Chair) and Dr Clare Dyer-Smith (Coordinator), Cambridge Big Data

The text in this work is licensed under a Creative Commons Attribution 4.0 International License. For image use please see separate credits above.