English speakers who are 18 or under use the word ‘like’ in conversation over five times as often as speakers who are over 70; ‘because’ is the most misspelled English word globally; the word ‘love’ is said and written over six times more frequently than the word ‘hate’. We know all of this because of a multibillion-word database called the Cambridge English Corpus.


If the Cambridge English Corpus, created by Cambridge University Press, were printed on single-sided A4 paper and stacked into a tower, it would stand 600 m high, almost twice the height of the tallest building in the UK. If it were read aloud at an average reading speed, it would take 88,766 hours; at 7 hours a day, 5 days a week, that’s 49 years.
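The reading-time figure can be checked with a few lines of arithmetic. The sketch below uses the numbers quoted above and assumes, since the article does not say, a working year of 52 weeks with no holidays; the function name is illustrative only.

```python
# Figures quoted in the article
READING_HOURS = 88_766
HOURS_PER_DAY = 7
DAYS_PER_WEEK = 5

# Assumption (not stated in the article): 52 working weeks a year, no holidays
WEEKS_PER_YEAR = 52


def working_years(total_hours, hours_per_day, days_per_week, weeks_per_year):
    """Convert a total number of hours into years of regular working weeks."""
    hours_per_year = hours_per_day * days_per_week * weeks_per_year
    return total_hours / hours_per_year


years = working_years(READING_HOURS, HOURS_PER_DAY, DAYS_PER_WEEK, WEEKS_PER_YEAR)
print(f"{years:.1f} working years")
```

Under that assumption the result rounds to the 49 years given above.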

The multibillion-word Cambridge English Corpus is a constantly updated record of how English is being used today in all its forms – spoken, written, business, academic, learner and e-language. Amassed over two decades, the electronic database draws on sources that range from the more expected (books, newspapers, journals, radio, television) to the more surprising (song lyrics, junk mail, voicemail messages and recordings from flight control).

Cambridge University Press researchers use the Corpus to investigate the most common words, phrases and grammatical patterns in English, and then use the results to improve English language teaching books.

“Context in English is important,” explained Dr Claire Dembry, Language Research Manager. “We analyse patterns in language and how English changes depending on context and circumstances. For learners of English to become proficient, these sorts of subtle differences can be extremely important, and it is only by amassing a vast number of examples that our writers, lexicographers and researchers can determine how best to describe the patterns of English in our learning materials.”

It all began in the 1990s, when a few CDs of American newspapers in electronic form were loaded into a database that both stored the data and ‘queried’ it, working out the relationships between words. Gradually, the embryo corpus was extended with further material and, today, almost any conceivable form of English can be found in the database.

At an early stage, Cambridge University Press realised that knowing which features of English learners find difficult is just as important as knowing how the language is being used, and decided to collect evidence of both. “This decision, which led to the Cambridge Learner Corpus, had far-reaching effects and has become probably the single most important unique selling point for the Press’s English Language Teaching publishing,” said Ann Fiddes, Global Language Research Manager.

It turns out that because (misspelled as becouse), which (wich), accommodation (accomodation), advertisement (advertisment) and beautiful (beatiful) are the five words most commonly misspelled by learners globally.

Arriving at conclusions like this has taken years of painstaking work: identifying the misspellings and grammatical errors made in Cambridge English Language Assessment examinations, and tagging them with computer-readable codes in the Cambridge Learner Corpus.

Comprehensive information about the learners who originally wrote the exam scripts – first language, nationality, age, gender, scores, and so on – is stored. These data, along with the ‘error tagging’, have enabled Cambridge University Press to publish materials that directly address the types of error typical of individual markets and individual language groups.
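To give a rough idea of how error tags and learner information can be combined, here is a minimal sketch. Everything in it is an assumption made for illustration: the record layout, the error codes "SPELL" and "GRAMMAR", the language labels and the toy scripts are invented, and the article does not describe the actual coding scheme used in the Cambridge Learner Corpus. Only the misspellings themselves are taken from the list above.

```python
from collections import Counter, defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class TaggedError:
    code: str       # hypothetical error code, e.g. "SPELL"
    written: str    # what the learner wrote
    corrected: str  # the intended form


@dataclass(frozen=True)
class Script:
    first_language: str  # one piece of learner metadata
    errors: tuple        # TaggedError items found in the script


def errors_by_language(scripts, code):
    """Count corrected forms for one error code, grouped by first language."""
    counts = defaultdict(Counter)
    for script in scripts:
        for err in script.errors:
            if err.code == code:
                counts[script.first_language][err.corrected] += 1
    return counts


# Toy data, invented for illustration; not drawn from any real corpus
scripts = [
    Script("Language A", (TaggedError("SPELL", "becouse", "because"),
                          TaggedError("SPELL", "wich", "which"))),
    Script("Language A", (TaggedError("SPELL", "becouse", "because"),
                          TaggedError("GRAMMAR", "he go", "he goes"))),
    Script("Language B", (TaggedError("SPELL", "accomodation", "accommodation"),)),
]

spelling = errors_by_language(scripts, "SPELL")
for language, counter in spelling.items():
    print(language, counter.most_common())
```

Grouping by first language is only one option; the same loop could group by any other stored attribute, such as nationality or age.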

“This is hugely important for the Press and has meant that we have, for example, been able to publish the successful English for Spanish Speakers editions of global products, and become the market leader in Corpus-based publishing,” explained Fiddes.

Now, Cambridge University Press and Cambridge English Language Assessment have joined forces and set their sights on academic English.

The Cambridge English Corpus already contains over 400 million words of academic English – the largest and most extensive collection of its kind.  It takes as its source written and spoken academic language at undergraduate, postgraduate and professional level from a range of academic disciplines and worldwide institutions. New research is pulling in data from sixth-form students as well as other academic levels, covering a much wider range of disciplines, genres and language backgrounds.

“Some interesting patterns have already emerged,” said Fiddes. “In our collection of academic English samples, the size adjectives significant, considerable, substantial and serious are much more frequent than big, massive, enormous and tremendous. In spoken English, however, big tops the list. We also found that in academic English, verbs such as solve, pose, face, resolve, tackle and circumvent frequently occur with the noun problem. These kinds of insights help us to develop a better understanding of the language skills needed by students at English-speaking universities.”
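A pattern such as "verbs that frequently occur with the noun problem" comes from counting co-occurrences across many sentences. The sketch below shows one simple way to do that and is not the researchers' method: it counts candidate verbs that appear within a few words before the noun. The function name, the window size and the three sample sentences are all invented for illustration, and only exact word forms are matched (so "solves" would not count as "solve").

```python
import re
from collections import Counter


def collocates(sentences, node, candidates, window=3):
    """Count candidate words found within `window` tokens before `node`."""
    counts = Counter()
    for sentence in sentences:
        tokens = re.findall(r"[a-z]+", sentence.lower())
        for i, token in enumerate(tokens):
            if token != node:
                continue
            start = max(0, i - window)
            for neighbour in tokens[start:i]:
                if neighbour in candidates:
                    counts[neighbour] += 1
    return counts


# Verbs listed in the article
VERBS = {"solve", "pose", "face", "resolve", "tackle", "circumvent"}

# Toy sentences, invented for illustration; not drawn from any corpus
sentences = [
    "We solve the problem by induction.",
    "These results pose a problem for the theory.",
    "Researchers face a problem of scale and solve the problem with sampling.",
]

counts = collocates(sentences, "problem", VERBS)
print(counts.most_common())
```

On a real corpus the counts would normally be compared with how often each verb occurs overall, so that very common words do not dominate the list.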

As part of its current research, the team welcomes contributions of academic English to the corpus, and invites anyone interested in participating to contact them for more information (www.cambridge.org/camcae).

“Corpus work is very closely linked with advances in technology and we are investigating automating many of our manual systems, such as error tagging and speech transcription,” added Fiddes. “Our research has already allowed us to partially automate the mark up of errors in learner writing.

“These technologies will increase the speed at which we can maintain our grasp on what English is now, and what it might be in the future.”

For more information about the Cambridge English Corpus, please visit www.cambridge.org/corpus


This work is licensed under a Creative Commons Licence. If you use this content on your site please link back to this page.