
Enabling computers to understand language might help users to overcome online information overload.

Searching for information on the internet has become second nature. Yet many searches return too many documents, and it takes time to examine each one and find the information we need. Scientists are particularly familiar with this problem when searching the scientific literature. How can professional curators, who are trained to identify information in a scientific article and enter it into a database, mine the text more efficiently? A multidisciplinary team in Cambridge, led by Professor Ted Briscoe of the Natural Language and Information Processing (NLIP) group at the Computer Laboratory, has been investigating whether computerised methods for analysing the language of scientific articles can facilitate database curation.

Resolving ambiguity in language

Language is ambiguous: in the phrase ‘I saw Bill with the telescope’, who is holding the telescope – you or Bill? To resolve these ambiguities, people use the surrounding text and their knowledge of the world. But this is much harder for a computer to do and has been a long-standing challenge in computer science. Natural language processing (NLP) provides a way for the computer to treat each word as a piece of a puzzle. Each piece is combined with other pieces using knowledge about the structure of English and about the subject area of the text. Although not an exact science, NLP can be used to make tasks such as curation easier by attempting to overcome ambiguity in phrasing and to draw out relationships between words.
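To make the telescope example concrete, the sketch below counts how many structures a toy grammar assigns to the sentence. The grammar, the word list and the function name `count_parses` are all invented for illustration, and the chart-based counting is a standard textbook method rather than the NLIP group's own software.

```python
from collections import defaultdict

# Toy word list and grammar (assumptions for illustration only).
LEXICON = {
    "I": ["NP"], "Bill": ["NP"], "saw": ["V"],
    "with": ["P"], "the": ["Det"], "telescope": ["N"],
}
# Binary rules written as (parent, left child, right child).
RULES = [
    ("S", "NP", "VP"),
    ("VP", "V", "NP"),
    ("VP", "VP", "PP"),   # 'with the telescope' attaches to the seeing
    ("NP", "NP", "PP"),   # 'with the telescope' attaches to Bill
    ("NP", "Det", "N"),
    ("PP", "P", "NP"),
]

def count_parses(tokens):
    """Count the distinct parse trees of `tokens` under the toy grammar."""
    n = len(tokens)
    chart = defaultdict(int)  # (start, end, symbol) -> number of trees
    for i, token in enumerate(tokens):
        for symbol in LEXICON.get(token, []):
            chart[(i, i + 1, symbol)] += 1
    # Build longer spans from every way of splitting them into two shorter ones.
    for width in range(2, n + 1):
        for start in range(n - width + 1):
            end = start + width
            for mid in range(start + 1, end):
                for parent, left, right in RULES:
                    chart[(start, end, parent)] += (
                        chart[(start, mid, left)] * chart[(mid, end, right)]
                    )
    return chart[(0, n, "S")]

print(count_parses("I saw Bill with the telescope".split()))
```

Under this toy grammar the sentence receives two analyses, one for each reading, whereas ‘I saw Bill’ receives only one. Choosing between the two readings is where surrounding text and world knowledge come in.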

FlySlip

The NLIP group has teamed up with the FlyBase-Cambridge curation team in the Department of Genetics, one of the three members of the international FlyBase consortium that provides the largest database repository of genetic information on the fruit fly. The result, FlySlip, is funded by the Biotechnology and Biological Sciences Research Council (BBSRC) with the aim of developing text information extraction tools to assist FlyBase curators.

There is a strong need to develop such tools not only because of the enormous publication rate in genetics but also because of the particular difficulties associated with automatic recognition of fruit fly gene names. New gene names are constantly being introduced in the literature and many are the same as common English words such as ‘not’, ‘an’, ‘was’ and ‘if’. Andreas Vlachos and Caroline Gasperin, PhD students in the NLIP group, have been investigating the use of statistical NLP techniques for recognising gene names and for disambiguating expressions such as ‘this gene’.
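A small sketch shows why simple dictionary look-up is not enough here. It flags every word that matches one of the four gene names quoted above. The example sentence and the function name `naive_gene_hits` are made up for illustration, and this naive matcher stands in for the problem only, not for the statistical techniques the group actually uses.

```python
import re

# The four gene names quoted in the article; a real list would be far longer.
GENE_NAMES = {"not", "an", "was", "if"}

def naive_gene_hits(text):
    """Return every word in `text` that matches a gene name, ignoring case."""
    words = re.findall(r"[A-Za-z]+", text)
    return [word for word in words if word.lower() in GENE_NAMES]

# Invented sentence in which none of the matches is meant as a gene name.
sentence = "The gene was not expressed if an insertion was present."
print(naive_gene_hits(sentence))
```

Every match in this sentence is an ordinary English word, so a look-up of this kind would hand curators five false alarms in a single line of text. Telling the two uses apart requires the context of each word.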

The NLIP group has also been the first to investigate the most useful way of presenting the NLP analyses to curators. Dr Ian Lewin and Dr Nikiforos Karamanis have been working on the design and evaluation of a unique curation interface. The interface uses automatically recognised gene names and disambiguated expressions as a means of navigating the text. FlyBase curator Dr Ruth Seal has provided them with domain-specific expertise.

Beyond FlyBase curation

The FlySlip team has demonstrated that the NLP-powered interface enables curators to interact with articles quickly and efficiently. Worldwide, more than 80 curated databases similar to FlyBase exist, and similar approaches are likely to be adopted by other curation groups. In the future, NLP technology may also be used to support everyday tasks such as internet searches. So, next time you feel overwhelmed by the online ‘data deluge’, remember that a solution might be on the way.

For more information, please visit the NLIP group website (www.cl.cam.ac.uk/research/nl). The author, Dr Nikiforos Karamanis (nk304@cam.ac.uk), is a participant on the Rising Stars scheme.


This work is licensed under a Creative Commons Licence. If you use this content on your site please link back to this page.