A typical speech or text does not consist of a random set of unrelated sentences. Generally, the author (or speaker) starts talking about one thing and continues talking about it for a while. But although this tends to be true, nothing in the text itself guarantees it:
This is my brother John. He is very tall. He graduated from high school last year.
We usually assume this is a story about a single person: someone who is tall, a recent high school graduate, named John, and the brother of the speaker. But it could just as well have been about three different people. Although humans are very good at telling which part of a story relates to which other part, it turns out to be very difficult to explain how we know. We just do.
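To see why this is hard to spell out, consider what a computer program would actually have to do. Below is a deliberately naive sketch I made up for illustration (it is nobody's real algorithm): link each pronoun to the most recently mentioned proper name.

```python
# A toy anaphora resolver: link each pronoun to the most recently
# mentioned proper name. This happens to work on the "brother John"
# story, but it breaks as soon as a text mentions a second person.

PRONOUNS = {"he", "she", "it", "they"}

def resolve_pronouns(tokens):
    """Return (pronoun_index, guessed_antecedent) pairs."""
    last_name = None
    links = []
    for i, tok in enumerate(tokens):
        if tok.lower() in PRONOUNS:
            links.append((i, last_name))   # guess: most recently seen name
        elif tok.istitle() and tok.lower() not in {"this", "the"}:
            last_name = tok                # crude proper-name detector
    return links

story = "This is my brother John . He is very tall . He graduated last year .".split()
print(resolve_pronouns(story))  # [(6, 'John'), (11, 'John')]
```

The heuristic gets the right answer here only by luck; add a second character and it falls apart. That gap between "we just do" and an explicit procedure is exactly the problem.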
This is a challenge both to psychologists like myself and to people who try to design computer programs that can analyze text (whether for machine translation, text summarization, or any other application).
The materials for research
A group at the University of Essex put together an entertaining new Web game called Phrase Detectives to help develop new materials for cutting-edge research into this basic problem of language. Their project is similar to my ongoing Dax Study, except that theirs is not so much an experiment as a method for developing the stimuli.
Phrase Detectives is set up as a competition between users, and the result is an entertaining game that you can play as much or as little as you choose. Other than its origins, it looks a great deal like many other Web games. The game speaks for itself, and I recommend that you check it out.
What’s the point?
Their Wiki provides some useful details about the purpose of this project, but since it is geared more towards researchers than the general public, it could probably use some translation of its own. Here's my attempt:
The ability to make progress in Computational Linguistics depends on the availability of large annotated corpora…
Basically, the goal of Computational Linguistics (and the related field, Natural Language Processing) is to come up with computer algorithms that can “parse” text — break it up into its component parts and explain how those parts relate to one another. This is like a very sophisticated version of the sentence diagramming you probably did in middle school.
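If you want a concrete sense of what a modern "sentence diagram" looks like, the sketch below uses the spaCy library (one popular option, my choice here rather than anything the Phrase Detectives team uses; it assumes you have installed spaCy and its small English model) to print a dependency parse of our example sentence.

```python
# Requires: pip install spacy && python -m spacy download en_core_web_sm
import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("This is my brother John.")

# Each token gets a grammatical role (dep_) and a head word it attaches
# to; together these form the sentence's parse tree.
for token in doc:
    print(f"{token.text:10} {token.dep_:10} -> {token.head.text}")
```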
Developing and testing new algorithms requires a lot of practice materials (“corpora”). Most importantly, you need to know the correct parse (sentence diagram) for each of your practice sentences. In other words, you need “annotated corpora.”
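What an "annotation" looks like varies from project to project, but for anaphora it boils down to recording which expressions refer to the same thing. Here is a hypothetical, simplified record for our example; the exact layout is my own invention for illustration, though real formats (CoNLL-style files, for instance) encode the same information.

```python
# A hypothetical annotation record for the "brother John" story.
# Spans are (start, end) character offsets into the text.
annotated_example = {
    "text": "This is my brother John. He is very tall.",
    "markables": {
        "m1": {"span": (8, 23), "surface": "my brother John"},
        "m2": {"span": (25, 27), "surface": "He"},
    },
    # The gold-standard answer an algorithm is scored against:
    # m1 and m2 refer to the same person.
    "coreference_chains": [["m1", "m2"]],
}
```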
…but creating such corpora by hand annotation is very expensive and time consuming; in practice, it is unfeasible to think of annotating more than one million words.
One million words may seem like a lot, but it isn't, really. One of the complaints about one of the most famous word frequency corpora (the venerable Francis & Kucera) is that many important words never appear in it at all. If you take a random set of 1,000,000 words, very common words like a, and, and the take up a fair chunk of that set; in the Brown corpus counts that Francis & Kucera report, the alone accounts for roughly 7% of all tokens.
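You can verify the "fair chunk" claim on any sizable text you have lying around. A minimal sketch in plain Python (the filename is a placeholder; point it at any large text file):

```python
# Count word frequencies and show how much of the total the handful
# of most common words consume.
import re
from collections import Counter

with open("any_large_text.txt", encoding="utf-8") as f:
    words = re.findall(r"[a-z']+", f.read().lower())

counts = Counter(words)
total = len(words)
top = counts.most_common(10)

print(f"{total} tokens, {len(counts)} distinct words")
for word, n in top:
    print(f"{word:8} {n:8} {n / total:6.1%}")
print(f"top 10 words cover {sum(n for _, n in top) / total:.1%} of all tokens")
```

On typical English prose, the top ten words (the, of, and, a, and so on) cover a quarter or more of all tokens, which is why a million-word corpus contains far fewer rare words than you might expect.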
Also, consider that when a child learns a language, that child hears or reads many, many millions of words. If it takes that many for a human who is genetically programmed to learn language, how many should a computer algorithm need? (Computers are more advanced than humans in many areas, but in the basic areas of human competency, such as vision and language, they are still shockingly primitive.)
However, the success of Wikipedia and other projects shows that another approach might be possible: take advantage of the willingness of Web users to collaborate in resource creation. AnaWiki is a recently started project that will develop tools to allow and encourage large numbers of volunteers over the Web to collaborate in the creation of semantically annotated corpora (in the first instance, of a corpus annotated with information about anaphora).
This is, of course, what makes the Web so exciting. It took a number of years for it to become clear that the Web was not just a way of doing the same things we always did, only faster and more cheaply, but a platform for doing things that had never even been considered before. It has had a deep impact on many areas of life, cognitive science research being just one.