Scientists to build machine translation system for obscure languages

A team of computer scientists at Johns Hopkins University has won a $10.7 million grant from the Office of the Director of National Intelligence to create an information retrieval and translation system for languages that are not widely used around the world.

Philipp Koehn, a computer science professor in JHU’s Whiting School of Engineering, is leading a group of 20 professors, research scientists, post-doctoral fellows, and doctoral students in an effort to build a system that can answer inquiries typed in English by drawing on documents written in so-called “low resource” languages, meaning languages for which relatively little written material exists.

“The biggest challenge we’re going to have with this setup is there’s not much data,” said Koehn, who has been researching machine translation for nearly 20 years and wrote the textbook Statistical Machine Translation. He is affiliated with the Whiting School’s Center for Language and Speech Processing.

Koehn said he expected that in a few weeks the DNI would send his group information on a specific language they can use to test the technology they’ve built for the task. He said that ultimately the intelligence agency is likely to choose languages for the project that may be spoken by millions of people but are not prevalent in written material, such as Kurdish, Serbo-Croatian, Khmer, Hmong, and Somali.

The DNI project starts with data. The scientists will compile online samples of the target language that have already been translated into English (about enough text to fill 10 books of 350 pages each) and begin machine analysis of language patterns. That analysis would cover sentence structure and the positions of verbs, adjectives, and other components.

Using that analysis, rather than the work of a human translator, the scientists will develop algorithms that automatically translate the target language.
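To give a rough sense of how a program can pick up translations from already-translated text alone, here is a toy sketch. It uses a classic textbook method (expectation-maximization over word co-occurrences, in the style of IBM Model 1) as a stand-in; the article does not say which methods the Johns Hopkins team will use, and the three sentence pairs, the function names, and the iteration count are all made up for illustration.

```python
from collections import defaultdict


def train_word_translations(pairs, iterations=20):
    """Estimate t(english_word | foreign_word) from sentence pairs with EM.

    `pairs` is a list of (foreign_tokens, english_tokens) tuples.
    This is a toy, IBM Model 1 style estimator, not a production system.
    """
    english_vocab = {e for _, english in pairs for e in english}
    uniform = 1.0 / len(english_vocab)
    # Start with every English word equally likely for every foreign word.
    t = defaultdict(lambda: uniform)  # keyed by (english_word, foreign_word)

    for _ in range(iterations):
        count = defaultdict(float)  # expected count of (english, foreign) links
        total = defaultdict(float)  # expected count of links per foreign word
        for foreign, english in pairs:
            for e in english:
                # Share one unit of credit for `e` among the foreign words
                # in the same sentence, in proportion to current beliefs.
                norm = sum(t[(e, f)] for f in foreign)
                for f in foreign:
                    share = t[(e, f)] / norm
                    count[(e, f)] += share
                    total[f] += share
        # Re-estimate the probabilities from the expected counts.
        t = defaultdict(
            float, {(e, f): c / total[f] for (e, f), c in count.items()}
        )
    return t


def best_translation(t, foreign_word):
    """Return the English word with the highest estimated probability."""
    candidates = {e: p for (e, f), p in t.items() if f == foreign_word}
    return max(candidates, key=candidates.get)


# Hypothetical toy corpus (German is used only because it is easy to read).
toy_pairs = [
    (["das", "Haus"], ["the", "house"]),
    (["das", "Buch"], ["the", "book"]),
    (["ein", "Buch"], ["a", "book"]),
]
table = train_word_translations(toy_pairs)
```

Calling `best_translation(table, "Buch")` on this toy corpus picks "book", even though no dictionary was supplied. With so little data the estimates are fragile, which is the difficulty Koehn describes for low-resource languages.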

The system will be designed to respond to queries that include a word or term and a topic area or “domain.” According to the DNI website, such a query could pair a term such as “zika virus” with the topic “government” or the topic “health.” The responses produced by the translation system should tell the user how the material is relevant to the query.
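The shape of such a query can be sketched in a few lines of code. This is only an illustration under stated assumptions: the `Query` and `Document` classes, the `search` function, and the sample documents are hypothetical, and the sketch presumes the foreign-language documents have already been machine-translated into English and tagged with a domain, which the article does not specify.

```python
from dataclasses import dataclass


@dataclass
class Query:
    term: str    # word or phrase typed in English, e.g. "zika virus"
    domain: str  # topic area, e.g. "health" or "government"


@dataclass
class Document:
    doc_id: str
    domain: str        # assumed to be assigned by an upstream classifier
    english_text: str  # assumed to be machine-translated output


def search(query, documents):
    """Return (doc_id, relevance note) pairs for documents matching the query."""
    results = []
    needle = query.term.lower()
    for doc in documents:
        if doc.domain != query.domain:
            continue  # wrong topic area
        hits = doc.english_text.lower().count(needle)
        if hits:
            note = (
                f"mentions '{query.term}' {hits} time(s) "
                f"in the {doc.domain} domain"
            )
            results.append((doc.doc_id, note))
    return results


# Made-up sample documents for illustration only.
sample_docs = [
    Document("doc-1", "health", "Clinics report new Zika virus cases this week."),
    Document("doc-2", "government", "The ministry approved Zika virus funding."),
    Document("doc-3", "health", "A vaccination campaign begins next month."),
]
health_results = search(Query("zika virus", "health"), sample_docs)
```

Here the same term returns different documents depending on the domain, and each result carries a short note explaining why it matched, mirroring the requirement that responses tell the user how the material is relevant.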

The intelligence agency is launching the effort to explore how such a system might work, as intelligence gathering and analysis have come to encompass ever more languages. For most languages, the agency site says, “there are very few or no automated tools available for information retrieval, or machine translation.”

The project is meant to sharply cut the time and the amount of information needed to put a translation system into use for intelligence agents, the agency says.

At this stage, the program is exploring how these systems can work, and will be set up as a competition among several research institutions: Johns Hopkins; the University of Southern California; Columbia University; and Raytheon BBN Technologies, a technology, research, and development company.

Koehn said the agency is likely to turn the results of this research over to a private company to build a system that would be used by the government.

The project starts this month and will run in three phases over the course of four years. Koehn said the government can discontinue the work at the end of either of the first two phases, which run 19 months and 17 months, respectively. The final stage is scheduled to run 14 months.
