As part of an effort to anticipate — and thwart — the plans of potential terrorists, the Federal Aviation Administration is supporting the development of a new search engine by University at Buffalo researchers that is designed to detect “hidden” information that can be gleaned from public Web sites.
Once the technique is developed and validated, it has the potential to make the Web searches that the public performs daily far more effective in locating meaningful information on the Internet.
The UB team recently completed an initial prototype system, designed explicitly to enable searches for “hidden” information within the 9/11 Commission Report.
The system permits users to find the best trail of evidence through many documents that connects two or more apparently unrelated concepts.
Funded by the FAA, as well as by the National Science Foundation specifically for anti-terrorism applications, the UB project is based on Unintended Information Revelation, or UIR, a search technique designed to uncover hidden information.
The premise of UIR is that pieces of information that appear innocent by themselves may, when linked together, inadvertently reveal highly sensitive data.
The need for such a tool arose after 9/11 when the FAA started focusing on information being disseminated on its Web site.
“It couldn’t tell if it was possible to infer things that the FAA doesn’t want others to infer by putting together data from this page and that page and that page,” said Rohini Srihari, Ph.D., UB professor of computer science and engineering, who is developing the new search engine with her colleagues in the Center of Excellence in Document Analysis and Recognition in the UB School of Engineering and Applied Sciences.
Existing search engines score each document individually, based on the number of times a keyword appears in that document, she explained.
In contrast, UIR is based on the construction of concept chain graphs that search for the best path connecting two concepts within a multitude of documents.
“A concept chain graph will show you what’s common between two seemingly unconnected things,” Srihari said.
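The idea of a concept chain graph can be illustrated with a minimal sketch. The article does not describe the UB team's actual algorithm, so everything here is an assumption made for illustration: the toy corpus in `DOCS` is hypothetical, two concepts are linked whenever they appear in the same document, and "best path" is taken to mean the path with the fewest links, found by breadth-first search.

```python
from collections import defaultdict, deque
from itertools import combinations

# Hypothetical toy corpus: document id -> concepts mentioned in it.
DOCS = {
    "doc1": {"airport", "fuel depot"},
    "doc2": {"fuel depot", "delivery schedule"},
    "doc3": {"delivery schedule", "gate code"},
    "doc4": {"airport", "parking"},
}

def build_graph(docs):
    """Link two concepts whenever they appear in the same document."""
    graph = defaultdict(dict)  # concept -> {neighbor: supporting doc id}
    for doc_id in sorted(docs):
        for a, b in combinations(sorted(docs[doc_id]), 2):
            graph[a].setdefault(b, doc_id)
            graph[b].setdefault(a, doc_id)
    return graph

def concept_chain(graph, start, goal):
    """Breadth-first search for the shortest chain between two concepts.

    Returns [(concept, doc id, concept), ...] or None if unconnected.
    """
    parents = {start: None}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        if node == goal:
            break
        for nxt in graph[node]:
            if nxt not in parents:
                parents[nxt] = node
                queue.append(nxt)
    if goal not in parents:
        return None
    # Walk back from the goal to the start, collecting the evidence documents.
    chain = []
    node = goal
    while parents[node] is not None:
        prev = parents[node]
        chain.append((prev, graph[prev][node], node))
        node = prev
    return list(reversed(chain))

GRAPH = build_graph(DOCS)
TRAIL = concept_chain(GRAPH, "airport", "gate code")
for left, doc_id, right in TRAIL:
    print(f"{left} -> {right}  (evidence: {doc_id})")
```

No single document in this toy corpus mentions both "airport" and "gate code," yet the chain through three documents connects them. That is the kind of link a per-document keyword search would not surface.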
UIR is designed to automatically detect the “hidden” revelation of sensitive information.
At the same time, Srihari’s NSF research is geared toward developing the core algorithms that expose hidden paths running through numerous documents that may have been generated by different individuals or organizations.
While a single Web site or document may not reveal malicious intentions, a concept chain graph may reveal such intentions “hidden” among numerous documents.
“With regular searches, the input is a set of key words,” Srihari explained. “The search produces a ranked list of documents, any one of which could satisfy the query.
“UIR, on the other hand, is a composite query, not a keyword query. It is designed to find the best path, the best chain of associations between two or more ideas. It returns to you an evidence trail that says ‘This is how these pieces are connected.’”
To develop the method, the UB researchers used the chapters of the 9/11 Commission Report to establish concept ontologies — lists of terms of interest in the specific domains relevant to the researchers: aviation, security and anti-terrorism issues.
According to Srihari, the key was coming up with a sophisticated content representation method for processing, or mining, text.
“UIR is an example of text mining, going across documents and uncovering things that are not apparent to the user,” she said.
One search the UB researchers used to test their prototype involved exploring the chapters in the 9/11 Commission Report for connections among three terms that they knew were linked: “Hamburg,” “San Diego” and “imam” (a Muslim leader).
Srihari explained that the model generated by the system on the basis of the 9/11 corpus found that terrorists Ramzi Binalshibh and Mohamed Atta shared apartments in Hamburg, Germany; that Atta and Nawaf al Hazmi were hijackers involved in the 9/11 attacks; and that Hazmi found an apartment in San Diego with the help of Anwar Aulaqi, an imam at a mosque in San Diego.
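The three-term query above can be sketched in the same spirit. This is an illustration only, not the prototype's method: the statements in `FACTS` are paraphrased from the example and reduced by hand to the concepts they mention, and the choice to stitch together shortest chains between consecutive query terms, in the order given, is an assumption.

```python
from collections import deque

# Statements paraphrased from the example above, each reduced by hand
# to the concepts it mentions (an illustrative simplification).
FACTS = {
    "Atta shared apartments in Hamburg": {"Hamburg", "Atta"},
    "Atta and Hazmi were 9/11 hijackers": {"Atta", "Hazmi"},
    "Hazmi found an apartment in San Diego": {"Hazmi", "San Diego"},
    "An imam at a San Diego mosque helped with the apartment": {"San Diego", "imam"},
}

def neighbors(concept):
    """Yield (other concept, supporting statement) pairs for one concept."""
    for statement, concepts in FACTS.items():
        if concept in concepts:
            for other in sorted(concepts - {concept}):
                yield other, statement

def shortest_chain(start, goal):
    """Breadth-first search returning the statements that link start to goal."""
    queue = deque([(start, [])])
    seen = {start}
    while queue:
        concept, trail = queue.popleft()
        if concept == goal:
            return trail
        for other, statement in neighbors(concept):
            if other not in seen:
                seen.add(other)
                queue.append((other, trail + [statement]))
    return None

def evidence_trail(terms):
    """Stitch together chains between consecutive query terms (assumed order)."""
    trail = []
    for start, goal in zip(terms, terms[1:]):
        leg = shortest_chain(start, goal)
        if leg is None:
            return None
        trail.extend(s for s in leg if s not in trail)
    return trail

RESULT = evidence_trail(["Hamburg", "San Diego", "imam"])
for step, statement in enumerate(RESULT, start=1):
    print(step, statement)
```

The output is an ordered list of statements rather than a ranked list of documents, which mirrors the “evidence trail” Srihari describes.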
“The concept chains show you what may be of interest, but the real intelligence here is gleaned from looking for patterns of interest,” said Srihari. “Once a pattern of interest is identified, then you can ask, ‘Are there more patterns like this?’”
A more robust prototype is expected to be delivered to the FAA for evaluation by the end of the year.
Eventually, the UB search tool may also be used for other applications, such as helping biomedical researchers conduct more effective investigations into the connections between genes, proteins and disease.
Sudarshan Lamkhede, Anmol Bhasin and Wei Dai, graduate students in the UB Department of Computer Science and Engineering, and Nick Schwartzmeyer, a graduate student in the Department of Linguistics in the College of Arts and Sciences, are working with Srihari on the project.
The University at Buffalo is a premier research-intensive public university, the largest and most comprehensive campus in the State University of New York.