Grading on a curve? Why AI systems test brilliantly but stumble in real life

The headline in early 2018 was a shocker: “Robots are better at reading than humans.” Two artificial intelligence systems, one from Microsoft and the other from Alibaba, had scored slightly higher than humans on Stanford’s widely used test of reading comprehension.

The test scores were real, but the conclusion was wrong. As Robin Jia and Percy Liang of Stanford showed a few months later, the “robots” were only better than humans at taking that specific test. Why? Because they had trained themselves on readings that were similar to those on the test.

When the researchers added an extraneous but confusing sentence to each reading, the AI systems got tricked time after time and scored lower. By contrast, the humans ignored the red herrings and did just as well as before.

To Christopher Potts, a professor of linguistics and Stanford HAI faculty member who specializes in natural language processing for AI systems, that crystallized one of the biggest challenges in separating hype from reality about AI capabilities.

Put simply: AI systems are incredibly good at learning to take tests, but they still lack cognitive skills that humans use to navigate in the real world. AI systems are like high school students who prep for the SAT by practicing on old tests, but the computers take thousands of old tests and can do it in a matter of hours. When faced with less predictable challenges, though, they are often flummoxed.

“How that plays out for the public is that you get systems that perform fantastically well on tests but make all kinds of obvious mistakes in the real world,” says Potts. “That’s because there’s no guarantee in the real world that the new examples will come out of the same kind of data that the systems were trained on. They have to deal with whatever the world throws at them.”

Part of the solution, Potts says, is to embrace “adversarial testing” that is deliberately designed to be confusing and unfamiliar to the AI systems. In reading comprehension, that could mean adding misleading, ungrammatical, or nonsensical sentences to a passage. It could mean switching from a vocabulary used in painting to one used in music. In voice recognition, it could mean using regional accents and colloquialisms.

The immediate goal is to get a more accurate and realistic measure of a system’s performance. The standard approaches to AI testing, says Potts, are “too generous.” The deeper goal, he says, is to push systems to learn some of the skills that humans use to grapple with unfamiliar problems. It’s also to have systems develop some level of self-awareness, especially about their own limitations.

“There is something superficial in the way the systems are learning,” Potts says. “They’re picking up on idiosyncratic associations and patterns in the data, but those patterns can mislead them.”

In reading comprehension, for example, AI systems rely heavily on the proximity of words to each other. A system that reads a passage about Christmas might well be able to answer “Santa Claus” when asked for another name for “Father Christmas.” But it could get confused if the passage says “Father Christmas, who is not the Easter Bunny, is also known as Santa Claus.” For humans, the Easter Bunny reference is a minor distraction. For AIs, says Potts, it can radically change their predictions of the right answer.
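To make the failure concrete, here is a minimal sketch of a question-answerer that relies only on word proximity. It is a toy built for illustration, not the method used by any of the systems mentioned above; the function name `nearest_name` and the two passages are assumptions made for this example.

```python
import re

def nearest_name(passage: str, target: str) -> str:
    """Toy heuristic: return the capitalized phrase closest to `target`."""
    # Candidate answers are runs of capitalized words, e.g. "Santa Claus".
    candidates = [
        (m.start(), m.group())
        for m in re.finditer(r"[A-Z][a-z]+(?: [A-Z][a-z]+)*", passage)
        if m.group() != target
    ]
    target_pos = passage.index(target)
    # Choose the candidate whose starting offset is nearest the target's.
    return min(candidates, key=lambda c: abs(c[0] - target_pos))[1]

# Hypothetical passages modeled on the example in the text.
clean = "Father Christmas is also known as Santa Claus."
adversarial = (
    "Father Christmas, who is not the Easter Bunny, "
    "is also known as Santa Claus."
)

clean_answer = nearest_name(clean, "Father Christmas")
adversarial_answer = nearest_name(adversarial, "Father Christmas")
print(clean_answer)
print(adversarial_answer)
```

The heuristic is right on the clean passage and wrong on the second one, because the distractor sits closer to the target phrase than the true answer does. Real systems are far more elaborate, but the sketch shows how a shortcut that works on familiar data can be broken by one extra clause.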

Rethinking Measurement

To properly measure the progress in artificial intelligence, Potts argues, we should be looking at three big questions.

First, can a system display “systematicity” and think beyond the details of each specific situation? Can it learn concepts and cognitive skills that it puts to general use?

A human who understands “Sandy loves Kim,” Potts says, will immediately understand the sentence “Kim loves Sandy” as well as “the puppy loves Sandy” and “Sandy loves the puppy.” Yet AI systems can easily get one of those sentences right and another wrong. This kind of systematicity has long been regarded as a hallmark of human cognition, in work stretching back to the early days of AI.

“This is the way humans take smaller and simpler [cognitive] capabilities and combine them in novel ways to do more complex things,” says Potts. “It’s a key to our ability to be creative with a finite number of individual capabilities. Strikingly, however, many systems in natural language processing that perform well in standard evaluation mode fail these kinds of systematicity tests.”
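One way to picture a systematicity test is to generate every recombination of a small set of words and check whether a system handles all of them. The sketch below is an assumption-laden illustration: `memorizing_model` is a hypothetical stand-in that knows only the sentences it has already seen, and the scoring function is not a published benchmark.

```python
from itertools import permutations

def systematic_variants(entities, verb="loves"):
    """Build every 'X verb Y' sentence over ordered pairs of distinct entities."""
    return [(f"{a} {verb} {b}", a, b) for a, b in permutations(entities, 2)]

def memorizing_model(seen):
    """Hypothetical stand-in: names who is loved only for sentences it has seen."""
    def predict(sentence):
        return seen.get(sentence)  # None for any unseen sentence
    return predict

def systematicity_score(predict, variants):
    """Fraction of variants for which the predicted object is correct."""
    correct = sum(predict(sentence) == obj for sentence, _, obj in variants)
    return correct / len(variants)

variants = systematic_variants(["Sandy", "Kim", "the puppy"])
model = memorizing_model({"Sandy loves Kim": "Kim"})
score = systematicity_score(model, variants)
print(f"{score:.2f}")
```

A system that had learned the underlying pattern would get every variant right. The memorizing stand-in gets only the one sentence it saw, which is the kind of gap a standard test drawn from familiar data would not reveal.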

A second big question, Potts says, is whether systems can know what they don’t know. Can a system be “introspective” enough to recognize that it needs more information before it attempts to answer a question? Can it figure out what to ask for?

“Right now, these systems will give you an answer even if they have very low confidence,” Potts says. “The easy solution is to set some kind of threshold, so that a system is programmed to not answer a question if its confidence is below that threshold. But that doesn’t feel especially sophisticated or introspective.”
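The threshold approach Potts describes can be sketched in a few lines. The function name `answer_or_abstain`, the 0.7 cutoff, and the confidence scores are all assumptions chosen for illustration; they are not drawn from any particular system.

```python
def answer_or_abstain(scores, threshold=0.7):
    """Return the top-scoring answer, or None if confidence is below threshold.

    `scores` maps candidate answers to confidence values between 0 and 1.
    The default threshold of 0.7 is an arbitrary, illustrative choice.
    """
    best, confidence = max(scores.items(), key=lambda kv: kv[1])
    return best if confidence >= threshold else None

# Hypothetical confidence scores for two questions.
confident = answer_or_abstain({"Santa Claus": 0.9, "Easter Bunny": 0.1})
unsure = answer_or_abstain({"Santa Claus": 0.55, "Easter Bunny": 0.45})
print(confident)
print(unsure)
```

The sketch also shows why Potts finds this unsatisfying: the system declines to answer, but it has no way to say what information it is missing or to ask for it.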

Real progress, Potts says, would be if the computer could recognize the information it lacks and ask for it. “At the behavior level, I want a system that’s not just hard-wired as a question-in/answer-out device, but rather one that is doing the human thing of recognizing goals and understanding its own limitations. I’d like it to indicate that it needs more facts or that it needs to clarify ambiguous words. That’s what humans do.”

A third big question, says Potts, may seem obvious but has rarely been treated that way: Is an AI system actually making people happier or more productive?

At the moment, AI systems are measured mainly through automated evaluations — sometimes thousands of them per day — of how well they perform in “labeling” data in a dataset.

“We need to recognize that those evaluations are just indirect proxies of what we were hoping to achieve,” Potts says. “Nobody cares how well the system labels data on an already-labeled test set. The whole name of the game is to develop systems that allow people to achieve more than they could otherwise.”

Tempering Expectations

For all his skepticism, Potts says it’s important to remember that artificial intelligence has made astounding progress in everything from speech recognition and self-driving cars to medical diagnostics.

“We live in a golden age for AI, in the sense that we now have systems doing things that we would have said were science fiction 15 years ago,” he says. “But there is a more skeptical view within the natural language processing community about how much of this is really a breakthrough, and the wider world may not have gotten that message yet.”
