I like to believe we are running some of the more innovative experiments on the Web. Occasionally, however, I run across true masters of the form.
One of my current favorites is Games With a Purpose
. The avowed purpose of the site is to “make computers smarter.” From what I can tell, they are mainly focusing on labeling. As I have written before
, we often think that children learn words like “dog” by seeing a dog while their mother or some other adult says “dog.” It turns out that this kind of learning is actually very, very difficult.
In fact, it has been shown that a very smart baby who used impeccable logic — that is, a baby who was like a well-programmed computer — would in fact never learn any words at all. We must have some innate biases that allow us to learn words.
This brings us to the issue of GoogleImage. Google would like to help us find pictures on the web. But we search by typing in a phrase. To make this work, then, Google needs to figure out what words would best describe any given picture. As I just mentioned above, this is something humans do well but at which computers are hopeless.
Unfortunately, neither Google nor GWAP.com have much information that I can find about how exactly this information is used. One obvious possibility is that the labels they derive from the players go directly into the Google database. Another possibility is that the researchers are using the information derived in these games to create data sets with which they can train computers using learning algorithms.
So what do I think is cool about this stuff? It’s an application of very simple Web technology to do something that is difficult or impossible to do without the Internet.
It should hopefully be clear that the problems being solved here require massive amounts of labor. Labeling all the images on the Internet is way beyond the capabilities of even the most dedicated engineer. However, if everybody in the world labeled a handful, that would be tens or hundreds of billions of labels. Similarly, even if one is not trying to label all the images on the web and only trying to provide training data for a computer model, such models require massive data sets.