GoogleScience

A Google search can help you find cutting-edge research. A Google search can also be cutting-edge research.

Many questions in linguistics (the formal study of language) and psycholinguistics (the study of language as human behavior) are answered by turning to a corpus. A corpus is a large collection of texts and/or transcripts. Just a few years ago, corpora were difficult and expensive to create. Arguably the most popular word frequency corpus in English, the venerable Brown Corpus, was based on one million words of text. One million words sounds like a lot, but so many of those are “the” and “of” that in fact many words do not appear in the corpus at all.

The Google corpus contains billions of pages of text.

So what does one do with a corpus? One obvious thing is to figure out which words are more common than others. The most common words in English are short function words like “a” (found on 4.95 billion web pages) and “of” (3.61 billion pages). Google, of course, doesn’t tell you how many times a word appears, only how many pages it appears on. That may actually be an advantage, since we’re typically more interested in a word that shows up on many websites than in a word that shows up many, many times on just one website.
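To make the difference between the two kinds of count concrete, here is a minimal sketch in Python. The three "pages" are invented toy text, not real search data, and the function names are my own; this is only an illustration of the idea, not how Google computes anything.

```python
from collections import Counter

# Toy "web": each string stands in for one page (invented text, not real data).
pages = [
    "the cat sat on the mat",
    "the dog chased the cat",
    "banana banana banana banana",
]

def term_frequency(pages):
    """Count every occurrence of each word across all pages."""
    counts = Counter()
    for page in pages:
        counts.update(page.lower().split())
    return counts

def page_frequency(pages):
    """Count how many pages each word appears on at least once."""
    counts = Counter()
    for page in pages:
        # A set keeps one copy of each word, so a page counts only once per word.
        counts.update(set(page.lower().split()))
    return counts

tf = term_frequency(pages)
pf = page_frequency(pages)

print("banana:", tf["banana"], "occurrences on", pf["banana"], "page(s)")
print("cat:", tf["cat"], "occurrences on", pf["cat"], "page(s)")
```

In this toy example “banana” occurs more often than “cat” in total, but only on a single page, so a page count ranks “cat” as the more widespread word. That is the sense in which counting pages can be an advantage.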

You might be interested in what the most common noun is. Is it “time,” “man,” “city,” “boy,” or “Internet?” You might check to see whether verbs or nouns are more common in English by comparing a large sample of verbs and nouns. You can also compare across languages. Children learn verbs more slowly than nouns in English but not in Chinese. Is that because verbs are more common in Chinese than English?

OK, those are fun experiments, but none of them sound very cutting-edge. If you want to see the Google corpus in action, check out Language Log. The writers there regularly turn to the Google corpus to answer their questions. Google is probably less commonly used in more formal contexts, but the PsycINFO database turned up 76 hits for “Google.” Many were studies about how people use Google, but some were specifically using the Google corpus, such as “Building a customised Google-based collocation collector to enhance language learning,” by Shesen Guo and Ganzhou Zhang. Another, “Nine psychologists: mapping the collective mind with Google” by Jack Arnold, looked at the organization of conceptual knowledge. At a recent conference, I saw a presentation by vision scientists using Google Image to explore the organization of visual memory. I expect to see more and more of this type of research in the near future.

Of course, there’s nothing specific to Google about this. It’s just what everybody seems to use.

October 4, 2007

9 Responses to GoogleScience

  1. Anonymous April 9, 2008 at 12:16 pm #

    Google is providing numerous tools like the Google corpus to the online community, and that is an indication that it is moving beyond its image as a search engine company. Google’s success as a great technology company depends on product designs that are optimized for user experience.

    As some of the linguistic experts have noted, the product (the Google corpus) may not meet their expectations, but any innovation starts like that. Instead of just complaining about the product, how many linguistic experts are willing to team up with Google to build the next-generation Google corpus?

  2. Anonymous October 6, 2007 at 2:11 pm #

    Although there is some minor attempt to understand questions, Google is not a question answering system. The best it can do for you is to take a bunch of keywords, tell you which documents match those keywords, and then display the list to you in an order sorted by the “popularity” of the websites providing the matches.

    True question answering systems do a lot more than this and are correspondingly SLOWER, although brainboost.com gives a nice quick response to real questions. Try languagecomputer.com for a different question answering system. Ask.com tries to answer questions, but I haven’t seen it be very good. I like Google better than Ask even though Google doesn’t really try to answer questions.

    We’re still a long way from high-functioning question answering systems, and even if we weren’t, there are a lot of reasonable questions for which there is no document that actually provides an answer.

  3. Anonymous October 6, 2007 at 8:53 am #

    You propose some thoughtful ideas but I disagree with your specifics. I think that what you really want, and don’t we all, is accuracy of information and the reliability of finding that information quickly. What you are wanting is for the first 20-30 sites of the millions that are out there to be the first ones presented. Even better than this is if these 30 would provide contrasting opinions with equal validity and sources of data, providing you an efficient way to weigh alternatives and make an informed decision. This calls for a more specific type of search engine. Some services already exist to provide this more detailed search but more are needed. Expecting a general engine to provide this precisely accurate information is unrealistic. It is akin to expecting Wikipedia to provide scientifically backed entries.

    I’ll give kudos to Google for providing a remarkable engine. In the early days of the web it was significantly harder to find information of merit. Google’s statistical feedback has helped to bring some of those 30 articles to the top of the list. We do need to prevent ourselves from just accepting those top choices as the robust data becomes more easily located. But I also believe that with time, general search engines, whether or not it is Google, will be able to provide accurate, timely results with minimal search parameters.

  4. coglanglab October 5, 2007 at 6:07 am #

    OK, I just read the first of those Veronis articles about the Google index. It was some interesting research, but I think he came to the wrong conclusions.

    For those who haven’t read it, he is puzzled by the fact that sometimes if you search for A or B you get fewer counts than you do if you search for just A. He develops the hypothesis that this is because Google doesn’t index as many pages as it says it does.

    He then tests this hypothesis in an interesting way. He picks a list of English words and then searches for them in all the Web or just in English. He finds that the “just English” search finds only about half as many pages.

    I didn’t actually read the end of the post, because at this point I got distracted trying to repeat his results. You can repeat them, to an extent. If you look for “accumulated” in English only, you find half as many pages as on the entire Web. What happens if you look for “accumulated” in Japanese pages? You find a huge number of examples! (You also find a decent number on Russian pages, the only other language I tried.)

    What’s going on? Do those Japanese pages really have the word “accumulated” on them? Yes, they do. You can look through the search results yourself. I haven’t tried every language or done the math, but I am willing to bet that those “missing” pages with “accumulated” will be accounted for in other languages.

    Is this strange? Not really. My experience in a range of other countries (Russia, Spain, Taiwan, China, Mexico, Japan) is that a great deal of English has been incorporated into the texts of other languages. While it is unusual to read an article in English and come across a long quote in a foreign language, that’s not so uncommon in some other countries. (If you want literary examples, long swaths of War and Peace were written in French.)

  5. coglanglab October 5, 2007 at 5:50 am #

    Some very good points were raised above. Search engine corpora are not a perfect source. There are some over-represented words. Also, you can’t tease apart different meanings of some words (like the “java” example above).

    But none of our other sources are perfect either. The gold-standard Brown Corpus was carefully compiled from a set of “representative” sources, but the truth is that in many projects I’ve done I’ve found the Brown Corpus too unreliable to use.

    So I definitely wouldn’t read my post as saying “Google solves all problems.” It doesn’t. But if we stopped doing science until we had the perfect materials, we’d still be waiting to do Newton’s experiments. You work with the best you have, and for some questions, Google is the best available option.

    You certainly can ask whether, even in those cases, Google gives you sufficiently reliable information as to be useful. I think it does, but I’m willing to hear arguments otherwise. I look forward to reading the articles linked to above.

    Please try my web-based experiments

  6. Robert A Cook PE October 5, 2007 at 9:07 pm #

    Well I’ve never found it to be in error – It certainly is more accurate (through its “spell-assumer-correcter”) than my atrocious typing.

    What it finds is websites using (in the correct format) the words you enter.

    Anything past that simple statement relies on the judgement & patience of the user. I would not rely on “counts” or “hits” or even number of websites to matter – bluntly, who cares?

    The only relevant answer is: Did the search engine (Google or a Library of Congress search or an inter-library search of research papers) introduce you to authors who produce consistent and accurate answers to the question you want answered? There is no “popularity” in scientific answers – only accuracy of results.

    I don’t care about how many results were returned – only about whether my question on ice or geomagnetism or star formation or atomic nuclei or gamma rays or biology & cells was answered. Accurately and without bias. On any one of the 25 or 30 sites of the millions returned.

    We used to do searches by library cards one book at a time. Were those searches more accurate than having millions of documents (any 90% of which may be wrong!) available at the push of a mouse key?

  7. Anonymous October 4, 2007 at 1:02 pm #

    In the old days, people had questions and would say “I guess some day we’ll ask God when we get to heaven.” I joke now that the wait is over– questions? Just ask Google. Google has an opinion on anything that is a hot topic or debatable.

    …Bernie
    http://www.sciligion.org

  8. Anonymous October 4, 2007 at 11:22 am #

    Try this: Google “java”

    What you get is not an island in the Pacific or a species of coffee bean. What you get is good solid evidence that the billions of pages stuffed into Google are tainted data representing a very narrow demographic. How could any meaningful language research be based on the usage by such a tiny cross-section? A billion pages all written by milk truck drivers from Minneapolis is hardly an indication applicable to the broader population.

    And then there are the many pages written by bots. Try the Google “blog” function for any common term: most of the pages you hit are troll sites for casino ads (or worse). Many other terms encounter the word-salad pages trolling for keywords, also to lure clickthroughs into paid page-impressions. It would take a mighty fine bit of AI coding to create a word-sifting algorithm that could account for this noise!

  9. Anonymous October 4, 2007 at 9:53 am #

    I manage the linguistics department of a software company; we use corpora and webskimming in a multitude of languages on a daily basis. We also use Google for a lot of minor research tasks. For a while we played with using web counts from both Google and Yahoo and found that they were just not reliable enough to provide useful data. Turns out there are all kinds of finicking things that the engines do that mess up the results (not the least of which is throwing away all the diacritics!).

    Jean Véronis of http://aixtal.blogspot.com/ has quite extensive commentary on the search engines and their counts which makes for some very interesting reading. Here are a couple of the posts:

    http://aixtal.blogspot.com/2005/02/web-googles-missing-pages-mystery.html

    http://aixtal.blogspot.com/2005/08/yahoo-19-billion-pages.html

    Myself, I’d be skeptical of any conclusions drawn from using “google as corpus” unless the questions being asked are not susceptible to the idiosyncrasies of the engine.