GoogleScience

A Google search can help you find cutting-edge research. A Google search can also be cutting-edge research.

Many questions in linguistics (the formal study of language) and psycholinguistics (the study of language as human behavior) are answered by turning to a corpus. A corpus is a large collection of texts and/or transcripts. Until just a few years ago, corpora were difficult and expensive to create. Arguably the most popular word frequency corpus in English, the venerable Brown Corpus, was based on one million words of text. One million words sounds like a lot, but so many of those are “the” and “of” that in fact many words do not appear in the corpus at all.

The Google corpus contains billions of pages of text.

So what does one do with a corpus? One obvious thing is to figure out which words are more common than others. The most common words in English are short function words like “a” (found on 4.95 billion web pages) and “of” (3.61 billion pages). Google, of course, doesn’t tell you how many times a word appears, only how many pages it appears on. That may actually be an advantage, since we’re typically more interested in a word that shows up on many websites than in one that shows up many, many times on just one website.
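To make the difference between the two kinds of counts concrete, here is a minimal sketch in Python. It is not how Google computes its numbers; the three "pages" are a made-up toy corpus, and the function names (`tokenize`, `term_frequency`, `page_frequency`) are hypothetical, chosen just for this illustration.

```python
import re
from collections import Counter

# Hypothetical toy "web": each string stands in for one page.
pages = [
    "the cat sat on the mat",
    "the dog chased the cat",
    "buffalo buffalo buffalo buffalo buffalo",
]

def tokenize(text):
    """Split text into lowercase words (a deliberately crude tokenizer)."""
    return re.findall(r"[a-z']+", text.lower())

def term_frequency(pages):
    """Count every occurrence of every word across all pages."""
    counts = Counter()
    for page in pages:
        counts.update(tokenize(page))
    return counts

def page_frequency(pages):
    """Count the number of pages each word appears on at least once."""
    counts = Counter()
    for page in pages:
        counts.update(set(tokenize(page)))  # set() drops repeats within a page
    return counts

tf = term_frequency(pages)
pf = page_frequency(pages)

for word in ("the", "cat", "buffalo"):
    print(word, "occurrences:", tf[word], "pages:", pf[word])
```

In this toy example, “buffalo” has the most occurrences overall, but all of them come from a single page. Counting pages instead puts “the” and “cat” ahead of it, which is the kind of result a page count like Google's would favor.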

You might be interested in what the most common noun is. Is it “time,” “man,” “city,” “boy,” or “Internet”? You might check to see whether verbs or nouns are more common in English by comparing a large sample of verbs and nouns. You can also compare across languages. Children learn verbs more slowly than nouns in English but not in Chinese. Is that because verbs are more common in Chinese than in English?

OK, those are fun experiments, but none of them sound very cutting-edge. If you want to see the Google corpus in action, check out Language Log. The writers there regularly turn to the Google corpus to answer their questions. Google is probably less commonly used in more formal contexts, but the PsycINFO database turned up 76 hits for “Google.” Many were studies about how people use Google, but some specifically used the Google corpus, such as “Building a customised Google-based collocation collector to enhance language learning,” by Shesen Guo and Ganzhou Zhang. Another, “Nine psychologists: mapping the collective mind with Google” by Jack Arnold, looked at the organization of conceptual knowledge. At a recent conference, I saw a presentation by vision scientists using Google Image to explore the organization of visual memory. I expect to see more and more of this type of research in the near future.

Of course, there’s nothing specific to Google about this. It’s just what everybody seems to use.
