Linguistics
Corpus colossal
Jan 20th 2005
From The Economist print edition
How well does the world wide web represent human language?

Linguists must often correct lay people's misconceptions of what they do. Their
job is not to be experts in “correct” grammar, ready at any moment to
smack your wrist for a split infinitive. What they seek are the
underlying rules of how language works in the minds and mouths of its
users. In the common shorthand, linguistics is descriptive, not
prescriptive. What actually sounds right and wrong to people, what they
actually write and say, is the linguist's raw material.

But that raw material is surprisingly elusive. Getting people to speak naturally in
a controlled study is hard. Eavesdropping is difficult, time-consuming
and invasive of privacy. For these reasons, linguists often rely on a
“corpus” of language, a body of recorded speech and writing, nowadays
usually computerised. But traditional corpora have their disadvantages
too. The British National Corpus contains 100m words, of which 10m are
speech and 90m writing. But it represents only British English, and
100m words is not so many when linguists search for rare usages. Other
corpora, such as the North American News Text Corpus, are bigger, but
contain only formal writing and speech.

Linguists, however, are slowly coming to discover the joys of a free and
searchable corpus of maybe 10 trillion words that is available to
anyone with an internet connection: the world wide web. The trend,
predictably enough, is prevalent on the internet itself. For example, a
group of linguists write informally on a weblog called Language Log.
There, they use Google to discuss the frequency of non-standard usages
such as “far from” as an adverb (“He far from succeeded”), as opposed
to more standard usages such as “He didn't succeed—far from it”. A
search of the blog itself shows that 354 Language Log pages use the
word “Google”. The blog's authors clearly rely heavily on it.

For several reasons, though, researchers are wary about using the web in more
formal research. One, as Mark Liberman, a Language Log contributor,
warns colleagues, is that “there are some mean texts out there”. The
web is filled with words intended to attract internet searches to
gambling and pornography sites, and these can muck up linguists'
results. Originally, such sites would contain these words as lists, so
the makers of Google, the biggest search engine, fitted their product
with a list filter that would exclude hits without a correct
syntactical context. In response, as Dr Liberman notes, many offending
websites have hired computational linguists to churn out syntactically
correct but meaningless verbiage including common search terms. “When
some sandbank over a superslots hibernates, a directness toward a
progressive jackpot earns frequent flier miles” is a typical example.
Such pages are not filtered by Google, and thus create noise in
research data.

There are other problems as well. Search engines, unlike the tools linguists use
to analyse standard corpora, do not allow searching for a particular
linguistic structure, such as “[Noun phrase] far from [verb phrase]”.
This requires indirect searching via samples like “He far from
succeeded”. But Philip Resnik, of the University of Maryland, has
created a “Linguist's Search Engine” (LSE)
to overcome this. When trying to answer, for example, whether a certain
kind of verb is generally used with a direct object, the LSE grabs a chunk of web pages (say a thousand, with perhaps a million words) that each include an example of the verb. The LSE
then parses the sample, allowing the linguist to find examples of a
given structure, such as the verb without an object. In short, the LSE allows a user to create and analyse a custom-made corpus within minutes.
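
As a rough illustration of the workflow just described, and not Dr Resnik's actual implementation, the sketch below builds a tiny custom corpus for one verb and splits it by whether the verb takes a direct object. Everything in it is an assumption made for the example: the `SAMPLE` sentences are invented and tagged by hand, the tag names are arbitrary, and the "verb followed by the start of a noun phrase" test is a crude stand-in for a real parser.

```python
from typing import List, Tuple

Token = Tuple[str, str]      # (word, part-of-speech tag), tags assigned by hand
Sentence = List[Token]

# Hypothetical sample standing in for sentences pulled from web pages.
SAMPLE: List[Sentence] = [
    [("She", "PRON"), ("ate", "VERB"), ("the", "DET"), ("cake", "NOUN")],
    [("He", "PRON"), ("ate", "VERB"), ("quickly", "ADV")],
    [("They", "PRON"), ("slept", "VERB")],
    [("We", "PRON"), ("ate", "VERB"), ("it", "PRON")],
]

# Tags that can begin a noun phrase in this toy tag set (an assumption).
OBJECT_STARTS = {"DET", "NOUN", "PRON"}


def build_corpus(sample: List[Sentence], verb: str) -> List[Sentence]:
    """Keep only the sentences that contain the target verb."""
    return [
        s for s in sample
        if any(tag == "VERB" and word.lower() == verb for word, tag in s)
    ]


def has_direct_object(sentence: Sentence, verb: str) -> bool:
    """Crude heuristic: is the verb immediately followed by a noun-phrase start?"""
    for i, (word, tag) in enumerate(sentence):
        if tag == "VERB" and word.lower() == verb:
            next_tag = sentence[i + 1][1] if i + 1 < len(sentence) else None
            return next_tag in OBJECT_STARTS
    return False


def split_by_object(corpus: List[Sentence], verb: str):
    """Return (sentences with an object, sentences without one)."""
    with_obj = [s for s in corpus if has_direct_object(s, verb)]
    without_obj = [s for s in corpus if not has_direct_object(s, verb)]
    return with_obj, without_obj


if __name__ == "__main__":
    corpus = build_corpus(SAMPLE, "ate")
    with_obj, without_obj = split_by_object(corpus, "ate")
    for s in without_obj:
        print(" ".join(word for word, _ in s))
```

A real tool would replace the hand-tagged sample with pages retrieved from the web and the adjacency test with a full syntactic parse; the filter-then-inspect shape is the part this sketch is meant to show.
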

The web still has its drawbacks. Most of it is in English, limiting its use for other
languages (although Dr Resnik is working on a Chinese version of the LSE).
And it is mostly written, not spoken, making it tougher to gauge
people's spontaneous use. But since much web content is written by
non-professional writers, it more clearly represents informal and
spoken English than a corpus such as the North American News Text
Corpus does.

Despite the problems, linguists are gradually warming to the web as a corpus for
formal research. An early paper on the subject, written in 2003 by
Frank Keller and Mirella Lapata, of Edinburgh and Sheffield
Universities, showed that web searches for rare two-word phrases
correlated well with the frequency found in traditional corpora, as
well as with human judgments of whether those phrases were natural.
What problems the web throws up are seemingly outweighed by the
advantages of its huge size. Such evidence, along with tools such as Dr
Resnik's, should convince more and more linguists to turn to the corpus
on their desktop. Young scholars seem particularly keen.
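
To make the idea of web counts "correlating well" with corpus counts concrete, here is a minimal sketch of one common way such a comparison can be run: take the logarithm of each count and compute a Pearson correlation. This is an assumption about method for illustration, not a reproduction of the paper's procedure, and the numbers in `corpus_counts` and `web_counts` are invented placeholders, not real data.

```python
import math
from typing import List, Sequence


def log_counts(counts: Sequence[float]) -> List[float]:
    """Log-transform counts; frequencies are usually compared on a log scale."""
    if any(c <= 0 for c in counts):
        raise ValueError("counts must be positive to take logarithms")
    return [math.log(c) for c in counts]


def pearson(xs: Sequence[float], ys: Sequence[float]) -> float:
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length, at least 2 items")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / math.sqrt(var_x * var_y)


if __name__ == "__main__":
    # Invented counts for five hypothetical two-word phrases.
    corpus_counts = [3, 12, 40, 150, 900]
    web_counts = [2_000, 15_000, 30_000, 400_000, 1_200_000]
    r = pearson(log_counts(corpus_counts), log_counts(web_counts))
    print(f"correlation of log counts: {r:.2f}")
```

A coefficient near 1 would mean that phrases common in the traditional corpus are also common on the web, which is the kind of agreement the paragraph above describes.
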

The easy availability of the web also serves another purpose: to democratise the
way linguists work. Allowing anyone to conduct his own impromptu
linguistic research, some linguists hope, will do more to popularise
their notion of studying the intricacy and charm of language as it
really exists, not as killjoy prescriptivists think it should be.