Cameo of an Information Scientist: Gerard Salton and Information Retrieval

Most of the published research in our field is probably not worth doing and ought to be forgotten
Gerard Salton, "A Note About Information Science Research," JASIS, 36(4), p.268, 1985

Indexing As A Craft

An indexer was a person who worked with an author and publisher to create a listing of topics called an index. The most common example is the back-of-the-book index.

"Indexing, far from being merely the alphabetically arranged concatenation of text words, is, first of all, a creative process. Though always, by definition, dependent on the work of another creative mind--that of a writer, artist, or compiler of data--it transforms the original text into a homologous but functionally different structure. In that structure, topics and names mentioned here and there in the text are arranged in an ordered and easily scannable sequence, indicating their place in the text. At the same time, both their explicit and implicit relationships are displayed in a way in which the original text does not or cannot reveal them."

"Like other creative processes, such as literary writing, painting, or composing music, indexing relies on certain technical rules which can be learned by example, training, and experience. Yet the creative inspiration that will result in a good and useful index can neither be learned nor taught."

Hans H. Wellisch, Indexing from A to Z, 1995

Question: What Happens When Text Is Digitized?

Answer: Machines Can Find "Words"

An orthographic word is a series of alphanumerics between spaces. Are you as clever at finding words as a computer is? How many words in each of these?

dog house
doghouse
dog-house

Word, a slippery concept...

Linguists prefer to use "lexeme" to denote whatever a word may be. "Lexeme - a word in the abstract sense, an individual distinct item of vocabulary, of which a number of actual forms may exist for use in different syntactic roles." Linguists use "word form" to avoid the ambiguity of word. For example, see, sees, seeing, saw and seen are word forms of the lexeme see. The Oxford Dictionary of English Grammar, 1998
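One way to picture the lexeme/word-form distinction is as a lookup table from word forms to their lexeme. What follows is only a sketch: the table `wordFormToLexeme` and the helper `lexemeOf` are hypothetical names made up for illustration, and the table covers just the one lexeme from the example above.

```javascript
// Assumption: a hand-built table for a single lexeme, taken from the
// example above (see, sees, seeing, saw, seen).
var wordFormToLexeme = {
  see: "see",
  sees: "see",
  seeing: "see",
  saw: "see",
  seen: "see"
};

// Hypothetical helper: return the lexeme for a word form, or null if the
// word form is not in the table.
function lexemeOf(wordForm) {
  var key = wordForm.toLowerCase();
  return Object.prototype.hasOwnProperty.call(wordFormToLexeme, key)
    ? wordFormToLexeme[key]
    : null;
}

console.log(lexemeOf("saw"));
console.log(lexemeOf("Seen"));
console.log(lexemeOf("sought"));
```

Notice that the table cannot tell whether "saw" is the past tense of see or the tool in the garage. A lookup like this records a decision somebody already made; it does not make the decision.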

Ya, so what? Answer: Language is messy, organic and not a system of notation planned ahead of time.

"To understand punctuation, a historical perspective is essential. The modern system is the result of a process of change over many centuries, affecting both the shapes and uses of punctuation marks. Early classical texts were unpunctuated, with no spaces between words." David Crystal, The Cambridge Encyclopedia of the English Language, 1995

Ya, so what? Answer: What you consider to be "normal" form for a language was invented by some printer in London in the fifteenth century who put spaces between words because that's the way he felt that afternoon, etc. Ok! ok! It was the fourteenth century and he woke up with a hangover, etc.

Question: What Happens When Text is Digitized?

Answer: Machines Can Find Words

Therefore: Machines Can Index Text

Which Means: Machines Can Determine Meaning

Significance Check for the Inattentive! There is a tremendous amount of money to be made if a machine can determine the meaning of text! Think of all the newspapers that can be indexed, all the scientific articles that can be indexed, how spam can be parsed from e-mail, how porn can be found on the web, etc., etc. You could build a bot to scour the Net that would determine the "meaning" of Web pages and thereby find just the information that you're looking for! Call that the Semantic Web! Terry Brooks could get a machine to answer his e-mail while he goes fishing!

Something to read: "Orthography as a fundamental impediment to online information retrieval" by Terrence A. Brooks [Note: Some material in this essay has been omitted for brevity. You don't need the stuff that has been omitted.]

Are INFO 100 Students Clever Enough To Write A Program To Find Words?

JavaScript, like most scripting and programming languages, has string manipulation tools. One of those tools splits a string based on some indicated delimiter. In the following, a string is split on the space character.

<script language=JavaScript>
	// Split the string into "words" wherever a space character occurs.
	var s = "dog house";
	var wordArray = s.split(" ");
	// Write each piece into the page as its own paragraph.
	for (var i = 0; i < wordArray.length; i++) {
	  document.write("<p>" + wordArray[i] + "</p>");
	}
</script>
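As a rough check on the "How many words?" question asked earlier, the same space-splitting idea can be applied to all three spellings. This is only a sketch: `countOrthographicWords` is a hypothetical helper name, and it counts nothing more than the pieces left after splitting on a single space.

```javascript
// Hypothetical helper: count "orthographic words" by splitting on spaces.
function countOrthographicWords(text) {
  return text.split(" ").length;
}

var samples = ["dog house", "doghouse", "dog-house"];
for (var i = 0; i < samples.length; i++) {
  console.log(samples[i] + ": " + countOrthographicWords(samples[i]));
}
```

By this definition only the first spelling contains two words, although a reader would probably say all three name the same thing.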

Here's a more sophisticated example. Enter several words in the text box and press the button.

<script language=JavaScript>

function doSplitter() {
  // Read whatever the user typed into the text box.
  var inString = document.myForm.inComing.value;

  if (inString.length == 0) {
    alert("Enter a value");
    return;
  }

  // Only the first 30 characters are considered.
  if (inString.length > 30) {
    inString = inString.substring(0, 30);
  }

  // Split on single spaces; consecutive spaces produce empty "words".
  var wordArray = inString.split(" ");
  alert("There are " + wordArray.length + " words");
}
</script>


<form name="myForm">
<input type=text name="inComing" size=30>
</form>
<br>
<input type=button value="Count The Words!" onclick="doSplitter()">
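Try typing two spaces between words into the box above: the count goes up even though no word was added. A hedged variant, assuming we would rather treat any run of whitespace as one delimiter (`countWords` is a hypothetical name, not part of the form above):

```javascript
// Hypothetical variant of the counter: split on runs of whitespace
// instead of on each single space.
function countWords(text) {
  // Remove leading and trailing whitespace first.
  var trimmed = text.replace(/^\s+|\s+$/g, "");
  if (trimmed === "") {
    return 0;
  }
  return trimmed.split(/\s+/).length;
}

console.log("dog  house".split(" ").length); // splitting on single spaces
console.log(countWords("dog  house"));       // splitting on whitespace runs
console.log(countWords("   "));              // nothing but spaces
```

This fixes the extra-space problem, but it still says "dog-house" is one word and "dog house" is two.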

Gerard Salton and Automatic Indexing

Prior to Gerry Salton, information retrieval was done with paper methods and library practices. Salton developed information retrieval for digital documents in a database context. A key innovation was automatic indexing - using machine methods to determine the meaning of text.
He was awarded the first ACM SIGIR Award for outstanding contributions to information retrieval, twice received the ASIS award for Best Information Science Book, received the ASIS Award of Merit and the Alexander von Humboldt Senior Scientist Award, and was recognized as a fellow of the American Association for the Advancement of Science.

ACM - Association for Computing Machinery
SIGIR - Special Interest Group on Information Retrieval
ASIS - American Society for Information Science

A Simple Automatic Indexing Process

In 1983, Salton and McGill published Introduction to Modern Information Retrieval, which outlined a simple automatic indexing process (page 71, passim):

Identify the individual words that constitute the document

Eliminate high frequency words (i.e., stopwords)

Stem the remaining words

Weight the index terms by their frequency of occurrence in documents.
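The four steps above can be sketched in a few lines of JavaScript. Treat this as a toy under loud assumptions, not as Salton and McGill's procedure: the stopword list is a handful of words picked for illustration, `crudeStem` is a made-up suffix stripper standing in for a real stemmer, and the "weight" is nothing more than a raw count within one document.

```javascript
// Assumption: a tiny illustrative stopword list.
var STOPWORDS = ["the", "a", "an", "of", "and", "in", "is", "to"];

// Assumption: a crude suffix stripper standing in for a real stemmer.
function crudeStem(word) {
  var suffixes = ["ing", "es", "s"];
  for (var i = 0; i < suffixes.length; i++) {
    var suffix = suffixes[i];
    // Only strip when a stem of at least three letters remains.
    if (word.length - suffix.length >= 3 &&
        word.slice(-suffix.length) === suffix) {
      return word.slice(0, word.length - suffix.length);
    }
  }
  return word;
}

function indexDocument(text) {
  // Step 1: identify the individual words.
  var words = text.toLowerCase().split(/[^a-z0-9]+/);
  // A prototype-free object, so terms like "constructor" count correctly.
  var weights = Object.create(null);
  for (var i = 0; i < words.length; i++) {
    var word = words[i];
    // Step 2: eliminate stopwords (and empty strings left by the split).
    if (word === "" || STOPWORDS.indexOf(word) !== -1) {
      continue;
    }
    // Step 3: stem the remaining words.
    var term = crudeStem(word);
    // Step 4: weight each term by how often it occurs.
    weights[term] = (weights[term] || 0) + 1;
  }
  return weights;
}

var index = indexDocument("The dog sees the dogs in the dog house");
console.log(index);
```

For this sentence the index holds dog with weight 3 and see and house with weight 1 each. Every step involves a choice (which stopwords? which suffixes?), and none of those choices involves knowing what the sentence means.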

Politics, Religion, Food Fights, etc., in Information Retrieval

Information retrieval is quite controversial. It attempts to index the meaning of documents via a mechanical processing of language. It is easy to be skeptical (Is it your impression that Terry Brooks is a skeptic?), but something has to be done with the torrent of digital documents, or else we'll have to invent Google (yikes!).

Some folks have tried linguistic methods such as natural language processing and other folks have tried fuzzy logic. Believe me, if something really worked somebody would be very famous and wealthy.

The philosopher Wittgenstein considered language as a sort of word game, so who knows what words are, or what language is, or what this sentence is about, or stuff like that. What would happen if we all started speaking leetspeak?

You can imagine that great religious battles have been fought over the "right" way to do information retrieval. Consider the following letter written by Gerry Salton:

Journal of the American Society for Information Science, v.47(4), April 1996, p. 333

The Challenge of Information Retrieval On The Web

The web is a vast information resource that permits us to do information retrieval (Really? Aren't you making a lot of assumptions? Such as? Well, that the web is a database). Using Google, we retrieve web documents (Really? Terry Brooks said that web pages were 'presentations' and not 'documents'). Google indexes the web using keywords that people put in their HTML pages (Really? Doesn't Google consider them as spam?). We've learned a lot in the modern database era that applies to the web. (Really? Such as? What about finding words in web pages? So, what's a word? Answer: stuff between spaces. You mean the filling between spaces? Ok, the filling. So Google is stuffed with filling? That's a lot better than being filled with stuffing!)

Lots of things to think about:

How are web pages (i.e., static HTML) and database documents the same? different?
How are web pages (i.e., dynamic HTML) and database documents the same? different?
Is Google an index of the Web?
Can you build an automatic indexing algorithm for the Web?
If Google uses an automatic indexing algorithm, why is it a secret?
Why is packing a bunch of words in a web page considered spamming Google?
Why is responding to Google's web crawler with a special page called "cloaking", and why is it punished by a ban from Google? Weren't you just trying to "help" the crawler?

Why don't we apply everything we've learned during the last 50 years of information retrieval to the web?

The ironic, market-driven, cynical last comment: You realize that if there were a technological "silver bullet" that solved the information retrieval problem on the web, an organization like Microsoft would have already incorporated it into Internet Explorer. If the solution really existed, you'd have to pay to use it.

Something to read: Websearch: How the web has changed information retrieval, by Terrence A. Brooks

An Even Deeper Insight (That Might Touch Profundity)

If there were a consistent pattern in spoken or written language, it would have been noticed and exploited long before computers came along.

There hasn't been, so one would tend to conclude there isn't.

Oops! Sorry. Please excuse my candor.

Cynicism is Easy, But It Doesn't Solve Real-World Problems!

Parsing Text is Imperfect, But So What!

"Brightmail scans the text of every message sent to a Hotmail account and runs it through a filter that looks for patterns of words, letters or numbers. If it finds such patterns, which are known as signatures, this electronic sieve either deletes the message that contains them or strains it into a separate folder."
"Brightmail augments the filter by maintaining 250,000 e-mail accounts as spam traps. Staff members harvest any spam messages that evade the filters, then generate rules to recognize those messages. The Hotmail filter is upgraded every 10 minutes to incorporate the new rules." Fortifying the In Box as Spammers Lay Siege by Mark Glassman, New York Times, July 31, 2003

"When serving ads for content sites, both Overture and Google employ technology that infers the topic of a page by scanning for words and phrases, searching through a database of tens of thousands of advertisers, then delivering a relevant text ad. In some cases that is not difficult...The technology is not yet foolproof. The online edition of The New York Post ran an article last month about a murder in which the victim's body parts were packed in a suitcase, and Google served up an ad for a luggage dealer." If You Liked The Web Page, Try The Ad" by Bob Tedeschi, New York Times, Monday, August 4, 2003

Examples of spam currently slipping past my filter:

Subject: yo)u sh:ould a)ttend thi,s l-ife ch_anging e)vent u:cpn
Subject: Ch'eck ou(t ou)r selec;tion _of :great R^X fi/bk
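The two subject lines above suggest why such spam slips through: a filter looking for a literal word finds nothing, because stray punctuation has been inserted inside the words. Here is a minimal sketch, assuming a one-word signature and two hypothetical helpers, `matchesSignature` and `stripPunctuation`. None of this is Brightmail's actual method; it only illustrates the cat-and-mouse game.

```javascript
var subject = "yo)u sh:ould a)ttend thi,s l-ife ch_anging e)vent u:cpn";

// Hypothetical signature check: does the text contain the literal word?
function matchesSignature(text, signature) {
  return text.toLowerCase().indexOf(signature) !== -1;
}

// Hypothetical normalizer: drop everything except letters, digits, spaces.
function stripPunctuation(text) {
  return text.replace(/[^A-Za-z0-9 ]/g, "");
}

console.log(matchesSignature(subject, "should"));                   // false
console.log(stripPunctuation(subject));
console.log(matchesSignature(stripPunctuation(subject), "should")); // true
```

Stripping punctuation recovers the words here, but the same step would also turn "dog-house" into "doghouse", so the fix creates its own word-boundary problems.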