University of Washington
Department of Computer Science and Engineering

CSE574 – Notes from Class Discussion

Date      Topic                                                        Scribe
1/8/98    Search Broker & Ahoy!                                        Tessa Lau
1/13/98   STIR, WHIRL & SPIRAL                                         Jason Staczek
1/15/98   Datalog, First-order Logic, and Description Logic examples   Marc Friedman
1/20/98   Information Integration Systems: Razor, TSIMMIS, IM          Rachel Pottinger
1/22/98   Wrapper Induction                                            Brian Michalowski
1/27/98   ILA and Shopbot                                              Steve Wolfman
1/29/98   Wrappers Continued                                           Adam Carlson
2/3/98    Strudel
2/5/98    Constraints and the Web                                      Dave Hsu
2/10/98   WebSQL                                                       Derrick Tucker
2/19/98   Recommender Systems                                          Corin Anderson

(01/08/98) Search Broker; Ahoy!

Notes by Tessa Lau

We discussed two papers:

  1. The Search Broker, by Udi Manber and Peter A. Bigot (Search Broker)
  2. Dynamic Reference Sifting: A Case Study in the Homepage Domain, by Jonathan Shakes, Marc Langheinrich, and Oren Etzioni (Ahoy!)

Search Broker

The Search Broker provides a common interface to a number of diverse information databases. The databases are organized into a two-level hierarchy. Each query includes as its first word a topic selector, which maps into a single database that provides the best information for that query. SB then performs the following four steps:
  1. Match topic to database
  2. Translate query into format for search engine (db)
    (each db has a hand-written template describing the query format)
  3. Send HTTP request (GET or POST) to search engine
  4. Present results to user
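As a rough illustration, the four steps might be sketched as follows. The topic table, URLs, and parameter names here are all invented; the real system uses a librarian-maintained topic hierarchy and a wrapper template per database.

```python
# Sketch of the Search Broker's four-step query loop. All names, URLs, and
# parameters below are hypothetical, for illustration only.
TOPIC_TO_DB = {
    "nutrition": {"url": "http://example.com/nutrition", "method": "GET",  "param": "q"},
    "weather":   {"url": "http://example.com/weather",   "method": "POST", "param": "city"},
}

def broker(query):
    topic, _, rest = query.partition(" ")
    db = TOPIC_TO_DB[topic]                      # 1. match topic to database
    request = {"url": db["url"],                 # 2. translate query via the db's template
               "method": db["method"],           # 3. this GET/POST would be sent over HTTP
               "params": {db["param"]: rest}}
    return request                               # 4. fetched results go back to the user

req = broker("nutrition fat in pizza")
```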

Strengths

  1. Incorporation of human input
  2. Access to the hidden web: dynamically-generated content
  3. High accuracy: librarians select the "best" info sources
  4. Large selection of databases covering many topics
  5. Simple wrappers (this is also a limitation)

Limitations

  1. Rudimentary topic selection
  2. Sensitivity to ontology or topic-db mapping
  3. Could use Bayes networks to disambiguate topic based on the query content
  4. Simple wrappers
  5. Not parallel: search engines are queried one-by-one
  6. No ranking (or clustering) of results. However, the librarian has already chosen the best search engine for each topic, so you could argue that ranking isn't necessary.
  7. Resource limitations: a fixed two-level hierarchy might need to grow arbitrarily large as the web grows

Paper evaluation

Everyone agreed that the paper was lacking in analysis and evaluation of the system. In particular, there was no discussion of how well the system performed, nor was there justification for their design decisions. However, Dan was extremely interested in the description of the wrapper language.

Comments

Several comments arose comparing the Search Broker to Jango/Excite and regular search engines like Metacrawler and Alta Vista. I think the comparison is a bit unfair, since people have different kinds of information needs and each system addresses a different need.

Jango's architecture and use of complex wrappers allow it to support advanced queries and format the results into a table. This is great for comparison-type queries, where there are several "correct" answers to the query and the user is interested in a comparison of all of them.

The Search Broker, on the other hand, attempts to answer fact-queries where there is only one correct answer (e.g., how much fat is in pizza?). In this case comparisons aren't as useful and SB's approach of having only one database per topic is appropriate.


Ahoy!

The Ahoy! homepage finder is based on the idea of using domain dependent heuristics to maximize precision and recall by filtering the output of a generalized search engine with high recall. The architecture described in the paper is called Dynamic Reference Sifting, which has several components:
  1. Reference source (not necessarily comprehensive)
  2. Cross filter: filter pages by institution
  3. Heuristic filter: test for "homepage-ness"
  4. Buckets: rank pages based on correct name, correct institution, homepage-ness
  5. URL generator: synthesizes candidate URLs when previous steps fail
  6. URL pattern learner
What classes are appropriate for DRS? Examples of such classes are:

System evaluation

Strengths: the system provides both high recall and high precision. Precision is especially important because it reduces the need to scan hundreds of false positives. It's able to bootstrap itself, which means that it's useful even before it has gathered any training examples. It is able to incorporate domain-dependent filters such as the institutional cross-filter and a test for homepage-ness.

Limitations: slow; must wait for all search engines to complete. DRS is not applicable in every domain.

Paper evaluation

Strengths: the experiments are comprehensive and (IMHO) convincing.

Limitations: the precision experiments aren't quite fair, since they only consider the top-ranked result rather than the first page of results. Also, the comparison of recall among the different systems is bogus since Ahoy!'s recall is bounded by Metacrawler's recall.

Ahoy! == DRS: it's also not clear whether DRS is general or whether it only applies to homepage-finding.


(01/13/98) STIR, WHIRL & SPIRAL

 

Notes by Jason Staczek

We discussed one paper, A Web-based Information System that Reasons with Structured Collections of Text, by William W. Cohen, AT&T Research. Dan opened with some background on relational database operations, vector representation of text documents, and document similarity calculations. A fair bit of the discussion was spent contrasting Nick Kushmerick’s HLRT wrappers (Wrapper Induction for Information Extraction, Kushmerick et al.) with Cohen’s conversion programs.

The paper does a poor job of describing the user experience. Instead, try the guided tour of the WHIRL system at Cohen’s home page.


Relational DB background

A relation is a fixed width data table, with each row (known as a tuple) consisting of values for named fields. Basic operations on a single relation are selection of tuples and projection (removal of fields or columns). Given two relations P and Q, their cartesian product is defined as:

P =

  A  B  C
  1  2  3
  5  5  4
  2  3  4
  4  3  5
  4  3  3
  1  2  3

Q =

  D  E
  3  1
  4  2

P × Q =

  A  B  C  D  E
  1  2  3  3  1
  1  2  3  4  2
  5  5  4  3  1
  5  5  4  4  2
  2  3  4  3  1
  2  3  4  4  2
  4  3  5  3  1
  4  3  5  4  2
  4  3  3  3  1
  4  3  3  4  2
  1  2  3  3  1
  1  2  3  4  2

Two (or more) relations may be joined by applying a predicate to their cartesian product to select only those tuples which satisfy the condition.

P ⋈ Q  (where C = D) =

  A  B  C  D  E
  1  2  3  3  1
  5  5  4  4  2
  2  3  4  4  2
  4  3  3  3  1
  1  2  3  3  1

Join notation may also be written in conjunctive form. The join described above would be:

T(A,B,C,D,E) :- P(A,B,C) ∧ Q(D,E) ∧ C = D

 

The predicate can be arbitrary (C < D, for example). Cohen describes a similarity operation on text fields (C~D) used to join relations from possibly unrelated sources.
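The product and join above can be reproduced in a few lines. A minimal Python sketch, which materializes the full cartesian product and filters it with an arbitrary predicate (fine for toy relations, though real query processors avoid materializing the product):

```python
from itertools import product

P = [(1, 2, 3), (5, 5, 4), (2, 3, 4), (4, 3, 5), (4, 3, 3), (1, 2, 3)]  # columns A, B, C
Q = [(3, 1), (4, 2)]                                                    # columns D, E

def join(p, q, pred):
    # Select from the cartesian product the tuples satisfying the predicate.
    return [pp + qq for pp, qq in product(p, q) if pred(pp, qq)]

# The C = D join from the tables above (C is P's third column, D is Q's first).
T = join(P, Q, lambda pp, qq: pp[2] == qq[0])
```

Any predicate works in the same way, e.g. `lambda pp, qq: pp[2] < qq[0]` for C < D.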

 

Similarity Predicate in WHIRL

Given a dictionary of terms T, documents (or text fields in a tuple) can be represented as vectors v, w in N-dimensional space, where N is the number of unique terms in T. For any term t, v_t is the magnitude of the component of v in the direction of term t. v_t is larger if t occurs frequently in v (TF_{v,t} = term frequency, the number of times t occurs in v), and smaller if t occurs frequently in other documents in the same column as v (captured by IDF_t, the inverse document frequency). Cohen gives v_t as basically:

    v_t = (log(TF_{v,t}) + 1) * log(IDF_t) / (normalization factor)

where the normalization factor guarantees 0 ≤ v_t ≤ 1. The similarity between documents v and w is then given by their dot product:

    SIM(v, w) = Σ_{t ∈ T} v_t * w_t

which is interpreted as the cosine of the angle between v and w; 0 ≤ SIM(v, w) ≤ 1, and the value is large when the documents share many important terms. Note that the dictionary employs stemming algorithms to handle suffix and tense differences (run = running, etc.), and may ignore unimportant terms, or stopwords (a, and, the, etc.).
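The scheme can be sketched as follows. This is a standard TF-IDF/cosine variant consistent with the formula above, not Cohen's actual implementation; IDF_t is approximated here as N/df_t, where df_t is the number of documents containing t.

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    # Weight term t in document d by (log TF + 1) * log IDF, then normalize
    # each vector to unit length so that SIM below is a cosine in [0, 1].
    n = len(docs)
    df = Counter(t for d in docs for t in set(d.split()))
    vecs = []
    for d in docs:
        tf = Counter(d.split())
        v = {t: (math.log(c) + 1) * math.log(n / df[t]) for t, c in tf.items()}
        norm = math.sqrt(sum(w * w for w in v.values())) or 1.0
        vecs.append({t: w / norm for t, w in v.items()})
    return vecs

def sim(v, w):
    # Dot product of unit vectors: the cosine of the angle between them.
    return sum(weight * w.get(t, 0.0) for t, weight in v.items())

docs = ["star wars", "star trek", "twelve monkeys"]
v = tfidf_vectors(docs)
```

Documents sharing an important term ("star") get a positive score; documents with no terms in common score zero.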

 

WHIRL, STIR, SPIRAL

SPIRAL is an information collection and retrieval system that supports logical database queries across multiple unrelated data sources. The SPIRAL architecture consists of a database builder, and a query processor known as WHIRL (Word-based Heterogeneous Information Retrieval Logic).

Information extraction and database construction

A database for a SPIRAL domain is built by extracting a set of relations from multiple unrelated data sources. Relations are constructed automatically by applying conversion programs to HTML sources. Conversion programs take HTML parse trees as input and attempt to match paths in the tree with supplied patterns. When a pattern is matched, data is extracted from a leaf node and classified according to the conversion program. As an example, the wrapper:

html - body - ul - li as movielisting - b as moviename

traverses the input tree to locate paths that look like html - body - ul - li. Leaf nodes with HTML attribute b are placed in a relation called movielisting under the field moviename. Conversion programs are hand-coded, and Cohen’s claim is that they take an average of three to four minutes to develop. Data is stored in STIR (Simple Texts in Relations) format, or free text in each tuple field. The wrapper language has additional controls to handle non-HTML-path formatted data (<br>-based), an escape to Perl, and the ability to do multiple passes with different conversion programs.
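The path-matching idea can be sketched with the standard library's HTML parser. The class, page, and field names below are invented for illustration; Cohen's conversion-program language is richer than this (the <br>-based controls, the Perl escape, multiple passes).

```python
from html.parser import HTMLParser

class PathExtractor(HTMLParser):
    # Toy analogue of a conversion program: walk the tag tree and, whenever
    # the current path of open tags ends with `path` and the leaf tag is
    # `leaf`, record the leaf's text as an extracted field value.
    def __init__(self, path, leaf):
        super().__init__()
        self.path, self.leaf = path, leaf
        self.stack, self.rows = [], []

    def handle_starttag(self, tag, attrs):
        self.stack.append(tag)

    def handle_endtag(self, tag):
        while self.stack and self.stack.pop() != tag:
            pass

    def handle_data(self, data):
        data = data.strip()
        if data and self.stack and self.stack[-1] == self.leaf \
                and self.stack[:-1][-len(self.path):] == self.path:
            self.rows.append(data)

page = ("<html><body><ul>"
        "<li><b>Blade Runner</b> 1982</li>"
        "<li><b>Brazil</b> 1985</li>"
        "</ul></body></html>")
p = PathExtractor(["html", "body", "ul", "li"], "b")
p.feed(page)
```

After `feed`, `p.rows` holds the text of each `<b>` leaf reached along the html - body - ul - li path.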

There was some discussion about the expressiveness of this wrapper language (less the Perl escape) as compared to Nick Kushmerick’s HLRT. Consensus seemed to be that they were about equally expressive, but Cohen’s is better equipped to deal with nested data, unless the nesting structure is known up front. It was suggested that HLRT was probably better at avoiding false positives in head and tail areas that might catch SPIRAL.

Query processing

Queries are applied to the database through WHIRL, an extension of Datalog. Queries are conjunctions of relation selectors and similarity predicates. WHIRL performs a "soft join" on selected relations by substituting the notion of equality with a similarity metric as described above. Rather than selecting tuples which meet an equality condition, WHIRL returns tuples ranked by similarity score. When the query contains joins of more than two relations, the results are sorted by the product of the similarities of each join.

WHIRL uses an unspecified A* search mechanism to evaluate and return only the most promising tuples. It was shown to perform better on at least two databases than the so-called naïve method of joining relations:

for each document d in the ith column of relation P:
    submit d as an IR-ranked retrieval query to the corpus corresponding to
        the jth column of relation Q (the lookup uses an inverted index)
    save the top r results
merge the top r results from each iteration to find the top k overall results

There was some speculation that WHIRL used IDF information to prune the search. There was an unresolved question about the asymptotic complexity of WHIRL compared to other optimization methods.
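The naïve method above as a runnable sketch; word overlap stands in for the TF-IDF similarity, and the inner scan over Q's column stands in for the inverted-index lookup:

```python
import heapq

def word_overlap(a, b):
    # Toy similarity in [0, 1]: Jaccard overlap of the documents' word sets.
    wa, wb = set(a.split()), set(b.split())
    return len(wa & wb) / max(len(wa | wb), 1)

def naive_soft_join(p_docs, q_docs, sim=word_overlap, r=3, k=5):
    # For each document in P's column, run it as a ranked retrieval query
    # against Q's column, keep the top r hits, then merge for the top k pairs.
    merged = []
    for i, pd in enumerate(p_docs):
        scored = [(sim(pd, qd), i, j) for j, qd in enumerate(q_docs)]
        merged.extend(heapq.nlargest(r, scored))
    return heapq.nlargest(k, merged)

pairs = naive_soft_join(["star wars", "blade runner"],
                        ["star wars special edition", "brazil", "blade runner"])
```

Each result is a (score, i, j) triple pairing the ith P document with the jth Q document, ranked by similarity.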

Strengths

Limitations

Comments


(01/15/98) Datalog, First-order Logic, and Description Logic examples

Datalog

Datalog is a function-free subset of prolog. Without negation, it is equivalent to Horn predicate logic. A datalog program is composed of rules, with a head (a relation), a ":-" symbol which is read as "if", and a body (a conjunction of relations). A relation is a function symbol, followed by some variable or constant arguments in parentheses. Any variable appearing in the head must appear in the body. For instance, we define paths in terms of edges in the datalog program:
path(X,Y) :- edge(X,Y).
path(X,Y) :- edge(X,Z) & path(Z,Y).
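This program can be evaluated bottom-up: start from the edge facts and apply the rules until no new facts appear. A minimal sketch:

```python
def paths(edges):
    # path(X,Y) :- edge(X,Y).
    path = set(edges)
    # path(X,Y) :- edge(X,Z) & path(Z,Y).  Apply to fixpoint.
    while True:
        new = {(x, y) for (x, z) in edges for (z2, y) in path if z == z2}
        if new <= path:
            return path
        path |= new

facts = paths({("a", "b"), ("b", "c"), ("c", "d")})
```

On the chain a -> b -> c -> d this derives all six reachable pairs, and nothing in the reverse direction.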

First-order Logic

For those familiar with first-order logic, the above program has a declarative semantics in FOL/English:
forall(X,Y)   ( edge(X,Y) => path(X,Y) ) AND
forall(X,Y,Z) ( edge(X,Z) ^ path(Z,Y) => path(X,Y) ).

Description Logic

Description logic is a different subset of first-order logic. In contrast to datalog,
  • There are no explicit variables.
  • There is negation.
  • There is limited disjunction.
An example (somewhat distorted by my weak HTML) is a query for papers all of whose authors are researchers, and whose authors include both an American and a non-American:
Paper ^ (ALL Paper-Author . Researcher)
      ^ (EXISTS Paper-Author . American)
      ^ (EXISTS Paper-Author . ~American)
Another example queries for the papers with at least two authors, all of whom are people, not chihuahuas:
Paper ^ (>=2 Paper-Author) ^ (ALL Paper-Author . Person)

Predicates, relations, and queries

There are two kinds of predicates in datalog: those that are enumerated (the edge predicate is assumed to be enumerated somewhere, in a database) and those that are derived. In the database theory community (i.e., Jeff Ullman's students), the terms for them are EDBs (extensional database predicates) and IDBs (intensional database predicates). A conjunctive query is a datalog rule, defining a new query predicate (perforce an IDB) in terms of other predicates. A view is defined in the same way as a query, though the term "view" usually indicates an IDB with extended existence and reuse, while a query is more often a one-shot deal.

Subsumption, query containment, and entailment

One question we can ask of our databases is, what are the ground facts (a.k.a. tuples) we can derive from them? For instance, what are the pairs (X,Y) such that path(X,Y)? This leads into the subject of evaluating a datalog query over a database, a subject which was studied to death by Ullman's group.

A more interesting question involves two queries. We gave two examples of queries in description logic in the previous section. The answers to the former will all be answers to the latter as well, so we say the latter query subsumes the former. An analogous notion applies to pairs of datalog queries. Query Q1 contains query Q2 iff all the facts (tuples) returned by Q2 are always returned by Q1, regardless of the database. Equivalently, in FOL, FORALL (X,Y) Q2(X,Y) => Q1(X,Y). If a query contains another, then a containment mapping exists between them. A containment mapping is a substitution of variables and constants (of the containee) for the variables of the container, such that each conjunct of the containee appears in the substituted container.
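Containment mapping can be made concrete with a brute-force search. This sketch represents a conjunctive query as (head_args, body_atoms), treats capitalized strings as variables, and tries every substitution of the containee's terms for the container's variables; the representation and the exhaustive (exponential) search strategy are assumptions of the sketch.

```python
from itertools import product

def is_var(t):
    # Variables are capitalized strings; anything else is a constant.
    return isinstance(t, str) and t[:1].isupper()

def contains(q1, q2):
    # Does Q1 contain Q2?  Search for a containment mapping h such that
    # h(head of Q1) = head of Q2 and every substituted conjunct of Q1
    # appears among Q2's conjuncts.
    head1, body1 = q1
    head2, body2 = q2
    vars1 = []
    for t in list(head1) + [t for _, args in body1 for t in args]:
        if is_var(t) and t not in vars1:
            vars1.append(t)
    terms2 = sorted({t for t in head2} | {t for _, args in body2 for t in args})
    atoms2 = {(p, tuple(args)) for p, args in body2}
    for combo in product(terms2, repeat=len(vars1)):
        h = dict(zip(vars1, combo))
        def sub(args):
            return tuple(h.get(t, t) for t in args)
        if sub(head1) == tuple(head2) and all((p, sub(args)) in atoms2
                                              for p, args in body1):
            return True
    return False

# Q1 asks for edges; Q2 asks for edges that also close a length-2 path.
q1 = (("X", "Y"), [("e", ("X", "Y"))])
q2 = (("X", "Y"), [("e", ("X", "Z")), ("e", ("Z", "Y")), ("e", ("X", "Y"))])
```

Here every answer to Q2 is an answer to Q1 (map Q1's X, Y to themselves), but not vice versa, so contains(q1, q2) holds and contains(q2, q1) does not.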

Answering queries using views

Suppose you have a set of EDB predicates in which you understand the universe. You have some WWW form-based interfaces to databases, v1 through vn, which are materialized views (IDBs) of these EDBs. Then suppose you get a query, defined solely in terms of the EDBs. Levy, Rajaraman, and Ordille give an algorithm to find a query defined solely in terms of the views that returns only answers to the original query, perhaps all the answers. This query, in contrast to the original, can actually be run.
Input is Q() :- q1(), q2(), ..., qn().
For each qi, find any view vj which is relevant to qi.
Substitute some combination of vj's for qi's to form new query Q'().
For each view in Q', substitute its definition in terms of EDBs, to get Q''().
Check that Q'' is contained in Q.
Repeat for all combinations.
This algorithm is exponential in the length of the input query, for two reasons. First, it must try all combinations of views relevant to each conjunct. Second, the inner loop does containment mapping, which is exponential in the number of repeated predicate symbols in the query.

Miscellaneous

Also discussed were capabilities of a data source. An ordinary "traditional relational" database allows you to download all facts in the whole data relation, or query it on equality with any field. Negative capabilities would imply a more restrictive input/output scheme. Positive capabilities would include SQL or other more complex input, to be processed by the source.


(01/20/98) Information Integration Systems: Razor, TSIMMIS, IM

  • The paper has two bugs. The first is right before equation 11 on the page with section 5: the call to Contains with C should be a call to Contains with C’. The second is in equation 12: there should be no primes.
  • Whether the arrows should be single- or double-headed turned out to be related to what the notation in datalog means. The datalog expression p(x,y) :- e(x,y) logically translates to ∀x,y e(x,y) -> p(x,y); the question was whether it also means p(x,y) -> e(x,y).
  • If there is more than one such rule, then the inverse involves taking the union
  • For the purpose of this paper, if it says implies, it only means implies, not implying the converse
  • This is acceptable because, among other reasons, it should never be assumed that anything is complete.
  • In a web system, it is important to be able to express both:
  • Web site x has reviews for movies
  • Web site y has all reviews for movie z
  • Razor is derived from IM (Information Manifold).
  • The answer to how you take a query in world relation and create a plan to query databases is described in detail in O. Duschka and A. Levy. Recursive plans for information gathering. In Proceedings of the 15th International Joint Conference on AI, 1997.
  • In an example of trying to find reviews of all of the movies starring Harvey Keitel that were playing in Seattle after finding the name of movies in the Internet Movie Database, there are two different places that the user can go for reviews; Ebert, which provides only reviews by Roger Ebert, and Movie Link which provides, while not an exhaustive list, certainly a large subset of them. The Razor paper describes three different methods of finding information that is assumed to be "locally complete." This term means that it is guaranteed to find all of the information that is available given only the sites that they went to. All involve the notion of subsuming.
  • A source X is said to subsume another source Y if all of the information stored in Y can also be found in source X. For example SABRE subsumes United, because it contains all of the information about all of the flights on United, but it does not subsume Southwest, because it does not provide information about Southwest.
  • Information sources that are subsumed by another should not be gotten rid of, however, in case the subsuming site is unavailable. Thus, there are (at least) three possible execution policies, all of which become more interesting if there are more (resource) limits than are currently imposed on web systems:
  1. Brute force – Just ignore subsumption, and execute everything greedily. This method annoys both servers because of the large amount of traffic that is generated and clients because the clients then have to wait for and sift through all of the information from all of the sites, even if it is redundant.
  2. Aggressive – Execute both alternatives in parallel (A, the subsuming source, and B, the subsumed one), but cancel all communication with B once A has successfully returned. On the web this does not make life any better for the server, because all of the processing has to be done anyway, and only the bandwidth for returning the information is saved. However, it does aid the client, who then does not have to go through duplicate data.
  3. Frugal – Initially, run only A, if A fails, then run B. On the web, this method currently does not have many advantages over aggressive from the client side; the benefit is only in saving one call, but that really doesn’t make that much of a difference even on computers with low bandwidth. If there was a charge for accessing information, then it would make sense, but that’s not likely to happen "for at least the next 12 months" [Weld 98]. On the server side this is obviously preferable because the server doesn’t have to service useless calls.
  • The advantages of the frugal system are less likely to be seen because people don’t like to wait. Some suggested that this can be ameliorated by showing that much progress has been made, or that at least something is going on. The question of whether users can be expected to wait for more complex queries was also raised.
  • From the WHIRL paper, we know that joining is a very difficult operation, and the question was raised as to how well Razor handled it. The answer was "not well."
  • Sources either needed to be rigid or there needed to be really good wrappers.
  • The current system attempts to normalize their response, but sometimes it fails. One member of the project team was heard to say, "It sucks."
  • In a related note, the question of how the system would deal with identical articles was also brought up. The answer was that if both articles were exactly the same, including the representation of the author’s name, they would probably be recognized as the same and not duplicated. Otherwise they would probably be duplicated. Whether or not the database community has handled these sorts of situations (an author listed as J. Smith in one table and John Smith in another) was raised, and the answer was that it has not been looked into very much; perhaps WHIRL is the best thus far. The methods thus far have all been very domain specific, there is no domain independent way to do so (this was mentioned to be a good project idea). One suggestion was to use probabilistic information about two objects to see if they are equal.
  • In general systems tend to do a bad job of searching different forms on the same database. If there are two different access patterns to the same data, then two forms are needed.
  • Marc said that one difficulty in building the system is that if the system you are relying on to get your data from gets smarter, then your system must become smarter in order to avoid obsolescence. Dan and others disagreed saying that the system would still make it so that only one web site would have to be searched
  • The limits of local completeness as implemented in the Razor system were counted to be the following:
  • No negation; for example you cannot ask for all of the reviews except those by Ebert.
  • No disjunction on the right hand side.
  • No way to express that the union of two sources subsumes another source.
  • No less-than, no equality (where equality refers to equality on the same variable), and no numbers.
  • While the right hand side can refer to sources and many other things, same as the world ontology, the left hand side can only be the source.
  • It is impossible to represent the idea that two data sets are disjoint.
  • Dan raised the question of whether local completeness is useful. The answer degenerated into a discussion of its limitations. However, it was mentioned that it is helpful in situations where you’re interested in making sure you have "all" of the information, and with local completeness you are assured of knowing when to stop.
  • Do restrictions on a subject make it easier or harder? For example, is it easier to say that there is a database that has all information on cars, or a database that has all information on American cars manufactured between 1972 and 1983? This is important for figuring out which information is irrelevant and if one source subsumes another. On the web, however, you rarely want to say that something contains "all" information because it’s hard to believe that anything is exhaustive.
  • The paper contained some illustrations about the way that information was gathered and joined together, and the question was what is the relationship between that representation and a datalog query. The answer was that it was the same, but the graphical version was easier for the average person to understand.
  • In the picture, no loops corresponded to the datalog statement not being recursive. An example of where you might run into a loop was that you wanted all of the ancestors of Harvey Keitel and had a database that would list parents of a person. You would enter Mr. Keitel, get his parents out of the system, feed them back in, get their parents, etc.
  • The graphs and the local completeness they represent are not unique. They are all equivalent in output, but not in speed; this aspect is not addressed in the paper.
  • In practice, whenever you get a loop, it often turns out that you really don’t want it.
  • If this technology was portable and integrated (taking the best from Whirl, Tsimmis, Razor, etc) would people want it?
  • It isn’t domain specific which would make it more desirable.
  • Some suggested applications were books and cars.
  • The "killer app" for the systems, however, is not the web but companies with large numbers of large databases. They often have no idea what is in them or how to relate the data in one to the data in another regardless of whether or not the data is already in the same format across the databases.
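The frugal execution policy discussed above is just try-then-fallback. A minimal sketch, where fetch_a and fetch_b are hypothetical stand-ins for wrapper calls to a subsuming source A and a subsumed source B (aggressive execution would instead launch both and cancel B):

```python
def frugal(fetch_a, fetch_b):
    # Query the subsuming source first; fall back to the subsumed source
    # only if the first call fails outright or returns nothing.
    try:
        result = fetch_a()
        if result is not None:
            return result
    except OSError:          # e.g. the subsuming site is unavailable
        pass
    return fetch_b()
```

In the SABRE/United example, the United source is consulted only when SABRE is down or comes back empty.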

 

A comparison of the different systems:

  • Site oriented - Both IM and Razor are site oriented, while Tsimmis is not. By site oriented I mean that it is easy to add a new source simply by describing what it has. This involves a tradeoff between the system's speed at runtime and how easy it is to add new databases. In a static world where the databases and their formats rarely changed, it would be considerably more sensible to use Tsimmis; on the web, however, a more site oriented system is probably desirable.
  • Completeness - can you get all of the information, given what the sources had? In IM it is only possible if you are using the version with Duschka's algorithm; there were several iterations of IM, and not all of them were complete. Tsimmis was not complete; Razor was.
  • Local completeness reasoning - You can’t express it in Tsimmis, but both IM and Razor use it.
  • Other separating ideas are term definitions, interpreted predicates, word source and non-relational data.

 

Finally, the Alon Levy magical mystery list:

 

  1. Relevance of a source.
  2. Redundancy of a source.
  3. Irrelevance of a source.
  4. Unusability of a source.


(01/22/98) Wrapper Induction

This paper demonstrates machine learning techniques for automatically constructing wrappers from examples. Automatically learning wrappers is useful because wrappers are usually hand-coded, and are thus expensive to create and maintain.

These techniques are for learning wrapper procedures. To this end, wrappers are encoded using an LR-like language, where each attribute ki has delimiters li and ri indicating the beginning and end of that attribute. For example, in

Some country codes<P>
<B>Congo</B> <I>242</I><BR>
<B>Egypt</B> <I>20</I><BR>
<B>Belize</B> <I>501</I><BR>
<B>Spain</B> <I>34</I><BR>
attribute k1 is country name, which could have "<B>" as l1 and "</B>" as r1. The advantage of using this language is that to learn a wrapper procedure, the system need only learn values for <l1, r1, ..., lk, rk>. This is one of the main strengths of this paper.

Note that the values of l and r are not unique. For example, ">cr<B>", where cr is a carriage return, is another possible choice for l1. One purported advantage of this representation for wrappers is that delimiters do not need to be HTML tags, and might have nothing to do with HTML. However, members of the class were doubtful that the l and r delimiters would frequently be anything other than HTML tags.
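Executing an LR wrapper is a simple scan: find the next l_i, take everything up to r_i, and repeat until the page is exhausted. A minimal sketch (an illustrative reimplementation, not Kushmerick's code):

```python
def execute_lr(page, wrapper):
    # wrapper is [(l1, r1), ..., (lk, rk)]; returns a list of k-tuples.
    tuples, pos = [], 0
    while True:
        row = []
        for l, r in wrapper:
            start = page.find(l, pos)       # next occurrence of the left delimiter
            if start < 0:
                return tuples
            start += len(l)
            end = page.find(r, start)       # attribute runs up to the right delimiter
            if end < 0:
                return tuples
            row.append(page[start:end])
            pos = end + len(r)
        tuples.append(tuple(row))

rows = execute_lr(
    "Some country codes<P>\n"
    "<B>Congo</B> <I>242</I><BR>\n"
    "<B>Egypt</B> <I>20</I><BR>\n",
    [("<B>", "</B>"), ("<I>", "</I>")])
```

On the country-code page this yields one (name, code) tuple per listing.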

To generate a wrapper for a new domain, the system presented in this paper goes through three steps:

  1. Gather enough pages to satisfy the termination condition (PAC model)
  2. Label example pages
  3. Find an LR wrapper consistent with the examples
To stick to the sequence these topics were presented in in class, we'll discuss the third step, then the first, then the second.

Finding a wrapper

The naive approach to learning the values of <l1, r1, ..., lk, rk> would be to try all of the possibilities, which would run in O(S^(2K)) time, where S is the length of the shortest example and K is the number of attributes. However, by assuming that the attributes are independent we can reduce the running time to O(K*S).

The algorithm is fairly straightforward. A candidate for an attribute's l delimiter must end every stretch of text preceding an instance of that attribute, so look for a common suffix of those stretches; a candidate for its r delimiter must begin every stretch of text following an instance, so look for a common prefix. For example, to find r1, compare the text following each country name:

        </B> <I>242</I><BR>
        </B> <I>20</I><BR>

Both begin with "</B>", making it a valid choice for r1.
There was some debate in class whether it was better to favor the longest or the shortest possible delimiters. It was pointed out that this decision is based on whether you prefer to have false negatives or false positives, respectively.
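The prefix/suffix search can be sketched concretely. Here each labeled page is (text, spans), where spans[i] lists the (start, end) offsets of attribute i's instances; we take the longest common suffix of the preceding text as l_i and the longest common prefix of the following text as r_i. This is a sketch of the idea only; Kushmerick's algorithm additionally validates candidates against all examples.

```python
from os.path import commonprefix

def common_suffix(strings):
    # Longest common suffix = reversed common prefix of the reversals.
    return commonprefix([s[::-1] for s in strings])[::-1]

def learn_lr(pages):
    # pages: list of (text, spans); spans[i] = [(start, end), ...] for attr i.
    k = len(pages[0][1])
    wrapper = []
    for i in range(k):
        pre  = [text[:s] for text, spans in pages for s, e in spans[i]]
        post = [text[e:] for text, spans in pages for s, e in spans[i]]
        wrapper.append((common_suffix(pre), commonprefix(post)))
    return wrapper

page = ("Some country codes<P>\n"
        "<B>Congo</B> <I>242</I><BR>\n"
        "<B>Spain</B> <I>34</I><BR>\n")
spans = [[(page.index(c), page.index(c) + len(c)) for c in ("Congo", "Spain")],
         [(page.index(c), page.index(c) + len(c)) for c in ("242", "34")]]
wrapper = learn_lr([(page, spans)])
```

Note that the learned l1 here is ">\n<B>", not just "<B>", echoing the ">cr<B>" observation above: the delimiters need not be bare HTML tags.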

LR isn't general enough to describe all the wrappers one might want. For example, if "<B>" is used as an l delimiter, but also appears in the title, then that part of the title will incorrectly be considered an example. HLRT compensates for this by introducing h (head), a section at the beginning of the page to skip over, and t (tail), a section at the end of the page which is also skipped. Thus, an HLRT wrapper is described by:

<h, l1, r1, ..., lk, rk, t>
Things get trickier because l1, t, and h interact. This increases the running time of the algorithm to O(K*S^2), which is still better than the naive approach's running time of O(S^(2K+2)).

The PAC Model

The learning algorithm terminates using a PAC (probably approximately correct) model: with high probability, the learned wrapper is highly accurate. More precisely, if E(h) is the probability that hypothesis h is wrong on a single randomly selected instance, the algorithm stops when:
Prob(E(h) > eps) < delta

eps is known as the accuracy parameter, and delta is the confidence parameter.

Once you solve the yucky math, you discover that the predicted number of pages the algorithm must use as training examples is linear in 1/eps, logarithmic in 1/delta, and logarithmic in S. However, the PAC prediction is way too high; the system does far better in practice. It's not clear why the PAC bound is so loose. It could be because the pages aren't independent, which the model assumes, or that the bounds need to be tightened. The class didn't reach a consensus.

Labeling example pages

Labeling example pages relies on recognizers, which recognize instances of a particular attribute, such as country names or country codes. They may be perfect, unsound (produce false positives), incomplete (produce some false negatives), or unreliable (produce false positives and negatives.) Note that even with perfect recognizers wrappers may be needed, since wrappers must be fast while recognizers might not be. As long as one perfect recognizer and no unreliable recognizers are used, the corroboration algorithm used in the paper will succeed. If an example has missing attributes, the HLRT wrapper learner throws out that example (when training for that attribute, anyway,) and if there are unsound recognizers, the learner branches on those attributes.

Analysis

Wrapper induction appears to be practical. It took the system about 1 CPU minute to learn each wrapper, requiring 4-44 pages to reach 100% accuracy. The system is very robust: it can tolerate crummy recognizers and still not need many pages. Finally, it was determined that HLRT and SPIRAL are incomparable: SPIRAL can't handle non-HTML text, while HLRT can't handle hierarchical structures.


(01/27/98) ILA and Shopbot

Papers:

  1. Category Translation: Learning to Understand Information on the Internet, Perkowitz & Etzioni, IJCAI-95
  2. A Scalable Comparison-Shopping Agent for the World Wide Web, Doorenbos, Etzioni & Weld, 1997 Conference on Autonomous Agents (AGENTS-97), p39.


Auto Modeling of Internet Resources

  • How represent? (i.e., declarative modelling as in IM/Razor)
  1. Info content (or effect on world)
  2. Capabilities (binding patterns, remote join)
  3. Quality
  • How learn models automatically? (as in ILA)
  1. Discovery
  2. Protocol
  3. Semantics
  4. Quality

 

Possible project idea: Survey of models of Internet resources; which is best (for what purposes)?

 

ILA describes several subproblems of modeling (as above):

(Italicized problems are under-researched and good project opportunities)

  • Discovery: How to find new sources?
  • In the CD domain, might train on known CD stores to classify them; then, watch "what's new" lists for new CD stores.
  • Protocol: How to issue a query (addressed in ShopBot) and process results (wrappers)?
  • Semantics: Interpreting the meaning of the page
  • Quality: How to know how good info is: rating a site as by
  • Breadth (local completeness)
  • Accuracy
  • Speed
  • Availability
  • Maximization of some domain-specific quantity (i.e., savings on price) which may rely on semantics, user input, or past queries


Category Translation

  • Given
  • Incomplete internal world model

Some records may not have all fields; some records may be missing; this knowledge may be discovered in the learning process; however, it is not possible to learn new attributes for objects.

    • Objects
    • Attributes
  • External Info Source
    • k functions from string to ?
  • Determine
  • Query: lastname(person)
  • Response:
    • firstfield = firstname(person)
    • 4th field = mail-stop(department(person))


Correspondence Heuristic - the key thing that makes ILA work. Assumption (similar to Kushmerick's) that the format of a query's response is always the same. In contrast with Kushmerick's system, if ILA knows for sure that field 2 is the name, it's set. ILA can't be sure of the matches (semantics); HLRT can't be sure of the placement of tokens.
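The correspondence heuristic can be sketched as follows. The internal model and the response layout below are invented for illustration, and real ILA also hypothesizes composed functions like mail-stop(department(person)); this sketch only matches fields directly against known attribute values.

```python
# Incomplete internal world model: objects with known attribute values.
PEOPLE = [
    {"firstname": "Oren", "lastname": "Etzioni", "department": "CSE"},
    {"firstname": "Dan",  "lastname": "Weld",    "department": "CSE"},
]

def correspond(responses):
    # responses: (person, fields) pairs, where fields is the tuple the
    # external source returned for a query about that person.  For each field
    # position, keep only the attributes whose known value matched in every
    # response -- the correspondence heuristic assumes the format is fixed.
    n = len(responses[0][1])
    hyps = [set(PEOPLE[0]) for _ in range(n)]   # start with all attribute names
    for person, fields in responses:
        for j, value in enumerate(fields):
            hyps[j] &= {a for a in person if person[a] == value}
    return hyps

hyps = correspond([(PEOPLE[0], ("Oren", "CSE")),
                   (PEOPLE[1], ("Dan",  "CSE"))])
```

After two observations, field 0 can only be the first name and field 1 the department; a single observation would have left more hypotheses alive.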