University of Washington
Department of Computer Science and Engineering

CSE574 – Notes from Class Discussion

Date      Topic                                                        Scribe
1/8/98    Search Broker & Ahoy!                                        Tessa Lau
1/13/98   STIR, WHIRL & SPIRAL                                         Jason Staczek
1/15/98   Datalog, First-order Logic, and Description Logic examples   Marc Friedman
1/20/98   Information Integration Systems: Razor, TSIMMIS, IM          Rachel Pottinger
1/22/98   Wrapper Induction                                            Brian Michalowski
1/27/98   ILA and Shopbot                                              Steve Wolfman
1/29/98   Wrappers Continued                                           Adam Carlson
2/3/98    Strudel
2/5/98    Constraints and the Web                                      Dave Hsu
2/10/98   WebSQL                                                       Derrick Tucker
2/19/98   Recommender Systems                                          Corin Anderson

(01/08/98) Search Broker; Ahoy!

Notes by Tessa Lau

We discussed two papers:

  1. The Search Broker, by Udi Manber and Peter A. Bigot (Search Broker)
  2. Dynamic Reference Sifting: A Case Study in the Homepage Domain, by Jonathan Shakes, Marc Langheinrich, and Oren Etzioni (Ahoy!)

Search Broker

The Search Broker provides a common interface to a number of diverse information databases. The databases are organized into a two-level hierarchy. Each query includes as its first word a topic selector, which maps into a single database that provides the best information for that query. SB then performs the following four steps:
  1. Match topic to database
  2. Translate query into format for search engine (db)
    (each db has a hand-written template describing the query format)
  3. Send HTTP request (GET or POST) to search engine
  4. Present results to user
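As a rough illustration, the four steps might be sketched as follows. The topic table, URLs, and parameter names here are all invented; the real system uses a librarian-maintained topic hierarchy and a wrapper template per database.

```python
# Sketch of the Search Broker's four-step query loop. All names, URLs, and
# parameters below are hypothetical, for illustration only.
TOPIC_TO_DB = {
    "nutrition": {"url": "http://example.com/nutrition", "method": "GET",  "param": "q"},
    "weather":   {"url": "http://example.com/weather",   "method": "POST", "param": "city"},
}

def broker(query):
    topic, _, rest = query.partition(" ")
    db = TOPIC_TO_DB[topic]                      # 1. match topic to database
    request = {"url": db["url"],                 # 2. translate query via the db's template
               "method": db["method"],           # 3. this GET/POST would be sent over HTTP
               "params": {db["param"]: rest}}
    return request                               # 4. fetched results go back to the user

req = broker("nutrition fat in pizza")
```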

Strengths

  1. Incorporation of human input
  2. Access to the hidden web: dynamically-generated content
  3. High accuracy: librarians select the "best" info sources
  4. Large selection of databases covering many topics
  5. Simple wrappers (this is also a limitation)

Limitations

  1. Rudimentary topic selection
  2. Sensitivity to ontology or topic-db mapping
  3. Could use Bayes networks to disambiguate topic based on the query content
  4. Simple wrappers
  5. Not parallel: search engines are queried one-by-one
  6. No ranking (or clustering) of results. However, the librarian has already chosen the best search engine for each topic, so you could argue that ranking isn't necessary.
  7. Resource limitations: a fixed two-level hierarchy might need to grow arbitrarily large as the web grows

Paper evaluation

Everyone agreed that the paper was lacking in analysis and evaluation of the system. In particular, there was no discussion of how well the system performed, nor was there justification for their design decisions. However, Dan was extremely interested in the description of the wrapper language.

Comments

Several comments arose comparing the Search Broker to Jango/Excite and regular search engines like Metacrawler and Alta Vista. I think the comparison is a bit unfair, since people have different kinds of information needs and each system addresses a different need.

Jango's architecture and use of complex wrappers allow it to support advanced queries and format the results into a table. This is great for comparison-type queries, where there are several "correct" answers to the query and the user is interested in a comparison of all of them.

The Search Broker, on the other hand, attempts to answer fact-queries where there is only one correct answer (e.g., how much fat is in pizza?). In this case comparisons aren't as useful and SB's approach of having only one database per topic is appropriate.


Ahoy!

The Ahoy! homepage finder is based on the idea of using domain dependent heuristics to maximize precision and recall by filtering the output of a generalized search engine with high recall. The architecture described in the paper is called Dynamic Reference Sifting, which has several components:
  1. Reference source (not necessarily comprehensive)
  2. Cross filter: filter pages by institution
  3. Heuristic filter: test for "homepage-ness"
  4. Buckets: rank pages based on correct name, correct institution, homepage-ness
  5. URL generator: synthesizes candidate URLs when previous steps fail
  6. URL pattern learner
What classes are appropriate for DRS? Examples of such classes are:

System evaluation

Strengths: the system provides both high recall and high precision. Precision is especially important because it reduces the need to scan hundreds of false positives. It's able to bootstrap itself, which means that it's useful even before it has gathered any training examples. It is able to incorporate domain-dependent filters such as the institutional cross-filter and a test for homepage-ness.

Limitations: slow; must wait for all search engines to complete. DRS is not applicable in every domain.

Paper evaluation

Strengths: the experiments are comprehensive and (IMHO) convincing.

Limitations: the precision experiments aren't quite fair, since they only consider the top-ranked result rather than the first page of results. Also, the comparison of recall among the different systems is bogus since Ahoy!'s recall is bounded by Metacrawler's recall.

Ahoy! == DRS: it's also not clear whether DRS is general or whether it only applies to homepage-finding.


(01/13/98) STIR, WHIRL & SPIRAL

 

Notes by Jason Staczek

We discussed one paper, A Web-based Information System that Reasons with Structured Collections of Text, by William W. Cohen, AT&T Research. Dan opened with some background on relational database operations, vector representation of text documents, and document similarity calculations. A fair bit of the discussion was spent contrasting Nick Kushmerick’s HLRT wrappers (Wrapper Induction for Information Extraction, Kushmerick et al.) with Cohen’s conversion programs.

The paper does a poor job of describing the user experience. Instead, try the guided tour of the WHIRL system at Cohen’s home page.


Relational DB background

A relation is a fixed width data table, with each row (known as a tuple) consisting of values for named fields. Basic operations on a single relation are selection of tuples and projection (removal of fields or columns). Given two relations P and Q, their cartesian product is defined as:

P =

  A  B  C
  1  2  3
  5  5  4
  2  3  4
  4  3  5
  4  3  3
  1  2  3

Q =

  D  E
  3  1
  4  2

P × Q =

  A  B  C  D  E
  1  2  3  3  1
  1  2  3  4  2
  5  5  4  3  1
  5  5  4  4  2
  2  3  4  3  1
  2  3  4  4  2
  4  3  5  3  1
  4  3  5  4  2
  4  3  3  3  1
  4  3  3  4  2
  1  2  3  3  1
  1  2  3  4  2

Two (or more) relations may be joined by applying a predicate to their cartesian product to select only those tuples which satisfy the condition.

P ⋈ Q  (where C = D) =

  A  B  C  D  E
  1  2  3  3  1
  5  5  4  4  2
  2  3  4  4  2
  4  3  3  3  1
  1  2  3  3  1

Join notation may also be written in conjunctive form. The join described above would be:

T(A,B,C,D,E) :- P(A,B,C) ∧ Q(D,E) ∧ C = D

 

The predicate can be arbitrary (C < D, for example). Cohen describes a similarity operation on text fields (C~D) used to join relations from possibly unrelated sources.
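The product and join above can be reproduced in a few lines. A minimal Python sketch, which materializes the full cartesian product and filters it with an arbitrary predicate (fine for toy relations, though real query processors avoid materializing the product):

```python
from itertools import product

P = [(1, 2, 3), (5, 5, 4), (2, 3, 4), (4, 3, 5), (4, 3, 3), (1, 2, 3)]  # columns A, B, C
Q = [(3, 1), (4, 2)]                                                    # columns D, E

def join(p, q, pred):
    # Select from the cartesian product the tuples satisfying the predicate.
    return [pp + qq for pp, qq in product(p, q) if pred(pp, qq)]

# The C = D join from the tables above (C is P's third column, D is Q's first).
T = join(P, Q, lambda pp, qq: pp[2] == qq[0])
```

Any predicate works in the same way, e.g. `lambda pp, qq: pp[2] < qq[0]` for C < D.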

 

Similarity Predicate in WHIRL

Given a dictionary of terms T, documents (or text fields in a tuple) can be represented as vectors v, w in N-dimensional space, where N is the number of unique terms in T. For any term t, v_t is the magnitude of the component of v in the direction of term t. v_t is larger if t occurs frequently in v (TF_{v,t} = term frequency, the number of times t occurs in v), and smaller if t occurs frequently in other documents in the same column as v (captured by IDF_t, the inverse document frequency). Cohen gives v_t as basically:

    v_t = (log(TF_{v,t}) + 1) * log(IDF_t) / (normalization factor)

where the normalization factor guarantees 0 ≤ v_t ≤ 1. The similarity between documents v and w is then given by their dot product:

    SIM(v, w) = Σ_{t ∈ T} v_t * w_t

which is interpreted as the cosine of the angle between v and w; 0 ≤ SIM(v, w) ≤ 1, and the value is large when the documents share many important terms. Note that the dictionary employs stemming algorithms to handle suffix and tense differences (run = running, etc.), and may ignore unimportant terms, or stopwords (a, and, the, etc.).
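The scheme can be sketched as follows. This is a standard TF-IDF/cosine variant consistent with the formula above, not Cohen's actual implementation; IDF_t is approximated here as N/df_t, where df_t is the number of documents containing t.

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    # Weight term t in document d by (log TF + 1) * log IDF, then normalize
    # each vector to unit length so that SIM below is a cosine in [0, 1].
    n = len(docs)
    df = Counter(t for d in docs for t in set(d.split()))
    vecs = []
    for d in docs:
        tf = Counter(d.split())
        v = {t: (math.log(c) + 1) * math.log(n / df[t]) for t, c in tf.items()}
        norm = math.sqrt(sum(w * w for w in v.values())) or 1.0
        vecs.append({t: w / norm for t, w in v.items()})
    return vecs

def sim(v, w):
    # Dot product of unit vectors: the cosine of the angle between them.
    return sum(weight * w.get(t, 0.0) for t, weight in v.items())

docs = ["star wars", "star trek", "twelve monkeys"]
v = tfidf_vectors(docs)
```

Documents sharing an important term ("star") get a positive score; documents with no terms in common score zero.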

 

WHIRL, STIR, SPIRAL

SPIRAL is an information collection and retrieval system that supports logical database queries across multiple unrelated data sources. The SPIRAL architecture consists of a database builder, and a query processor known as WHIRL (Word-based Heterogeneous Information Retrieval Logic).

Information extraction and database construction

A database for a SPIRAL domain is built by extracting a set of relations from multiple unrelated data sources. Relations are constructed automatically by applying conversion programs to HTML sources. Conversion programs take HTML parse trees as input and attempt to match paths in the tree with supplied patterns. When a pattern is matched, data is extracted from a leaf node and classified according to the conversion program. As an example, the wrapper:

html - body - ul - li as movielisting - b as moviename

traverses the input tree to locate paths that look like html - body - ul - li. Leaf nodes with HTML attribute b are placed in a relation called movielisting under the field moviename. Conversion programs are hand-coded, and Cohen’s claim is that they take an average of three to four minutes to develop. Data is stored in STIR (Simple Texts in Relations) format, or free text in each tuple field. The wrapper language has additional controls to handle non-HTML-path formatted data (<br>-based), an escape to Perl, and the ability to do multiple passes with different conversion programs.
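The path-matching idea can be sketched with the standard library's HTML parser. The class, page, and field names below are invented for illustration; Cohen's conversion-program language is richer than this (the <br>-based controls, the Perl escape, multiple passes).

```python
from html.parser import HTMLParser

class PathExtractor(HTMLParser):
    # Toy analogue of a conversion program: walk the tag tree and, whenever
    # the current path of open tags ends with `path` and the leaf tag is
    # `leaf`, record the leaf's text as an extracted field value.
    def __init__(self, path, leaf):
        super().__init__()
        self.path, self.leaf = path, leaf
        self.stack, self.rows = [], []

    def handle_starttag(self, tag, attrs):
        self.stack.append(tag)

    def handle_endtag(self, tag):
        while self.stack and self.stack.pop() != tag:
            pass

    def handle_data(self, data):
        data = data.strip()
        if data and self.stack and self.stack[-1] == self.leaf \
                and self.stack[:-1][-len(self.path):] == self.path:
            self.rows.append(data)

page = ("<html><body><ul>"
        "<li><b>Blade Runner</b> 1982</li>"
        "<li><b>Brazil</b> 1985</li>"
        "</ul></body></html>")
p = PathExtractor(["html", "body", "ul", "li"], "b")
p.feed(page)
```

After `feed`, `p.rows` holds the text of each `<b>` leaf reached along the html - body - ul - li path.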

There was some discussion about the expressiveness of this wrapper language (less the Perl escape) as compared to Nick Kushmerick’s HLRT. Consensus seemed to be that they were about equally expressive, but Cohen’s is better equipped to deal with nested data, unless the nesting structure is known up front. It was suggested that HLRT was probably better at avoiding false positives in head and tail areas that might catch SPIRAL.

Query processing

Queries are applied to the database through WHIRL, an extension of Datalog. Queries are conjunctions of relation selectors and similarity predicates. WHIRL performs a "soft join" on selected relations by substituting the notion of equality with a similarity metric as described above. Rather than selecting tuples which meet an equality condition, WHIRL returns tuples ranked by similarity score. When the query contains joins of more than two relations, the results are sorted by the product of the similarities of each join.

WHIRL uses an unspecified A* search mechanism to evaluate and return only the most promising tuples. It was shown to perform better on at least two databases than the so-called naïve method of joining relations:

for each document d in the ith column of relation P:
    submit d as an IR-ranked retrieval query to the corpus corresponding to
        the jth column of relation Q (the lookup uses an inverted index)
    save the top r results
merge the top r results from each iteration to find the top k overall results

There was some speculation that WHIRL used IDF information to prune the search. There was an unresolved question about the asymptotic complexity of WHIRL compared to other optimization methods.
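The naïve method above as a runnable sketch; word overlap stands in for the TF-IDF similarity, and the inner scan over Q's column stands in for the inverted-index lookup:

```python
import heapq

def word_overlap(a, b):
    # Toy similarity in [0, 1]: Jaccard overlap of the documents' word sets.
    wa, wb = set(a.split()), set(b.split())
    return len(wa & wb) / max(len(wa | wb), 1)

def naive_soft_join(p_docs, q_docs, sim=word_overlap, r=3, k=5):
    # For each document in P's column, run it as a ranked retrieval query
    # against Q's column, keep the top r hits, then merge for the top k pairs.
    merged = []
    for i, pd in enumerate(p_docs):
        scored = [(sim(pd, qd), i, j) for j, qd in enumerate(q_docs)]
        merged.extend(heapq.nlargest(r, scored))
    return heapq.nlargest(k, merged)

pairs = naive_soft_join(["star wars", "blade runner"],
                        ["star wars special edition", "brazil", "blade runner"])
```

Each result is a (score, i, j) triple pairing the ith P document with the jth Q document, ranked by similarity.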

Strengths

Limitations

Comments


(01/15/98) Datalog, First-order Logic, and Description Logic examples

Datalog

Datalog is a function-free subset of prolog. Without negation, it is equivalent to Horn predicate logic. A datalog program is composed of rules, with a head (a relation), a ":-" symbol which is read as "if", and a body (a conjunction of relations). A relation is a function symbol, followed by some variable or constant arguments in parentheses. Any variable appearing in the head must appear in the body. For instance, we define paths in terms of edges in the datalog program:
path(X,Y) :- edge(X,Y).
path(X,Y) :- edge(X,Z) & path(Z,Y).
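This program can be evaluated bottom-up: start from the edge facts and apply the rules until no new facts appear. A minimal sketch:

```python
def paths(edges):
    # path(X,Y) :- edge(X,Y).
    path = set(edges)
    # path(X,Y) :- edge(X,Z) & path(Z,Y).  Apply to fixpoint.
    while True:
        new = {(x, y) for (x, z) in edges for (z2, y) in path if z == z2}
        if new <= path:
            return path
        path |= new

facts = paths({("a", "b"), ("b", "c"), ("c", "d")})
```

On the chain a -> b -> c -> d this derives all six reachable pairs, and nothing in the reverse direction.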

First-order Logic

For those familiar with first-order logic, the above program has a declarative semantics in FOL/English:
forall(X,Y)   ( edge(X,Y) => path(X,Y) ) AND
forall(X,Y,Z) ( edge(X,Z) ^ path(Z,Y) => path(X,Y) ).

Description Logic

Description logic is a different subset of first-order logic. In contrast to datalog,
  • There are no explicit variables.
  • There is negation.
  • There is limited disjunction.
An example (somewhat distorted by my weak HTML) is a query for papers all of whose authors are researchers, and whose authors include both an American and a non-American:
Paper ^ (ALL Paper-Author . Researcher)
      ^ (EXISTS Paper-Author . American)
      ^ (EXISTS Paper-Author . ~American)
Another example queries for the papers with at least two authors, all of whom are people, not chihuahuas:
Paper ^ (>=2 Paper-Author) ^ (ALL Paper-Author . Person)

Predicates, relations, and queries

There are two kinds of predicates in datalog: those that are enumerated (the edge predicate is assumed to be enumerated somewhere, in a database) and those that are derived. In the database theory community (i.e., Jeff Ullman's students), the terms for them are EDBs (extensional database predicates) and IDBs (intensional database predicates). A conjunctive query is a datalog rule, defining a new query predicate (perforce an IDB) in terms of other predicates. A view is defined in the same way as a query, though the term "view" usually indicates an IDB with extended existence and reuse, while a query is more often a one-shot deal.

Subsumption, query containment, and entailment

One question we can ask of our databases is, what are the ground facts (a.k.a. tuples) we can derive from them? For instance, what are the pairs (X,Y) such that path(X,Y)? This leads into the subject of evaluating a datalog query over a database, a subject which was studied to death by Ullman's group.

A more interesting question involves two queries. We gave two examples of queries in description logic in the previous section. The answers to the former will all be answers to the latter as well, so we say the latter query subsumes the former. An analogous notion applies to pairs of datalog queries. Query Q1 contains query Q2 iff all the facts (tuples) returned by Q2 are always returned by Q1, regardless of the database. Equivalently, in FOL, FORALL (X,Y) Q2(X,Y) => Q1(X,Y). If a query contains another, then a containment mapping exists between them. A containment mapping is a substitution of variables and constants (of the containee) for the variables of the container, such that each conjunct of the containee appears in the substituted container.
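Containment mapping can be made concrete with a brute-force search. This sketch represents a conjunctive query as (head_args, body_atoms), treats capitalized strings as variables, and tries every substitution of the containee's terms for the container's variables; the representation and the exhaustive (exponential) search strategy are assumptions of the sketch.

```python
from itertools import product

def is_var(t):
    # Variables are capitalized strings; anything else is a constant.
    return isinstance(t, str) and t[:1].isupper()

def contains(q1, q2):
    # Does Q1 contain Q2?  Search for a containment mapping h such that
    # h(head of Q1) = head of Q2 and every substituted conjunct of Q1
    # appears among Q2's conjuncts.
    head1, body1 = q1
    head2, body2 = q2
    vars1 = []
    for t in list(head1) + [t for _, args in body1 for t in args]:
        if is_var(t) and t not in vars1:
            vars1.append(t)
    terms2 = sorted({t for t in head2} | {t for _, args in body2 for t in args})
    atoms2 = {(p, tuple(args)) for p, args in body2}
    for combo in product(terms2, repeat=len(vars1)):
        h = dict(zip(vars1, combo))
        def sub(args):
            return tuple(h.get(t, t) for t in args)
        if sub(head1) == tuple(head2) and all((p, sub(args)) in atoms2
                                              for p, args in body1):
            return True
    return False

# Q1 asks for edges; Q2 asks for edges that also close a length-2 path.
q1 = (("X", "Y"), [("e", ("X", "Y"))])
q2 = (("X", "Y"), [("e", ("X", "Z")), ("e", ("Z", "Y")), ("e", ("X", "Y"))])
```

Here every answer to Q2 is an answer to Q1 (map Q1's X, Y to themselves), but not vice versa, so contains(q1, q2) holds and contains(q2, q1) does not.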

Answering queries using views

Suppose you have a set of EDB predicates in which you understand the universe. You have some WWW form-based interfaces to databases, v1 through vn, which are materialized views (IDBs) of these EDBs. Then suppose you get a query, defined solely in terms of the EDBs. Levy, Rajaraman, and Ordille give an algorithm to find a query defined solely in terms of the views that returns only answers to the original query, perhaps all the answers. This query, in contrast to the original, can actually be run.
Input is Q() :- q1(), q2(), ..., qn().
For each qi, find any view vj which is relevant to qi.
Substitute some combination of vj's for qi's to form new query Q'().
For each view in Q', substitute its definition in terms of EDBs, to get Q''().
Check that Q'' is contained in Q.
Repeat for all combinations.
This algorithm is exponential in the length of the input query, for two reasons. First, it must try all combinations of views relevant to each conjunct. Second, the inner loop does containment mapping, which is exponential in the number of repeated predicate symbols in the query.

Miscellaneous

Also discussed were capabilities of a data source. An ordinary "traditional relational" database allows you to download all facts in the whole data relation, or query it on equality with any field. Negative capabilities would imply a more restrictive input/output scheme. Positive capabilities would include SQL or other more complex input, to be processed by the source.


(01/20/98) Information Integration Systems: Razor, TSIMMIS, IM

  • The paper has two bugs. The first is right before equation 11 on the page with section 5: the call to Contains with C should be a call to Contains with C’. The second is in equation 12: there should be no primes.
  • Whether the arrows should be single- or double-headed turned out to be related to what the notation in datalog means. The datalog expression p(x,y) :- e(x,y) logically translates to ∀x,y e(x,y) -> p(x,y); the question was whether it also means p(x,y) -> e(x,y).
  • If there is more than one such rule, then the inverse involves taking the union
  • For the purpose of this paper, if it says implies, it only means implies, not implying the converse
  • This is acceptable because, among other reasons, it should never be assumed that anything is complete.
  • In a web system, it is important to be able to express both:
  • Web site x has reviews for movies
  • Web site y has all reviews for movie z
  • Razor is derived from IM (Information Manifold).
  • The answer to how you take a query in world relation and create a plan to query databases is described in detail in O. Duschka and A. Levy. Recursive plans for information gathering. In Proceedings of the 15th International Joint Conference on AI, 1997.
  • In an example of trying to find reviews of all of the movies starring Harvey Keitel that were playing in Seattle after finding the name of movies in the Internet Movie Database, there are two different places that the user can go for reviews; Ebert, which provides only reviews by Roger Ebert, and Movie Link which provides, while not an exhaustive list, certainly a large subset of them. The Razor paper describes three different methods of finding information that is assumed to be "locally complete." This term means that it is guaranteed to find all of the information that is available given only the sites that they went to. All involve the notion of subsuming.
  • A source X is said to subsume another source Y if all of the information stored in Y can also be found in source X. For example SABRE subsumes United, because it contains all of the information about all of the flights on United, but it does not subsume Southwest, because it does not provide information about Southwest.
  • Information sources that are subsumed by another should not be gotten rid of, however, in case the subsuming site is unavailable. Thus, there are (at least) three possible execution policies, all of which become more interesting if there are more (resource) limits than are currently imposed on web systems:
  1. Brute force – Just ignore subsumption, and execute everything greedily. This method annoys both servers because of the large amount of traffic that is generated and clients because the clients then have to wait for and sift through all of the information from all of the sites, even if it is redundant.
  2. Aggressive – Execute both alternatives in parallel (A, the subsuming source, and B, the subsumed one), but cancel all communication with B once A has successfully returned. On the web this does not make life any better for the server, because all of the processing has to be done anyway, and only the bandwidth for returning the information is saved. However, it does aid the client, who then does not have to go through duplicate data.
  3. Frugal – Initially, run only A, if A fails, then run B. On the web, this method currently does not have many advantages over aggressive from the client side; the benefit is only in saving one call, but that really doesn’t make that much of a difference even on computers with low bandwidth. If there was a charge for accessing information, then it would make sense, but that’s not likely to happen "for at least the next 12 months" [Weld 98]. On the server side this is obviously preferable because the server doesn’t have to service useless calls.
  • The advantages of the frugal system are less likely to be seen because people don’t like to wait. Some suggested that this can be ameliorated by showing that much progress has been made, or that at least something is going on. The question of whether users can be expected to wait for more complex queries was also raised.
  • From the WHIRL paper, we know that joining is a very difficult operation, and the question was raised as to how well Razor handled it. The answer was "not well."
  • Sources either needed to be rigid or there needed to be really good wrappers.
  • The current system attempts to normalize their response, but sometimes it fails. One member of the project team was heard to say, "It sucks."
  • In a related note, the question of how the system would deal with identical articles was also brought up. The answer was that if both articles were exactly the same, including the representation of the author’s name, they would probably be recognized as the same and not duplicated. Otherwise they would probably be duplicated. Whether or not the database community has handled these sorts of situations (an author listed as J. Smith in one table and John Smith in another) was raised, and the answer was that it has not been looked into very much; perhaps WHIRL is the best thus far. The methods thus far have all been very domain specific, there is no domain independent way to do so (this was mentioned to be a good project idea). One suggestion was to use probabilistic information about two objects to see if they are equal.
  • In general systems tend to do a bad job of searching different forms on the same database. If there are two different access patterns to the same data, then two forms are needed.
  • Marc said that one difficulty in building the system is that if the system you are relying on to get your data from gets smarter, then your system must become smarter in order to avoid obsolescence. Dan and others disagreed saying that the system would still make it so that only one web site would have to be searched
  • The limits of local completeness as implemented in the Razor system were counted to be the following:
  • No negation; for example you cannot ask for all of the reviews except those by Ebert.
  • No disjunction on the right hand side.
  • No way to express that the union of two sources subsumes another source.
  • No less-than, no equality (where equality refers to equality on the same variable), and no numbers.
  • While the right hand side can refer to sources and many other things, same as the world ontology, the left hand side can only be the source.
  • It is impossible to represent the idea that two data sets are disjoint.
  • Dan raised the question of whether local completeness is useful. The answer degenerated into a discussion of its limitations. However, it was mentioned that it is helpful in situations where you’re interested in making sure you have "all" of the information, and with local completeness you are assured of knowing when to stop.
  • Do restrictions on a subject make it easier or harder? For example, is it easier to say that there is a database that has all information on cars, or a database that has all information on American cars manufactured between 1972 and 1983? This is important for figuring out which information is irrelevant and if one source subsumes another. On the web, however, you rarely want to say that something contains "all" information because it’s hard to believe that anything is exhaustive.
  • The paper contained some illustrations about the way that information was gathered and joined together, and the question was what is the relationship between that representation and a datalog query. The answer was that it was the same, but the graphical version was easier for the average person to understand.
  • In the picture, no loops corresponded to the datalog statement not being recursive. An example of where you might run into a loop was that you wanted all of the ancestors of Harvey Keitel and had a database that would list parents of a person. You would enter Mr. Keitel, get his parents out of the system, feed them back in, get their parents, etc.
  • The graphs and the local completeness they represent are not unique. They are all equivalent in output, but not in speed; this aspect is not addressed in the paper.
  • In practice, whenever you get a loop, it often turns out that you really don’t want it.
  • If this technology was portable and integrated (taking the best from Whirl, Tsimmis, Razor, etc) would people want it?
  • It isn’t domain specific which would make it more desirable.
  • Some suggested applications were books and cars.
  • The "killer app" for the systems, however, is not the web but companies with large numbers of large databases. They often have no idea what is in them or how to relate the data in one to the data in another regardless of whether or not the data is already in the same format across the databases.
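The frugal execution policy discussed above is just try-then-fallback. A minimal sketch, where fetch_a and fetch_b are hypothetical stand-ins for wrapper calls to a subsuming source A and a subsumed source B (aggressive execution would instead launch both and cancel B):

```python
def frugal(fetch_a, fetch_b):
    # Query the subsuming source first; fall back to the subsumed source
    # only if the first call fails outright or returns nothing.
    try:
        result = fetch_a()
        if result is not None:
            return result
    except OSError:          # e.g. the subsuming site is unavailable
        pass
    return fetch_b()
```

In the SABRE/United example, the United source is consulted only when SABRE is down or comes back empty.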

 

A comparison of the different systems:

  • Site oriented - Both IM and Razor are site oriented, while Tsimmis is not. By site oriented I mean that it is easy to add a new source simply by describing what it has. This involves a tradeoff between the system's speed at runtime and how easy it is to add new databases. In a static world where the databases and their formats rarely changed, it would be considerably more sensible to use Tsimmis; on the web, however, a more site oriented system is probably desirable.
  • Completeness - can you get all of the information, given what the sources had? In IM it is only possible if you are using the version with Duschka's algorithm; there were several iterations of IM, and not all of them were complete. Tsimmis was not complete; Razor was.
  • Local completeness reasoning - You can’t express it in Tsimmis, but both IM and Razor use it.
  • Other separating ideas are term definitions, interpreted predicates, word source and non-relational data.

 

Finally, the Alon Levy magical mystery list:

 

  1. Relevance of a source.
  2. Redundancy of a source.
  3. Irrelevance of a source.
  4. Unusability of a source.


(01/22/98) Wrapper Induction

This paper demonstrates machine learning techniques for automatically constructing wrappers from examples. Automatically learning wrappers is useful because wrappers are usually hand-coded, and are thus expensive to create and maintain.

These techniques are for learning wrapper procedures. To this end, wrappers are encoded using an LR-like language, where each attribute ki has delimiters li and ri indicating the beginning and end of that attribute. For example, in

Some country codes<P>
<B>Congo</B> <I>242</I><BR>
<B>Egypt</B> <I>20</I><BR>
<B>Belize</B> <I>501</I><BR>
<B>Spain</B> <I>34</I><BR>
attribute k1 is country name, which could have "<B>" as l1 and "</B>" as r1. The advantage of using this language is that to learn a wrapper procedure, the system need only learn values for <l1, r1, ..., lk, rk>. This is one of the main strengths of this paper.

Note that the values of l and r are not unique. For example, ">cr<B>", where cr is a carriage return, is another possible choice for l1. One purported advantage of this representation for wrappers is that delimiters do not need to be HTML tags, and might have nothing to do with HTML. However, members of the class were doubtful that the l and r delimiters would frequently be anything other than HTML tags.
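Executing an LR wrapper is a simple scan: find the next l_i, take everything up to r_i, and repeat until the page is exhausted. A minimal sketch (an illustrative reimplementation, not Kushmerick's code):

```python
def execute_lr(page, wrapper):
    # wrapper is [(l1, r1), ..., (lk, rk)]; returns a list of k-tuples.
    tuples, pos = [], 0
    while True:
        row = []
        for l, r in wrapper:
            start = page.find(l, pos)       # next occurrence of the left delimiter
            if start < 0:
                return tuples
            start += len(l)
            end = page.find(r, start)       # attribute runs up to the right delimiter
            if end < 0:
                return tuples
            row.append(page[start:end])
            pos = end + len(r)
        tuples.append(tuple(row))

rows = execute_lr(
    "Some country codes<P>\n"
    "<B>Congo</B> <I>242</I><BR>\n"
    "<B>Egypt</B> <I>20</I><BR>\n",
    [("<B>", "</B>"), ("<I>", "</I>")])
```

On the country-code page this yields one (name, code) tuple per listing.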

To generate a wrapper for a new domain, the system presented in this paper goes through three steps:

  1. Gather enough pages to satisfy the termination condition (PAC model)
  2. Label example pages
  3. Find an LR wrapper consistent with the examples
To stick to the sequence these topics were presented in in class, we'll discuss the third step, then the first, then the second.

Finding a wrapper

The naive approach to learning the values of <l1, r1, ..., lk, rk> would be to try all of the possibilities, which would run in O(S^(2K)) time, where S is the length of the shortest example and K is the number of attributes. However, by assuming that the attributes are independent we can reduce the running time to O(K*S).

The algorithm is fairly straightforward. A candidate for an attribute's l delimiter must end every stretch of text preceding an instance of that attribute, so look for a common suffix of those stretches; a candidate for its r delimiter must begin every stretch of text following an instance, so look for a common prefix. For example, to find r1, compare the text following each country name:

        </B> <I>242</I><BR>
        </B> <I>20</I><BR>

Both begin with "</B>", making it a valid choice for r1.
There was some debate in class whether it was better to favor the longest or the shortest possible delimiters. It was pointed out that this decision is based on whether you prefer to have false negatives or false positives, respectively.
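The prefix/suffix search can be sketched concretely. Here each labeled page is (text, spans), where spans[i] lists the (start, end) offsets of attribute i's instances; we take the longest common suffix of the preceding text as l_i and the longest common prefix of the following text as r_i. This is a sketch of the idea only; Kushmerick's algorithm additionally validates candidates against all examples.

```python
from os.path import commonprefix

def common_suffix(strings):
    # Longest common suffix = reversed common prefix of the reversals.
    return commonprefix([s[::-1] for s in strings])[::-1]

def learn_lr(pages):
    # pages: list of (text, spans); spans[i] = [(start, end), ...] for attr i.
    k = len(pages[0][1])
    wrapper = []
    for i in range(k):
        pre  = [text[:s] for text, spans in pages for s, e in spans[i]]
        post = [text[e:] for text, spans in pages for s, e in spans[i]]
        wrapper.append((common_suffix(pre), commonprefix(post)))
    return wrapper

page = ("Some country codes<P>\n"
        "<B>Congo</B> <I>242</I><BR>\n"
        "<B>Spain</B> <I>34</I><BR>\n")
spans = [[(page.index(c), page.index(c) + len(c)) for c in ("Congo", "Spain")],
         [(page.index(c), page.index(c) + len(c)) for c in ("242", "34")]]
wrapper = learn_lr([(page, spans)])
```

Note that the learned l1 here is ">\n<B>", not just "<B>", echoing the ">cr<B>" observation above: the delimiters need not be bare HTML tags.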

LR isn't general enough to describe all the wrappers one might want. For example, if "<B>" is used as an l delimiter, but also appears in the title, then that part of the title will incorrectly be considered an example. HLRT compensates for this by introducing h (head), a section at the beginning of the page to skip over, and t (tail), a section at the end of the page which is also skipped. Thus, an HLRT wrapper is described by:

<h, l1, r1, ..., lk, rk, t>
Things get trickier because l1, t, and h interact. This increases the running time of the algorithm to O(K*S^2), which is still better than the naive approach's running time of O(S^(2K+2)).

The PAC Model

The learning algorithm terminates using a PAC (probably approximately correct) model: with high probability, the learned wrapper is highly accurate. More precisely, if E(h) is the probability that hypothesis h is wrong on a single randomly selected instance, the algorithm stops when:
Prob(E(h) > eps) < delta

eps is known as the accuracy parameter, and delta is the confidence parameter.

Once you solve the yucky math, you discover that the predicted number of pages the algorithm must use as training examples is linear in 1/eps, logarithmic in 1/delta, and logarithmic in S. However, the PAC prediction is way too high; the system does far better in practice. It's not clear why the PAC bound is so loose. It could be because the pages aren't independent, which the model assumes, or that the bounds need to be tightened. The class didn't reach a consensus.

Labeling example pages

Labeling example pages relies on recognizers, which recognize instances of a particular attribute, such as country names or country codes. They may be perfect, unsound (produce false positives), incomplete (produce some false negatives), or unreliable (produce false positives and negatives.) Note that even with perfect recognizers wrappers may be needed, since wrappers must be fast while recognizers might not be. As long as one perfect recognizer and no unreliable recognizers are used, the corroboration algorithm used in the paper will succeed. If an example has missing attributes, the HLRT wrapper learner throws out that example (when training for that attribute, anyway,) and if there are unsound recognizers, the learner branches on those attributes.

Analysis

Wrapper induction appears to be practical. It took the system about 1 CPU minute to learn each wrapper, requiring 4-44 pages to reach 100% accuracy. The system is very robust: it can tolerate crummy recognizers and still not need many pages. Finally, it was determined that HLRT and SPIRAL are incomparable: SPIRAL can't handle non-HTML text, while HLRT can't handle hierarchical structures.


(01/27/98) ILA and Shopbot

Papers:

  1. Category Translation: Learning to Understand Information on the Internet, Perkowitz & Etzioni, IJCAI-95
  2. A Scalable Comparison-Shopping Agent for the World Wide Web, Doorenbos, Etzioni & Weld, 1997 Conference on Autonomous Agents (AGENTS-97), p39.


Auto Modeling of Internet Resources

  • How represent? (i.e., declarative modelling as in IM/Razor)
  1. Info content (or effect on world)
  2. Capabilities (binding patterns, remote join)
  3. Quality
  • How learn models automatically? (as in ILA)
  1. Discovery
  2. Protocol
  3. Semantics
  4. Quality

 

Possible project idea: Survey of models of Internet resources; which is best (for what purposes)?

 

ILA describes several subproblems of modeling (as above):

(Italicized problems are under-researched and good project opportunities)

  • Discovery: How to find new sources?
  • In the CD domain, might train on known CD stores to classify them; then, watch "what's new" lists for new CD stores.
  • Protocol: How to issue a query (addressed in ShopBot) and process results (wrappers)?
  • Semantics: Interpreting the meaning of the page
  • Quality: How to know how good info is: rating a site as by
  • Breadth (local completeness)
  • Accuracy
  • Speed
  • Availability
  • Maximization of some domain-specific quantity (i.e., savings on price) which may rely on semantics, user input, or past queries


Category Translation

  • Given
  • Incomplete internal world model

Some records may not have all fields; some records may be missing; this knowledge may be discovered in the learning process; however, it is not possible to learn new attributes for objects.

    • Objects
    • Attributes
  • External Info Source
    • k functions from string to ?
  • Determine
  • Query: lastname(person)
  • Response:
    • firstfield = firstname(person)
    • 4th field = mail-stop(department(person))


Correspondence Heuristic - the key thing that makes ILA work. Assumption (similar to Kushmerick's) that the format of a query's response is always the same. In contrast with Kushmerick's system, if ILA knows for sure that field 2 is the name, it's set. ILA can't be sure of the matches (semantics); HLRT can't be sure of the placement of tokens.
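The correspondence heuristic can be sketched as follows. The internal model and the response layout below are invented for illustration, and real ILA also hypothesizes composed functions like mail-stop(department(person)); this sketch only matches fields directly against known attribute values.

```python
# Incomplete internal world model: objects with known attribute values.
PEOPLE = [
    {"firstname": "Oren", "lastname": "Etzioni", "department": "CSE"},
    {"firstname": "Dan",  "lastname": "Weld",    "department": "CSE"},
]

def correspond(responses):
    # responses: (person, fields) pairs, where fields is the tuple the
    # external source returned for a query about that person.  For each field
    # position, keep only the attributes whose known value matched in every
    # response -- the correspondence heuristic assumes the format is fixed.
    n = len(responses[0][1])
    hyps = [set(PEOPLE[0]) for _ in range(n)]   # start with all attribute names
    for person, fields in responses:
        for j, value in enumerate(fields):
            hyps[j] &= {a for a in person if person[a] == value}
    return hyps

hyps = correspond([(PEOPLE[0], ("Oren", "CSE")),
                   (PEOPLE[1], ("Dan",  "CSE"))])
```

After two observations, field 0 can only be the first name and field 1 the department; a single observation would have left more hypotheses alive.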