Date | Topic | Scribe |
1/8/98 | Search Broker & Ahoy! | Tessa Lau |
1/13/98 | STIR, WHIRL & SPIRAL | Jason Staczek |
1/15/98 | Datalog, First-order Logic, and Description Logic examples | Marc Friedman |
1/20/98 | Information Integration Systems: Razor, TSIMMIS, IM | Rachel Pottinger |
1/22/98 | Wrapper Induction | Brian Michalowski |
1/27/98 | ILA and Shopbot | Steve Wolfman |
1/29/98 | Wrappers Continued | Adam Carlson |
2/3/98 | Strudel | |
2/5/98 | Constraints and the Web | Dave Hsu |
2/10/98 | WebSQL | Derrick Tucker |
2/19/98 | Recommender Systems | Corin Anderson |
(01/08/98) Search Broker; Ahoy!
Notes by Tessa Lau
We discussed two papers:
Jango's architecture and use of complex wrappers allow it to support advanced
queries and format the results into a table. This is great for
comparison-type queries, where there are several "correct" answers to the
query and the user is interested in a comparison of all of them.
The Search Broker, on the other hand, attempts to answer fact-queries where
there is only one correct answer (e.g., how much fat is in pizza?). In this
case comparisons aren't as useful and SB's approach of having only one
database per topic is appropriate.
Search Broker
The Search Broker provides a common interface to a number of diverse
information databases. The databases are organized into a two-level
hierarchy. Each query includes as its first word a topic selector, which maps
into a single database that provides the best information for that query. SB
then performs the following four steps:
Each db has a template:
Strengths
Limitations
Paper evaluation
Everyone agreed that the paper was lacking in analysis and evaluation of the
system. In particular, there was no discussion of how well the system
performed, nor was there justification for their design decisions. However,
Dan was extremely interested in the description of the wrapper language.
Comments
Several comments arose comparing the Search Broker to Jango/Excite and regular
search engines like MetaCrawler and AltaVista. I think the comparison is a
bit unfair, since people have different kinds of information needs and each
system addresses a different need.
Ahoy!
The Ahoy! homepage finder is based on the idea of using domain-dependent
heuristics to maximize precision and recall by filtering the output of a
generalized search engine with high recall. The architecture described in the
paper is called Dynamic Reference Sifting, which has several components:
What classes are appropriate for DRS?
Examples of such classes are:
System evaluation
Strengths: the system provides both high recall and high precision.
Precision is especially important because it reduces the need to scan hundreds
of false positives. It's able to bootstrap itself, which means that it's
useful even before it has gathered any training examples. It is able to
incorporate domain-dependent filters such as the institutional cross-filter
and a test for homepage-ness.
Limitations: slow; the user must wait for all of the search engines to complete. DRS is not applicable in every domain, and since Ahoy! == DRS, it's also not clear whether DRS is general or whether it only applies to homepage-finding.
Paper evaluation
Strengths: the experiments are comprehensive and (IMHO)
convincing.
Limitations: the precision experiments aren't quite fair, since they only consider the top-ranked result rather than the first page of results. Also, the comparison of recall among the different systems is bogus, since Ahoy!'s recall is bounded by MetaCrawler's recall.
(01/13/98) STIR, WHIRL & SPIRAL
Notes by Jason Staczek
We discussed one paper, A Web-based Information System that Reasons with Structured Collections of Text, by William W. Cohen, AT&T Research. Dan opened with some background on relational database operations, vector representations of text documents, and document similarity calculations. A fair bit of the discussion was spent contrasting Nick Kushmerick's HLRT wrappers (Wrapper Induction for Information Extraction, Kushmerick et al.) with Cohen's conversion programs. The paper does a poor job of describing the user experience; instead, try the guided tour of the WHIRL system at Cohen's home page.
A relation is a fixed-width data table, with each row (known as a tuple) consisting of values for named fields. Basic operations on a single relation are selection of tuples and projection (removal of fields, or columns). Given two relations P and Q, their cartesian product is defined as:
P =
A | B | C
1 | 2 | 3
5 | 5 | 4
2 | 3 | 4
4 | 3 | 5
4 | 3 | 3
1 | 2 | 3

Q =
D | E
3 | 1
4 | 2

P × Q =
A | B | C | D | E
1 | 2 | 3 | 3 | 1
1 | 2 | 3 | 4 | 2
5 | 5 | 4 | 3 | 1
5 | 5 | 4 | 4 | 2
2 | 3 | 4 | 3 | 1
2 | 3 | 4 | 4 | 2
4 | 3 | 5 | 3 | 1
4 | 3 | 5 | 4 | 2
4 | 3 | 3 | 3 | 1
4 | 3 | 3 | 4 | 2
1 | 2 | 3 | 3 | 1
1 | 2 | 3 | 4 | 2
Two (or more) relations may be joined by applying a predicate to their cartesian product to select only those tuples which satisfy the condition.
P ×(C = D) Q =
A | B | C | D | E
1 | 2 | 3 | 3 | 1
5 | 5 | 4 | 4 | 2
2 | 3 | 4 | 4 | 2
4 | 3 | 3 | 3 | 1
1 | 2 | 3 | 3 | 1
Join notation may also be written in conjunctive form. The join described above would be:
T(A,B,C,D,E) :- P(A,B,C) ∧ Q(D,E) ∧ C = D
The predicate can be arbitrary (C < D, for example). Cohen describes a similarity operation on text fields (C~D) used to join relations from possibly unrelated sources.
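To make the operations just described concrete, here is a minimal sketch in plain Python; the relation values are the ones from the tables above, but nothing here comes from Cohen's implementation:

    # Relations as lists of tuples, using the example schemas P(A,B,C), Q(D,E).
    from itertools import product

    P = [(1, 2, 3), (5, 5, 4), (2, 3, 4), (4, 3, 5), (4, 3, 3), (1, 2, 3)]
    Q = [(3, 1), (4, 2)]

    def cartesian(p, q):
        """Every tuple of p concatenated with every tuple of q."""
        return [pt + qt for pt, qt in product(p, q)]

    def join(p, q, pred):
        """Keep only those product tuples that satisfy the predicate."""
        return [t for t in cartesian(p, q) if pred(t)]

    # T(A,B,C,D,E) :- P(A,B,C) ∧ Q(D,E) ∧ C = D  (C is field 2, D is field 3)
    T = join(P, Q, lambda t: t[2] == t[3])
    print(T)  # the five tuples in the joined table above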
Given a dictionary of terms T, documents (or text fields in a tuple) can be represented as vectors v, w in N-dimensional space, where N is the number of unique terms in T. For any term t in v, v_t is the magnitude of the component of v in the direction of term t. v_t is larger if t is mentioned frequently in v (TF = term frequency, or number of times t is mentioned in v). v_t is smaller if t is mentioned frequently in other documents in the same column as v (IDF = inverse document frequency). Cohen gives v_t as basically:
v_t = ((log(TF_{v,t}) + 1) * log(IDF_t)) / (normalization factor)
to guarantee 0 ≤ v_t ≤ 1. The similarity between documents v and w is then given by their dot product:
SIM(v, w) = Σ_{t ∈ T} v_t * w_t
which is interpreted as the cosine of the angle between v and w. 0 ≤ SIM(v, w) ≤ 1, and it will be large if the documents share many important terms. Note that the dictionary employs stemming algorithms to handle suffix and tense differences (run = running, etc.), and may ignore unimportant terms, or stopwords (a, and, the, etc.).
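As a hedged illustration of this scoring, here is a toy TF-IDF/cosine computation in Python; the exact weighting and normalization WHIRL uses may differ in detail:

    import math
    from collections import Counter

    def tfidf_vector(doc, corpus):
        """Unit-length TF-IDF vector for a whitespace-tokenized document."""
        n = len(corpus)
        vec = {}
        for t, f in Counter(doc.split()).items():
            df = sum(1 for d in corpus if t in d.split())   # document frequency
            vec[t] = (math.log(f) + 1) * math.log(n / df)   # (log TF + 1) * log IDF
        norm = math.sqrt(sum(w * w for w in vec.values())) or 1.0
        return {t: w / norm for t, w in vec.items()}        # so 0 <= v_t <= 1

    def sim(v, w):
        """Dot product = cosine of the angle between unit vectors v and w."""
        return sum(weight * w.get(t, 0.0) for t, weight in v.items())

    corpus = ["roman holiday", "the roman empire", "empire records"]
    v, w = tfidf_vector(corpus[0], corpus), tfidf_vector(corpus[1], corpus)
    print(sim(v, w))  # nonzero: both documents share the fairly rare term "roman"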
SPIRAL is an information collection and retrieval system that supports logical database queries across multiple unrelated data sources. The SPIRAL architecture consists of a database builder, and a query processor known as WHIRL (Word-based Heterogeneous Information Retrieval Logic).
Information extraction and database construction
A database for a SPIRAL domain is built by extracting a set of relations from multiple unrelated data sources. Relations are constructed automatically by applying conversion programs to HTML sources. Conversion programs take HTML parse trees as input and attempt to match paths in the tree with supplied patterns. When a pattern is matched, data is extracted from a leaf node and classified according to the conversion program. As an example, the wrapper:
html - body - ul - li as movielisting - b as moviename
traverses the input tree to locate paths that look like
html - body - ul - li. Leaf nodes with HTML tag b are placed in a relation called movielisting under the field moviename. Conversion programs are hand-coded, and Cohen's claim is that they take an average of three to four minutes to develop. Data is stored in STIR (Simple Texts in Relations) format, i.e., free text in each tuple field. The wrapper language has additional controls to handle non-HTML-path formatted data (<br>-based), an escape to Perl, and the ability to make multiple passes with different conversion programs.
There was some discussion about the expressiveness of this wrapper language (less the Perl escape) as compared to Nick Kushmerick's HLRT. Consensus seemed to be that they were about equally expressive, but Cohen's is better equipped to deal with nested data, unless the nesting structure is known up front. It was suggested that HLRT was probably better at avoiding false positives in head and tail areas that might catch SPIRAL.
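To give a feel for how such a conversion program might work, here is a hedged Python sketch using only the standard library. The relation and field names follow the movielisting example above, but the real wrapper language (Perl escapes, multiple passes, <br> handling) is richer than this:

    from html.parser import HTMLParser

    class PathExtractor(HTMLParser):
        """Collect text of leaf nodes whose open-tag path ends with `pattern`."""
        def __init__(self, pattern):
            super().__init__()
            self.pattern, self.path, self.rows = pattern, [], []

        def handle_starttag(self, tag, attrs):
            self.path.append(tag)

        def handle_endtag(self, tag):
            if tag in self.path:                 # pop back to the matching open tag
                del self.path[self.path.index(tag):]

        def handle_data(self, data):
            if data.strip() and self.path[-len(self.pattern):] == self.pattern:
                self.rows.append(data.strip())

    page = ("<html><body><ul><li><b>Vertigo</b> (1958)</li>"
            "<li><b>Notorious</b> (1946)</li></ul></body></html>")

    # html - body - ul - li as movielisting - b as moviename
    extractor = PathExtractor(["html", "body", "ul", "li", "b"])
    extractor.feed(page)
    movielisting = [{"moviename": name} for name in extractor.rows]
    print(movielisting)  # [{'moviename': 'Vertigo'}, {'moviename': 'Notorious'}]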
Query processing
Queries are applied to the database through WHIRL, an extension of Datalog. Queries are conjunctions of relation selectors and similarity predicates. WHIRL performs a "soft join" on selected relations by substituting the notion of equality with a similarity metric as described above. Rather than selecting tuples which meet an equality condition, WHIRL returns tuples ranked by similarity score. When the query contains joins of more than two relations, the results are sorted by the product of the similarities of each join.
WHIRL uses an unspecified A* search mechanism to evaluate and return only the most promising tuples. It was shown to perform better on at least two databases than the so-called naïve method of joining relations:
for each document in the ith column of relation P:
    submit it as an IR-ranked retrieval query to the corpus corresponding to
        the jth column of relation Q (the lookup uses an inverted index)
    save the top r results
merge the top r results from each iteration to find the top k overall results
There was some speculation that WHIRL used IDF information to prune the search. There was an unresolved question about the asymptotic complexity of WHIRL compared to other optimization methods.
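For comparison, here is a hedged Python sketch of the naïve method above, without an inverted index; `tfidf_vector` and `sim` are the toy functions from the earlier sketch, not WHIRL's own routines:

    import heapq

    def naive_soft_join(p, q, i, j, vectorize, sim, k=10):
        """Rank p x q pairs by similarity of p's i-th and q's j-th text fields."""
        corpus = [t[i] for t in p] + [t[j] for t in q]
        vp = [vectorize(t[i], corpus) for t in p]
        vq = [vectorize(t[j], corpus) for t in q]
        scored = ((sim(vp[a], vq[b]), p[a] + q[b])
                  for a in range(len(p)) for b in range(len(q)))
        return heapq.nlargest(k, scored, key=lambda pair: pair[0])

    # e.g. naive_soft_join(P, Q, 0, 0, tfidf_vector, sim, k=5) for relations of
    # text tuples; WHIRL prunes this quadratic search with A* and inverted indices.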
(01/15/98) Datalog, First-order Logic, and Description Logic examples
Datalog
Datalog is a function-free subset of Prolog. Without negation,
it is equivalent to Horn predicate logic. A datalog program
is composed of rules, with a head (a relation), a ":-" symbol
which is read as "if", and a body (a conjunction of relations).
A relation is a predicate symbol, followed by some variable or
constant arguments in parentheses. Any variable appearing in the
head must appear in the body. For instance, we define paths
in terms of edges in the datalog program:
path(X,Y) :- edge(X,Y).
path(X,Y) :- edge(X,Z) & path(Z,Y).
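A minimal sketch of what evaluating this program bottom-up looks like, over an invented toy edge relation (naive fixpoint evaluation, not any particular system):

    # edge is an enumerated (EDB) relation; path is derived (an IDB).
    edge = {(1, 2), (2, 3), (3, 4)}

    def paths(edge):
        path = set(edge)                                  # path(X,Y) :- edge(X,Y).
        while True:
            # path(X,Y) :- edge(X,Z) & path(Z,Y).
            new = {(x, y) for (x, z) in edge for (z2, y) in path if z == z2}
            if new <= path:
                return path                               # fixpoint reached
            path |= new

    print(sorted(paths(edge)))  # the transitive closure of edge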
First-order Logic
For those familiar with first-order logic, the above program
has a declarative semantics in FOL/English:
forall(X,Y) ( edge(X,Y) => path(X,Y) ) AND
forall(X,Y,Z) ( edge(X,Z) ^ path(Z,Y) => path(X,Y) ).
Description Logic
Description logic is a different subset of first-order logic.
In contrast to datalog, it is variable-free: queries are built from concepts and roles using constructors such as ALL, EXISTS, negation, and number restrictions.
An example (somewhat distorted by my weak HTML) is a
query for papers written only by researchers, whose authors
include an American and a non-American:
Paper ^ (ALL Paper-Author . Researcher)
^ (EXISTS Paper-Author . American)
^ (EXISTS Paper-Author . ~American)
Another example queries for the papers with at least two authors,
all of whom are people, not chihuahuas:
Paper ^ (>=2 Paper-Author) ^ (ALL Paper-Author . Person)
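To pin down the semantics, here is one possible FOL reading of these two queries, in the same notation as the datalog translation above (my rendering, not from the lecture):

q1(P) == Paper(P) ^ forall(A) ( Paper-Author(P,A) => Researcher(A) )
                  ^ exists(A) ( Paper-Author(P,A) ^ American(A) )
                  ^ exists(A) ( Paper-Author(P,A) ^ ~American(A) )

q2(P) == Paper(P) ^ exists(A1,A2) ( Paper-Author(P,A1) ^ Paper-Author(P,A2) ^ A1 != A2 )
                  ^ forall(A) ( Paper-Author(P,A) => Person(A) )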
Predicates, relations, and queries
There are two kinds of predicates in datalog: those that are
enumerated (the edge predicate is assumed to be enumerated
somewhere, in a database) and those that are derived. In the database
theory community (i.e., Jeff Ullman's students), the terms for them
are EDBs (extensional database predicates) and IDBs (intensional
database predicates). A conjunctive query is a datalog rule, defining
a new query predicate (perforce an IDB) in terms of other predicates.
A view is also defined in the same way as a query, though the term
is usually used to indicate an IDB with extended existence and reuse,
while a query is more often a metaphor for a one-shot deal.
Subsumption, query containment, and entailment
One question we can ask of our databases is, what are the ground
facts (a.k.a. tuples) we can derive from them? For instance,
what are the pairs (X,Y) such that path(X,Y)? This leads into the
subject of evaluating a datalog query over a database, a subject which
was studied to death by Ullman's group.
A more interesting question involves two queries. We have given two examples of queries in description logic in the previous section. The answers to the former will all be answers to the latter as well, so we say the latter query subsumes the former.
An analogous notion applies to pairs of datalog queries. Query Q1 contains query Q2 iff all the facts (tuples) returned by Q2 are always returned by Q1, regardless of the database. Equivalently, in FOL, FORALL (X,Y) Q2(X,Y) => Q1(X,Y).
If a query contains another, then a containment mapping exists between them. A containment mapping is a substitution of variables and constants (of the containee) for the variables of the container, such that each conjunct of the substituted container appears in the containee.
Answering queries using views
Suppose you have a set of EDB predicates in which you understand the
universe. You have some WWW form-based interfaces to databases, v1
through vn, which are materialized views (IDBs) of these EDBs. Then
suppose you get a query, defined solely in terms of the EDBs.
Levy, Rajaraman, and Ordille give an algorithm to find a query
defined solely in terms of the views that returns only answers to
the original query, perhaps all the answers. This query, in contrast
to the original, can actually be run.
Input is Q() :- q1(), q2(), ..., qn().
For each qi, find any view vj which is relevant to qi.
Substitute some combination of vj's for qi's to form new query Q'().
For each view in Q', substitute its definition in terms of EDBs, to get Q''().
Check that Q'' is contained in Q.
Repeat for all combinations.
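The containment check in the last step can be done by brute force over candidate containment mappings. Here is a hedged Python sketch (exponential, as noted below; the query encoding is mine, and terms are kept as strings for simplicity):

    from itertools import product

    # A conjunctive query is (head, body); an atom is (predicate, args);
    # strings starting with an uppercase letter are variables.
    def is_var(term):
        return isinstance(term, str) and term[:1].isupper()

    def contains(q1, q2):
        """True iff a containment mapping from q1 into q2 exists (q1 contains q2)."""
        (head1, body1), (head2, body2) = q1, q2
        vars1 = sorted({t for _, args in [head1] + body1 for t in args if is_var(t)})
        terms2 = sorted({t for _, args in [head2] + body2 for t in args}, key=str)
        for image in product(terms2, repeat=len(vars1)):
            sub = dict(zip(vars1, image))
            def apply(atom):
                return (atom[0], tuple(sub.get(t, t) for t in atom[1]))
            # every substituted conjunct of the container must appear in the containee
            if apply(head1) == head2 and all(apply(a) in body2 for a in body1):
                return True
        return False

    q1 = (("q", ("X",)), [("edge", ("X", "Y"))])   # q(X) :- edge(X,Y).
    q2 = (("q", ("X",)), [("edge", ("X", "X"))])   # q(X) :- edge(X,X).
    print(contains(q1, q2), contains(q2, q1))      # True False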
This algorithm is exponential in the length of the input query, for two
reasons. First, it must try all combinations of views relevant to each
conjunct. Second, the inner loop does containment mapping, which is
exponential in the number of repeated predicate symbols in the query.
Miscellaneous
Also discussed were capabilities of a data source. An
ordinary "traditional relational" database allows you to download all
facts in the whole data relation, or query it on equality with any
field. Negative capabilities would imply a more restrictive
input/output scheme. Positive capabilities would include SQL or other
more complex input, to be processed by the source.
(01/20/98) Information Integration Systems: Razor, TSIMMIS, IM
A comparison of the different systems:
Finally, the Alon Levy magical mystery list:
(Italicized problems are under-researched and good project opportunities)
(01/27/98) ILA and Shopbot
Category Translation
Some records may not have all fields; some records may be missing; this knowledge may be discovered in the learning process; however, it is not possible to learn new attributes for objects.
Correspondence Heuristic
- the key thing that makes ILA work. Assumption (similar to Kushmerick's) that the format of a query's response is always the same. In contrast with Kushmerick's system, if ILA knows for sure that field 2 is the name, it's set. ILA can't be sure of the matches (semantics); HLRT can't be sure of the placement of tokens.
Question: Is the model mapping between field position and attribute?
Answer: Between field position and some composition of attributes.
Question: Does this require machine learning?
Answer: Might be done with exhaustive search, but even that would be a type of machine learning.
ILA
Problem: don't know which fields will matter, particularly with composition
Problem: prices and varying data do not fit into this system because it is based on pure matching recognition
Problem: how to handle multiple results?
ILA Solution: check each result, process against given query separately. Generally throws away irrelevant results on multiple results. This policy could cause puns.
Question: if field goes from object -> token, how can they be composed?
Answer: some fields go from object->object; others go from object -> token. All chains must end in a token.
Problem: loops in the field references (Bob is ( student (advisor Bob))).
ILA Solution: Breadth first search won't get caught in loops. Also any shorter hypothesis will be reached before longer ones (like (last-name (student (advisor Bob))))
Problem: The BFS solution imposes an arbitrary depth bound; however, the "length" of a chain is dependent on the ontology used; changing that ontology could move valid chains in and out of the depth bound.
This is part of the general difficulties inherent in logically formulating knowledge. Even in a simple world it is hard to know what the "right" structure is and what depth bound is appropriate.
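A hedged sketch of the breadth-first search over attribute chains discussed above; the toy world model is invented, and ILA's real model and ontology are richer:

    from collections import deque

    # object -> {attribute: object-or-token}; note the advisor loop bob <-> alice
    world = {
        "bob":   {"advisor": "alice", "last-name": "Jones"},
        "alice": {"advisor": "bob", "last-name": "Smith"},
    }

    def explain(obj, token, depth_bound=3):
        """Shortest attribute chain from obj whose final value is the token."""
        queue = deque([(obj, [])])
        while queue:
            cur, chain = queue.popleft()
            if len(chain) >= depth_bound:
                continue                        # the arbitrary bound discussed above
            for attr, val in world.get(cur, {}).items():
                if val == token:
                    return chain + [attr]       # chains must end in a token
                queue.append((val, chain + [attr]))
        return None

    print(explain("bob", "Smith"))  # ['advisor', 'last-name']

Because the search is breadth-first, shorter chains are found before longer ones, and the depth bound keeps the advisor loop from running forever.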
Comment: this system seems to put most of its complexity into the construction of its ontology
Response: no, the semantic translation is still a difficult (and interesting) problem
Comment: ILA's ontology affects the format in which final responses return. In other words, only ILA's view is represented in the final response.
Evaluation
Note that depth bounds do not affect the given formulae because they are based on pairwise hypotheses that have already been found.
Comment: ILA seemed general at first, but implementation requires a lot of narrow, special-purpose coding (ontology).
Response: ILA is meant to be used in situations where a small amount of knowledge has already been collected. It's the bootstrapping that's central to developing new domains.
ShopBot basics (similar to ILA)
Comment: low-tech approach that worked (low-lying fruit)
Comment: Is there any deep approach in ShopBot? May be masked by an aversion to detail.
Environmental regularities
- Dan really likes this part: ShopBot leverages regularities that need not exist but do; this technique is drawn from the robot agent world.
Question: Will standards make this obsolete?
Answer: Probably not any time soon. The difficulty with standards is that everyone wants a standard that favors their product. Example: clothing. They haven't even standardized sizing; how will they standardize everything else?
Comment: ShopBot drives some of its searchees to its standard.
Comment: BargainFinder's antagonism got it locked out of some sites.
Learning a proto-wrapper
Cool stuff: Failure checking. Tries a bogus search first to see what failure looks like.
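A hedged sketch of that trick; the page-signature heuristic here is my guess at a mechanism, since ShopBot's actual comparison works on the learned page structure:

    def signature(page_html):
        """Crude page signature: the set of non-empty lines of the page."""
        return frozenset(l.strip() for l in page_html.splitlines() if l.strip())

    def learn_failure(search, bogus_query="xqzzyk-no-such-product-9q7"):
        """Probe the store with a nonsense query and remember the failure page."""
        return signature(search(bogus_query))

    def looks_like_failure(page_html, failure_sig, threshold=0.9):
        """A result page that mostly matches the failure signature is a miss."""
        sig = signature(page_html)
        overlap = len(sig & failure_sig) / max(len(failure_sig), 1)
        return overlap >= threshold

    # `search` is any function mapping a query string to result-page HTML for
    # one vendor: failure = learn_failure(search), then filter real result
    # pages with looks_like_failure(page, failure).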
Evaluation
Question: How does it parse tuples?
Answer: Domain-specific heuristics.
Question: Can ShopBot get company stores?
Answer: No, "common" software not at any particular software company (i.e., Intuit doesn't have Exchange, one of the packages used to learn sites)