The CSE 592 project will be to implement an Intelligent Internet Information Integration System in Java. We expect people to work in teams of three (plus or minus one). Papers 6.1 and 6.2 contain important information, so be sure to read them carefully. Note that this project description is intended as a starting point: groups are encouraged to add features beyond what we describe here. Alternatively, groups are welcome to propose a different project (with milestones and a timetable) if they wish.
MOTIVATION
Thanks to the Internet, thousands of structured information sources are available for querying, and the number and variety of these sites is growing rapidly. While a wide range of questions can be answered via the Internet, the morass of sources means that users cannot easily get the information they need. Humans face three problems when trying to gather information. First, they must determine which of the myriad sites has information relevant to their question. Second, they must learn to navigate the sites' idiosyncratic interfaces. Third, for many queries they must integrate the data returned by several different sites.
As your class project, you will construct a simple version of an information integration agent. In order to make progress we must answer the following questions:
DATALOG
We will use a case-insensitive dialect of datalog (i.e. a logical language which is a subset of first order predicate calculus, disallowing functions), as a knowledge representation language for the input and output of the system. If this doesn't make sense to you, please reread section 2 of paper 6.2. The main components of datalog are relations and rules.
Relations are the nouns of datalog. A relation has a case-insensitive name by which it is uniquely identified, followed by a list of arguments in parentheses.
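A relation as described above could be held in a small Java value class. This is only a sketch: the class name RelationAtom and its fields are hypothetical and not part of any provided code. It assumes that only the relation name is case-insensitive and keeps the arguments verbatim.

```java
import java.util.List;
import java.util.Locale;

/** A datalog relation occurrence such as actor-of(M,"Marlon Brando",R). Hypothetical helper class. */
final class RelationAtom {
    final String name;        // lower-cased so that Movie(M) and movie(M) compare equal
    final List<String> args;  // argument strings, kept exactly as written (assumption)

    RelationAtom(String name, List<String> args) {
        this.name = name.toLowerCase(Locale.ROOT);
        this.args = List.copyOf(args);
    }

    @Override
    public boolean equals(Object other) {
        if (!(other instanceof RelationAtom)) return false;
        RelationAtom that = (RelationAtom) other;
        return name.equals(that.name) && args.equals(that.args);
    }

    @Override
    public int hashCode() {
        return 31 * name.hashCode() + args.hashCode();
    }

    @Override
    public String toString() {
        return name + "(" + String.join(",", args) + ")";
    }
}
```

Normalizing the name once in the constructor means that hash-based lookups (for example, a map from relation to rules) treat Movie and movie as the same relation without further checks.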
Rules are the sentences of datalog. Each rule has three parts: head, connector, and body. A rule can be thought of as an operational definition of the head in terms of the body. For example, in the following rule
Safe rules.
We require that all rules be "safe," i.e. they obey the constraint that all variables occurring in the head occur in the body as well.
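The safety constraint can be checked mechanically. The following is a minimal sketch, not a required design: the names SafetyCheck, Atom, isVariable, and unsafeVariables are hypothetical. It assumes that quoted strings and numbers are constants, that every other argument is a variable, and that a leading $ annotation can be ignored for this check. Variable names are compared exactly as written.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

final class SafetyCheck {
    /** Minimal atom: a relation name plus its argument strings (hypothetical helper type). */
    static final class Atom {
        final String name;
        final List<String> args;

        Atom(String name, String... args) {
            this.name = name;
            this.args = List.of(args);
        }
    }

    /** Drops a leading $ annotation, e.g. "$PN" becomes "PN" (assumption about annotations). */
    static String stripAnnotation(String arg) {
        return arg.startsWith("$") ? arg.substring(1) : arg;
    }

    /** Assumption: quoted strings and numbers are constants; anything else is a variable. */
    static boolean isVariable(String arg) {
        if (arg.isEmpty() || arg.startsWith("\"")) return false;
        char first = arg.charAt(0);
        return !(Character.isDigit(first) || first == '-');
    }

    /** Returns the head variables that never occur in the body; an empty list means the rule is safe. */
    static List<String> unsafeVariables(Atom head, List<Atom> body) {
        Set<String> bodyVars = new HashSet<>();
        for (Atom atom : body) {
            for (String arg : atom.args) {
                String v = stripAnnotation(arg);
                if (isVariable(v)) bodyVars.add(v);
            }
        }
        List<String> missing = new ArrayList<>();
        for (String arg : head.args) {
            String v = stripAnnotation(arg);
            if (isVariable(v) && !bodyVars.contains(v) && !missing.contains(v)) missing.add(v);
        }
        return missing;
    }

    static boolean isSafe(Atom head, List<Atom> body) {
        return unsafeVariables(head, body).isEmpty();
    }
}
```

Returning the offending variables rather than a bare boolean makes it easier to report which part of a rejected rule needs fixing.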
You will be writing the rules for information source descriptions and will need to find and describe sources on the web which supply information about the world relations we provide. We will make available rules written by other groups so they can be used by everyone.
Input: Query
The query defines the result of an interaction with the system, which is returned in table format. It is therefore a new relation, defined in terms of other relations. We will only be supporting conjunctive queries over correct but incomplete data sources.
Output: Recursive Plans
The planning component of the system will output a recursive plan, which is an executable datalog program for solving the query. The plan contains all reasonable sequences of source accesses that can result in an answer to the query. Reasonable sequences respect the types given above and limit themselves to data in the sources and the query.
A recursive plan only uses the <= connector. There are no annotations ($). There are no declarations.
// Query
Query brando(M,U) <=>
    actor-of(M,"Marlon Brando",R);
    review-of(M,U);
    showing-in-seattle(M).
// Source descriptions
InternetMovieDB1(M,$PN,Y,R) => actor-of(M,PN,R); year-of(M,Y).
InternetMovieDB2($M,PN,R) => actor-of(M,PN,R).
LocalListings(M) => showing-in-seattle(M).
EbertReviews($M,U) => review-of(M,U).
TeenMovieCritic(M,U) => review-of(M,U).
// Recursive plan for brando query
// Commas are required in expressions.
brando(M,U) <= actor-of(M,"Marlon Brando",R),review-of(M,U),showing-in-seattle(M).
actor-of(M,PN,R) <= Person-name(PN),InternetMovieDB1(M,PN,Y,R).
year-of(M,Y) <= Person-name(PN),InternetMovieDB1(M,PN,Y,R).
actor-of(M,PN,R) <= Movie(M),InternetMovieDB2(M,PN,R).
showing-in-seattle(M) <= LocalListings(M).
review-of(M,U) <= Movie(M),EbertReviews(M,U).
review-of(M,U) <= TeenMovieCritic(M,U).
Movie(M) <= Person-name(PN),InternetMovieDB1(M,PN,Y,R).
Year(Y) <= Person-name(PN),InternetMovieDB1(M,PN,Y,R).
Role(R) <= Person-name(PN),InternetMovieDB1(M,PN,Y,R).
Person-name(PN) <= movie(M),InternetMovieDB2(M,PN,R).
Role(R) <= movie(M),InternetMovieDB2(M,PN,R).
Movie(M) <= LocalListings(M).
Url(U) <= movie(M),EbertReviews(M,U).
Movie(M) <= TeenMovieCritic(M,U).
Url(U) <= TeenMovieCritic(M,U).
Person-name("Marlon Brando").
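One way to read the example above is that each source description contributes two kinds of plan rules: one rule per world relation it mentions, and one type rule per argument the source returns, with every $-annotated argument guarded by its type relation. The sketch below generates rules in that shape. It is an illustration under assumptions, not the prescribed planner: the class InverseRules, the method invert, and the typeOf map (variable to type relation, which would presumably come from declarations) are all hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.Objects;

final class InverseRules {
    /** Builds a type guard such as Movie(M); fails fast if the variable has no known type. */
    private static String typeAtom(String var, Map<String, String> typeOf) {
        String type = Objects.requireNonNull(typeOf.get(var), "no type for " + var);
        return type + "(" + var + ")";
    }

    /**
     * @param sourceName e.g. "InternetMovieDB2"
     * @param sourceArgs e.g. ["$M", "PN", "R"]; "$" marks an argument that must be bound
     * @param worldAtoms e.g. ["actor-of(M,PN,R)"], written without annotations
     * @param typeOf     variable name to type relation (an assumed input)
     */
    static List<String> invert(String sourceName, List<String> sourceArgs,
                               List<String> worldAtoms, Map<String, String> typeOf) {
        List<String> plainArgs = new ArrayList<>();
        List<String> bodyParts = new ArrayList<>();
        for (String arg : sourceArgs) {
            boolean mustBeBound = arg.startsWith("$");
            String var = mustBeBound ? arg.substring(1) : arg;
            plainArgs.add(var);
            if (mustBeBound) bodyParts.add(typeAtom(var, typeOf)); // guard supplies the binding
        }
        bodyParts.add(sourceName + "(" + String.join(",", plainArgs) + ")");
        String body = String.join(",", bodyParts) + ".";

        List<String> rules = new ArrayList<>();
        for (String atom : worldAtoms) {
            rules.add(atom + " <= " + body);               // rule for each world relation
        }
        for (String arg : sourceArgs) {
            if (!arg.startsWith("$")) {
                rules.add(typeAtom(arg, typeOf) + " <= " + body); // values the source returns
            }
        }
        return rules;
    }
}
```

Calling invert("InternetMovieDB2", ["$M","PN","R"], ["actor-of(M,PN,R)"], {M: Movie, PN: Person-name, R: Role}) yields the three InternetMovieDB2 rules of the example plan, up to the capitalization of Movie, which does not matter in a case-insensitive dialect.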
By format we mean a string such as dvd, vhs, hardcover, paperback, CD, cassette, etc. Prices are numbers. Addresses are strings. Movie-category is comedy, drama, etc. Color-type is black and white.
We adopt the following world relations:
Title(X) <= Movie-title(X).
Title(X) <= Music-title(X).
Title(X) <= Book-title(X).
If you need additional relations in order to encode the
contents of a relevant site, just email us with the proposed relation you
need, and we'll add it (or suggest an alternative).
When writing wrappers,
you will probably find it helpful to use a regular expression
package. You are not required to use a regular expression package but
if you do we ask that you use the following package:
Oromatcher
This package is fully compatible with Perl5 syntax for regular
expressions, is freely available and widely used. Please use version
1.07 and not the beta version 1.1. If you use regular expressions,
please use only this package so that all members of the class can
easily use your wrappers.
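To illustrate the general shape of a regular-expression wrapper, here is a small sketch that pulls (movie, url) pairs for review-of(M,U) out of a page. Two things are assumptions: the page layout (plain anchor tags) is invented for the example, and the code uses the standard java.util.regex classes only so that it runs without extra dependencies. Wrappers shared with the class should use the package named above; the class name ReviewWrapperSketch and the method extract are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class ReviewWrapperSketch {
    // Hypothetical page layout: <a href="URL">TITLE</a>, matched case-insensitively.
    private static final Pattern LINK = Pattern.compile(
            "<a\\s+href=\"([^\"]+)\"\\s*>([^<]+)</a>", Pattern.CASE_INSENSITIVE);

    /** Returns one {movie, url} pair per link found, i.e. candidate tuples for review-of(M,U). */
    static List<String[]> extract(String html) {
        List<String[]> tuples = new ArrayList<>();
        Matcher matcher = LINK.matcher(html);
        while (matcher.find()) {
            String url = matcher.group(1);
            String title = matcher.group(2).trim();
            tuples.add(new String[] { title, url });
        }
        return tuples;
    }
}
```

Keeping the pattern in one constant and returning plain tuples separates page parsing from the rest of the system, so a change in a site's layout only touches the pattern.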