CSE 592 Project: Information Integration

The CSE 592 project will be to implement an Intelligent Internet Information Integration System in Java. We expect people to work in teams of three (plus or minus one). Papers 6.1 and 6.2 contain important information, so be sure to read them carefully. Note that this project description is intended as a starting point: groups are encouraged to add features beyond what we describe here. Groups are also welcome to propose a different project (with milestones and timetable) if they wish.

CONTENTS

Motivation
Datalog
Data Formats
Example
Ontology for the Movie Domain
Project Requirements
Timetable
Resources

MOTIVATION

Thanks to the Internet, thousands of structured information sources are available for querying, and the number and variety of these sites is growing rapidly. While a wide range of questions can be answered via the Internet, the morass of sources means that users cannot easily get the information they need. Humans face three problems when trying to gather information. First, they must determine which of the myriad sites has information relevant to their question. Second, they must learn to navigate the sites' idiosyncratic interfaces. Third, for many queries they must integrate the data returned by several different sites.
By automating this process, a software agent can greatly simplify the task of gathering information. For example, a user could ask for reviews of all movies starring Marlon Brando that are playing in Seattle. To gather the desired information, the agent must reason about the contents and capabilities of different information sources. In this case, no single information source can answer the query, and there are several ways to answer it. The agent might first go to the Internet Movie Database to get a list of movies starring Marlon Brando, then go to MovieLink to see which of these movies are showing in Seattle, and finally go to Ebert to get reviews of each relevant movie. Because most information sources are incomplete, it is often necessary to execute more than one such plan. For example, since Ebert contains only a fraction of the movie reviews on the web, the agent can return more information by also going to Cinemachine.

As your class project, you will construct a simple version of an information integration agent. In order to make progress we must answer the following questions:

How should the system represent information about the content of available information sources?
How should the system represent information about the capabilities of the sources?
How will users specify their queries?
How will the system generate a plan for answering the user's query?
How will the system execute the plan? (Are query optimization techniques necessary?)
How should the resulting data be displayed?

We answer (some of) these questions below, suggest an architecture, specify some interfaces, provide some sample code, and provide a timetable for progress on the project.

DATALOG

We will use a case-insensitive dialect of datalog (i.e., a logical language that is a subset of first-order predicate calculus, disallowing functions) as the knowledge representation language for the input and output of the system. If this doesn't make sense to you, please reread section 2 of paper 6.2. The main components of datalog are relations and rules.

Relations are the nouns of datalog. A relation has a case-insensitive name by which it is uniquely identified, followed by a list of arguments in parentheses.

actor-in("Eating Raoul","Luke Skywalker","Hugh Grant")

The number of arguments (the arity) of a relation is fixed, as are the types of the arguments. Arguments are terms, possibly with an annotation (see domain descriptions below). Terms may be constants, such as those shown above in quotation marks, or variables.

Rules are the sentences of datalog. Each rule has three parts: head, connector, and body. A rule can be thought of as an operational definition of the head in terms of the body. For example, in the following rule

InternetMovieDB(Movie,Actor) => actor-in(Movie,Role,Actor); year-of(Movie,Year); >(Year,1969).

we have "InternetMovieDB(Movie,Actor)" as the head. The head is always a single relation. The connector is one of "=>", "<=", or "<=>". The body is a conjunctive expression. We use ";" to stand for "AND". [In the definition of plans below, we will see "," used for "AND" as well, whenever ordering must be maintained.] Expressions are defined recursively, using parentheses, and end with a period. Spacing and line breaks don't matter.

Safe rules. We require that all rules be "safe," i.e. they obey the constraint that all variables occurring in the head occur in the body as well.
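
The safety constraint is mechanical to check. The following is a minimal Java sketch, not part of the provided starter code. The class names Atom and SafetyCheck are hypothetical, and the sketch assumes that a term is a constant if it is quoted or numeric and a variable otherwise, and that variable names are compared case-insensitively.

```java
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

/** An atom such as actor-in(Movie,Role,Actor): a relation name plus argument terms. */
final class Atom {
    final String name;
    final List<String> args;

    Atom(String name, String... args) {
        this.name = name.toLowerCase();   // relation names are case-insensitive
        this.args = List.of(args);
    }

    /** Assumption: quoted strings and numbers are constants; everything else is a variable. */
    static boolean isVariable(String term) {
        return !term.startsWith("\"") && !term.matches("-?\\d+(\\.\\d+)?");
    }

    /** Returns a fresh, mutable set of the (lower-cased) variables in this atom. */
    Set<String> variables() {
        Set<String> vars = new LinkedHashSet<>();
        for (String term : args) {
            if (isVariable(term)) {
                vars.add(term.toLowerCase());
            }
        }
        return vars;
    }
}

final class SafetyCheck {
    /** Returns the head variables that never occur in the body; an empty set means the rule is safe. */
    static Set<String> unsafeVariables(Atom head, List<Atom> body) {
        Set<String> missing = head.variables();
        for (Atom atom : body) {
            missing.removeAll(atom.variables());
        }
        return missing;
    }

    public static void main(String[] args) {
        Atom head = new Atom("InternetMovieDB", "Movie", "Actor");
        List<Atom> body = List.of(
            new Atom("actor-in", "Movie", "Role", "Actor"),
            new Atom("year-of", "Movie", "Year"),
            new Atom(">", "Year", "1969"));
        System.out.println(unsafeVariables(head, body));   // prints []

        // An unsafe rule: Y occurs in the head but not in the body.
        System.out.println(unsafeVariables(
            new Atom("q", "X", "Y"), List.of(new Atom("p", "X"))));   // prints [y]
    }
}
```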

DATA FORMATS

Input: Declarations. The first section of the input to the information agent should declare all the data types. Data types are categories of strings or numbers, like first names, zip codes, or phone numbers. They cannot be compound types. (If you wish to extend the required project, you can allow for general hierarchical types.) Each declaration names one or more types. The declarations should be preceded by the keyword types:

types zipcode.

types firstname lastname movie role review.

The first section should also declare all the IDB relations (also called world relations or ontology relations). Intuitively, these are the relations that define how you think the data are related, rather than those that define where the data reside on the web. We provide a set of these relations for the movie domain later in this assignment. If you would like us to add others, please email the instructor or TA. Here are examples of IDB relation declarations:

relation actor-in(movie, role, actor).

relation review-of(movietype, reviewtype).
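
As an illustration of how such declarations might be read, here is a minimal Java sketch. The class name Declarations is hypothetical and is not part of the starter code. The sketch assumes one declaration per input string, lower-cases everything because the dialect is case-insensitive, treats commas as optional separators, and does not check that a relation's argument types were declared.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

final class Declarations {
    final Set<String> types = new LinkedHashSet<>();
    final Map<String, List<String>> relations = new LinkedHashMap<>();

    /** Splits a list of names separated by whitespace and/or commas. */
    private static List<String> names(String text) {
        List<String> out = new ArrayList<>();
        for (String token : text.trim().split("[\\s,]+")) {
            if (!token.isEmpty()) {
                out.add(token);
            }
        }
        return out;
    }

    /** Parses one declaration such as "types zipcode." or "relation year-of(movie, year)." */
    void parse(String declaration) {
        String s = declaration.trim().toLowerCase();
        if (s.endsWith(".")) {
            s = s.substring(0, s.length() - 1).trim();   // drop the terminating period
        }
        int open = s.indexOf('(');
        int close = s.lastIndexOf(')');
        if (s.startsWith("types ")) {
            types.addAll(names(s.substring("types ".length())));
        } else if (s.startsWith("relation ") && open > 0 && close > open) {
            String name = s.substring("relation ".length(), open).trim();
            relations.put(name, names(s.substring(open + 1, close)));
        } else {
            throw new IllegalArgumentException("Not a declaration: " + declaration);
        }
    }

    public static void main(String[] args) {
        Declarations decls = new Declarations();
        decls.parse("types Movie, Url, Person-name, Role, Year.");
        decls.parse("Relation review-of(Movie, Url).");
        System.out.println(decls.types);       // prints [movie, url, person-name, role, year]
        System.out.println(decls.relations);   // prints {review-of=[movie, url]}
    }
}
```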

Input: Information Source Descriptions: A source description defines an information source in terms of the world relations. There are three aspects to this definition: the kind (number, names, and types) of data in the site, the method of access, and the world relations that pertain to that data. For instance,

InternetMovieDB3(Movie, Role, Year, $Actor) => actor-in(Movie,Role,Actor); year-of(Movie,Year).

In other words, the Internet Movie Database has a form that, when given an actor, returns a table, each row of which contains a movie that actor was in, the role he or she played, the year, and the actor's name. The number and names of the arguments are in the head. The world relations pertaining to the data in the site are in the body. We almost always use the => connector for source descriptions, which indicates that any tuples returned by IMDB satisfy the relations on the right-hand side, but that the source does not necessarily return all such tuples. If IMDB were complete (i.e., returned all records satisfying the actor-in and year-of constraints), then we would use <=> in its definition. (Reasoning about such locally complete definitions is optional in this project.)
Note that the Actor input argument is annotated with a $; this signifies that the Actor field must be bound before initiating a query to this information source. Reasoning about this kind of binding pattern is a required part of the project.
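
One simple way to reason about binding patterns is to order source accesses so that every $-annotated argument is bound before its source is called. The following Java sketch illustrates the idea with a greedy ordering. It is an assumption-laden illustration rather than the required algorithm: the class names SourceAtom and BindingOrder are hypothetical, arguments are assumed to be variables, and a constant in the query is modeled as an initially bound variable.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Optional;
import java.util.Set;

/** A source atom such as EbertReviews($M,U); a leading $ marks a required input. */
final class SourceAtom {
    final String name;
    final List<String> args;

    SourceAtom(String name, String... args) {
        this.name = name;
        this.args = List.of(args);
    }

    /** Variables that must be bound before this source can be queried. */
    Set<String> requiredInputs() {
        Set<String> required = new LinkedHashSet<>();
        for (String arg : args) {
            if (arg.startsWith("$")) {
                required.add(arg.substring(1));
            }
        }
        return required;
    }

    /** All variables of the source; every one of them is bound once the source has answered. */
    Set<String> allVariables() {
        Set<String> vars = new LinkedHashSet<>();
        for (String arg : args) {
            vars.add(arg.startsWith("$") ? arg.substring(1) : arg);
        }
        return vars;
    }
}

final class BindingOrder {
    /** Greedily picks any source whose inputs are bound; returns empty if it gets stuck. */
    static Optional<List<String>> order(List<SourceAtom> sources, Set<String> initiallyBound) {
        Set<String> bound = new HashSet<>(initiallyBound);
        List<SourceAtom> pending = new ArrayList<>(sources);
        List<String> ordered = new ArrayList<>();
        while (!pending.isEmpty()) {
            SourceAtom next = null;
            for (SourceAtom candidate : pending) {
                if (bound.containsAll(candidate.requiredInputs())) {
                    next = candidate;
                    break;
                }
            }
            if (next == null) {
                return Optional.empty();   // some $-argument can never be bound
            }
            pending.remove(next);
            ordered.add(next.name);
            bound.addAll(next.allVariables());
        }
        return Optional.of(ordered);
    }

    public static void main(String[] args) {
        List<SourceAtom> sources = List.of(
            new SourceAtom("EbertReviews", "$M", "U"),
            new SourceAtom("InternetMovieDB1", "M", "$PN", "Y", "R"),
            new SourceAtom("LocalListings", "M"));
        // PN is bound by a constant in the query, so InternetMovieDB1 can go first.
        System.out.println(order(sources, Set.of("PN")));
        // With nothing bound, InternetMovieDB1 can never be called.
        System.out.println(order(sources, Set.of()));
    }
}
```

The greedy choice finds a feasible order when one exists, because calling a source only ever adds bindings. It makes no attempt to pick the cheapest order.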

You will be writing the rules for information source descriptions and will need to find and describe sources on the web which supply information about the world relations we provide. We will make available rules written by other groups so they can be used by everyone.

Input: Query: The query is a definition of the results of an interaction with the system, which will be in table format. So it is a new relation, defined in terms of other relations. We will only be supporting conjunctive queries over correct but incomplete data sources.

query hugh(Movie, Review) <=> actor-in(Movie,Role,"Hugh Grant");

shows-in(Movie, "Seattle",Theater); review-of(Movie, Review).

The query is just a special rule defining a query relation. It doesn't matter which connector you use; the meaning is the same. This one says that we are collecting into a relation "hugh" all movies that star Hugh Grant in any role and are showing in Seattle, together with all reviews of those movies.

Output: Recursive Plans. The planning component of the system will output a recursive plan, which is an executable datalog program for solving the query. The plan contains all reasonable sequences of source accesses that can result in an answer to the query. Reasonable sequences respect the types given above, and limit themselves to data in the sources and the query.
A recursive plan only uses the <= connector. There are no annotations ($). There are no declarations.

EXAMPLE

The following example illustrates sample contents of the declarations, source descriptions, and query. We use C++ style comments, i.e. lines starting with //.

// Declarations

// Commas are totally optional in type declarations and argument lists.

types Movie, Url, Person-name, Role, Year.

// world relations

Relation review-of(Movie, Url).

Relation actor-of(Movie, Person-name, Role).

Relation year-of(Movie, Year).

Relation showing-in-seattle(Movie).

// Query
Query brando(M,U) <=>
actor-of(M,"Marlon Brando",R);
review-of(M,U);
showing-in-seattle(M).

// Source descriptions
InternetMovieDB1(M,$PN,Y,R) => actor-of(M,PN,R); year-of(M,Y).
InternetMovieDB2($M,PN,R) => actor-of(M,PN,R).
LocalListings(M) => showing-in-seattle(M).
EbertReviews($M,U) => review-of(M,U).
TeenMovieCritic(M,U) => review-of(M,U).

// Recursive plan for brando query
// Commas are required in expressions.
brando(M,U) <=
actor-of(M,"Marlon Brando",R);
review-of(M,U);
showing-in-seattle(M).
actor-of(M,PN,R) <= Person-name(PN),InternetMovieDB1(M,PN,Y,R).
year-of(M,Y) <= Person-name(PN),InternetMovieDB1(M,PN,Y,R).
actor-of(M,PN,R) <= Movie(M),InternetMovieDB2(M,PN,R).
showing-in-seattle(M) <= LocalListings(M).
review-of(M,U) <= Movie(M),EbertReviews(M,U).
review-of(M,U) <= TeenMovieCritic(M,U).
Movie(M) <= Person-name(PN),InternetMovieDB1(M,PN,Y,R).
Year(Y) <= Person-name(PN),InternetMovieDB1(M,PN,Y,R).
Role(R) <= Person-name(PN),InternetMovieDB1(M,PN,Y,R).
Person-name(PN) <= movie(M),InternetMovieDB2(M,PN,R).
Role(R) <= movie(M),InternetMovieDB2(M,PN,R).
Movie(M) <= LocalListings(M).
Url(U) <= movie(M), EbertReviews(M,U).
Movie(M) <= TeenMovieCritic(M,U).
Url(U) <= TeenMovieCritic(M,U).
Person-name("Marlon Brando").
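
The plan rules in the example follow a regular pattern: each source description yields one rule per world relation in its body and one type rule per output argument, and each $-annotated input is guarded by its type relation. The Java sketch below reproduces that pattern. It is an illustration under stated assumptions, not the required planner: the class names PlanAtom and InverseRules are hypothetical, argument types are inferred from the world relation declarations, and every variable in the body is assumed to appear in the source head (the general case needs extra machinery that is not shown here).

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** A body atom of a source description, e.g. actor-of(M,PN,R). */
final class PlanAtom {
    final String relation;
    final List<String> vars;

    PlanAtom(String relation, String... vars) {
        this.relation = relation;
        this.vars = List.of(vars);
    }

    @Override
    public String toString() {
        return relation + "(" + String.join(",", vars) + ")";
    }
}

final class InverseRules {
    /** Turns "source(args) => body" into plan rules that use only the <= connector. */
    static List<String> invert(String source, List<String> args, List<PlanAtom> body,
                               Map<String, List<String>> signatures) {
        // Infer each variable's type from its position in a world relation.
        Map<String, String> typeOf = new HashMap<>();
        for (PlanAtom atom : body) {
            List<String> types = signatures.get(atom.relation);
            for (int i = 0; i < atom.vars.size(); i++) {
                typeOf.put(atom.vars.get(i), types.get(i));
            }
        }
        // Build the shared right-hand side: type guards for $-inputs, then the source call.
        StringBuilder guards = new StringBuilder();
        List<String> plainArgs = new ArrayList<>();
        for (String arg : args) {
            String var = arg.startsWith("$") ? arg.substring(1) : arg;
            plainArgs.add(var);
            if (arg.startsWith("$")) {
                guards.append(typeOf.get(var)).append("(").append(var).append("),");
            }
        }
        String rhs = guards + source + "(" + String.join(",", plainArgs) + ").";

        List<String> rules = new ArrayList<>();
        for (PlanAtom atom : body) {
            rules.add(atom + " <= " + rhs);                          // world relation rules
        }
        for (String arg : args) {
            if (!arg.startsWith("$")) {
                rules.add(typeOf.get(arg) + "(" + arg + ") <= " + rhs);   // type rules for outputs
            }
        }
        return rules;
    }

    public static void main(String[] args) {
        Map<String, List<String>> signatures = Map.of(
            "actor-of", List.of("Movie", "Person-name", "Role"),
            "year-of", List.of("Movie", "Year"));
        List<String> rules = invert("InternetMovieDB1", List.of("M", "$PN", "Y", "R"),
            List.of(new PlanAtom("actor-of", "M", "PN", "R"), new PlanAtom("year-of", "M", "Y")),
            signatures);
        rules.forEach(System.out::println);
    }
}
```

Run on InternetMovieDB1, the sketch prints the five InternetMovieDB1 rules of the plan above. The query rule and the fact Person-name("Marlon Brando") come from the query and are not generated here.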

ONTOLOGY FOR THE MOVIE DOMAIN

For this project, we will actually use the following slightly different (more elaborate) world ontology. It has the following types (everything below is case-insensitive):

types Title, Url, Person-name, Role, Year, City, State, Theatre, Time, Format, Store, Price, Address, Map, Movie-title, Book-title, item-title, Author, Publisher, Music-title, Recording-artist, Song-name, Studio, Rating, Movie-category, Duration, Color-type, Oscar-category, Movie-Role, Text-review.

By format we mean a string such as dvd, vhs, hardcover, paperback, CD, cassette, etc. Prices are numbers. Addresses are strings. Movie-category is a string such as comedy or drama. Color-type is a string such as black and white.
We introduce the types music-id, movie-id and book-id because titles do not always uniquely identify these objects. You will need to decide how to create these unique IDs. One suggestion is to use a string that concatenates the title and the year. For books and music, you could also use the ISBN.
We also introduce the type Title, which can take on the value of an item with type Movie-title, Music-title or Book-title. We do this to avoid writing three separate price, year-of and title-of relations. You will always need to add the following rules to your plan to handle this recursive type:
Title(X) <= Movie-title(X).
Title(X) <= Music-title(X).
Title(X) <= Book-title(X).
We adopt the following world relations:
If you need additional relations in order to encode the contents of a relevant site, just email us with the proposed relation you need, and we'll add it (or suggest an alternative).

PROJECT REQUIREMENTS

You'll need to write the following:
- A user interface for entering the file that the query is stored in and returning the results to the user.
- A planner which takes a query as input (along with the site definitions) and produces a plan as output. The following functions/classes are required:
- An executor (horn rule solver) which takes a plan as input and applies wrappers asynchronously to get tuples of data, which it then joins together (using selections and projections as necessary) to generate the output.
- A set of 5+ wrappers (each group will share their wrappers so that a large set is available for the class as a whole). Each wrapper should be a subclass of the wrapper class which we provide.
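
The join and projection steps of the executor can be sketched in Java as follows. This is a minimal illustration, not the required design: the class name TupleJoin is hypothetical, tuples are modeled as maps from variable names to values, the join is a simple nested loop, asynchronous wrapper calls are left out, and the tuples in main are made-up sample data.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

final class TupleJoin {
    /** Joins two tuple lists; a pair is kept only if it agrees on every shared variable. */
    static List<Map<String, String>> join(List<Map<String, String>> left,
                                          List<Map<String, String>> right) {
        List<Map<String, String>> out = new ArrayList<>();
        for (Map<String, String> l : left) {
            for (Map<String, String> r : right) {
                boolean compatible = true;
                for (Map.Entry<String, String> e : r.entrySet()) {
                    String existing = l.get(e.getKey());
                    if (existing != null && !existing.equals(e.getValue())) {
                        compatible = false;
                        break;
                    }
                }
                if (compatible) {
                    Map<String, String> merged = new LinkedHashMap<>(l);
                    merged.putAll(r);
                    out.add(merged);
                }
            }
        }
        return out;
    }

    /** Keeps only the named variables of each tuple and drops duplicate rows. */
    static List<Map<String, String>> project(List<Map<String, String>> tuples,
                                             List<String> vars) {
        List<Map<String, String>> out = new ArrayList<>();
        for (Map<String, String> tuple : tuples) {
            Map<String, String> row = new LinkedHashMap<>();
            for (String v : vars) {
                row.put(v, tuple.get(v));
            }
            if (!out.contains(row)) {
                out.add(row);
            }
        }
        return out;
    }

    public static void main(String[] args) {
        // Made-up tuples standing in for wrapper output.
        List<Map<String, String>> actorOf = List.of(
            Map.of("M", "Movie A", "R", "Role 1"),
            Map.of("M", "Movie B", "R", "Role 2"));
        List<Map<String, String>> showing = List.of(Map.of("M", "Movie B"));
        List<Map<String, String>> reviewOf = List.of(
            Map.of("M", "Movie A", "U", "http://example.com/a"),
            Map.of("M", "Movie B", "U", "http://example.com/b"));

        List<Map<String, String>> answer =
            project(join(join(actorOf, showing), reviewOf), List.of("M", "U"));
        System.out.println(answer);   // prints [{M=Movie B, U=http://example.com/b}]
    }
}
```

A hash join on the shared variables would scale better than the nested loop, which is one of the optional optimizations a group might explore.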
Optional: There are many ways to extend your project if you so desire, but the (partial) list of features below is strictly optional. We include it explicitly as a point of clarification (since we are asking you to implement a subset of systems described in research papers in the readings, we wish to be clear about the aspects which are optional).
- functions to handle functional dependencies in the data sources (chase rules, etc.)
- handling local completeness rules, reasoning about subsumption, contingent planning
- adding a description logic to the representation language.
- adding arithmetic predicates or other built-in predicates.
- optimizations on any (every!) part of the system

TIMETABLE

Please note the following deadlines:
- (Monday 4/27) Each group should submit a name for the group (be imaginative) and a list of all group members (with email addresses).
- (Thursday 5/7) Select a set of at least five information sources, write source descriptions mapping their information content to the world ontology relations described above. Each group should turn in a list of the sites and the source descriptions. Please try for breadth of coverage - hit as many of the world relations as you can.
- (Thursday 5/14) Turn in complete wrappers for all your sites. Note: the objective is to get different groups writing wrappers for different sites. We will publish the complete set of wrappers, so each group can benefit from the collective effort.
- (Thursday 6/11) Writeup due along with the URL of the project code for a demo.

RESOURCES

Look HERE for starter code and a README file to get you started.
When writing wrappers, you will probably find it helpful to use a regular expression package. You are not required to use one, but if you do, we ask that you use the following package: Oromatcher.
This package is fully compatible with Perl5 syntax for regular expressions, is freely available and widely used. Please use version 1.07 and not the beta version 1.1. If you use regular expressions, please use only this package so that all members of the class can easily use your wrappers.
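
To show how a regular expression can turn a page into tuples, here is a minimal Java sketch. Two caveats apply. First, it uses the standard java.util.regex classes purely as a stand-in, because the Oromatcher API is not reproduced in this handout; a wrapper written for the class should use the required package and subclass the provided wrapper class. Second, the class name ReviewLinkExtractor, the page layout, and the sample HTML are all hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class ReviewLinkExtractor {
    // Hypothetical page layout: each review is a link whose text is the movie title.
    private static final Pattern REVIEW_LINK =
        Pattern.compile("<a href=\"([^\"]+)\">([^<]+)</a>", Pattern.CASE_INSENSITIVE);

    /** Returns one {movie, url} tuple per review link found in the page. */
    static List<String[]> extract(String html) {
        List<String[]> tuples = new ArrayList<>();
        Matcher matcher = REVIEW_LINK.matcher(html);
        while (matcher.find()) {
            tuples.add(new String[] { matcher.group(2).trim(), matcher.group(1) });
        }
        return tuples;
    }

    public static void main(String[] args) {
        // Made-up sample page.
        String page = "<li><a href=\"/reviews/1.html\">Sample Movie A</a></li>"
                    + "<li><A HREF=\"/reviews/2.html\"> Sample Movie B </A></li>";
        for (String[] tuple : extract(page)) {
            System.out.println(tuple[0] + " -> " + tuple[1]);
        }
    }
}
```

Each extracted pair corresponds to one tuple of a source such as TeenMovieCritic(M,U) in the example above.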