CSE 592 Project:
Information Integration

The CSE 592 project will be to implement an Intelligent Internet Information Integration System in Java. We expect people to work in teams of three (plus or minus one). Papers 6.1 and 6.2 contain important information, so be sure to read them carefully. Note that this project description is intended as a starting point: groups are encouraged to add features beyond those described here. Alternatively, groups are welcome to propose a different project (with milestones and timetable) if they wish.

CONTENTS
Motivation
Datalog
Data Formats
Example
Ontology for the Movie Domain
Project Requirements
Timetable
Resources
MOTIVATION
Thanks to the Internet, thousands of structured information sources are available for querying, and the number and variety of these sites are growing rapidly. While a wide range of questions can be answered via the Internet, the morass of sources means that users cannot easily get the information they need. Humans face three problems when trying to gather information. First, they must determine which of the myriad sites has information relevant to their question. Second, they must learn to navigate the sites' idiosyncratic interfaces. Third, for many queries they must integrate the data returned by several different sites.
By automating this process, a software agent can greatly simplify the task of gathering information. For example, a user could ask for reviews of all movies starring Marlon Brando playing in Seattle. To gather the desired information, the agent must reason about the contents and capabilities of different information sources. In this case, no single information source can answer the query, and there are several choices of how to do it. The agent might first go to the Internet Movie Database to get a list of movies starring Marlon Brando, then go to MovieLink to see which of these movies are showing in Seattle, and finally go to Ebert to get reviews of each of the relevant movies. Because most information sources are incomplete, it is often necessary to execute more than one such plan. For example, since Ebert contains only a fraction of the movie reviews on the web, the agent can return more information by also going to Cinemachine.

As your class project, you will construct a simple version of an information integration agent. In order to make progress we must answer the following questions:

We answer (some of) these questions below, suggest an architecture, specify some interfaces, provide some sample code, and provide a timetable for progress on the project.
DATALOG
We will use a case-insensitive dialect of datalog (i.e., a logical language that is a subset of first-order predicate calculus and disallows functions) as a knowledge representation language for the input and output of the system. If this doesn't make sense to you, please reread section 2 of paper 6.2. The main components of datalog are relations and rules.

Relations are the nouns of datalog. A relation has a case-insensitive name by which it is uniquely identified, followed by a list of arguments in parentheses. 

    actor-in("Eating Raoul","Luke Skywalker","Hugh Grant")
The number of arguments (the arity) of a relation is fixed, as are the types of the arguments. Arguments are terms, possibly with an annotation (see domain descriptions below). Terms may be constants, such as those shown above in quotation marks, or variables.
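
As a minimal sketch of how a relation instance might be held in memory, the class below stores a lower-cased name and a fixed list of argument terms. The class name Atom and its methods are hypothetical, not part of any provided code. It assumes that a term is a constant when it is quoted or numeric and a variable otherwise, and that variable names are compared case-insensitively like relation names.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.Locale;

/** Hypothetical model of one datalog relation instance, e.g. actor-in("Eating Raoul", ...). */
final class Atom {
    final String name;       // lower-cased, because relation names are case-insensitive
    final List<String> args; // raw term text: "\"Hugh Grant\"", "1969", or "Movie"

    Atom(String name, String... args) {
        this.name = name.toLowerCase(Locale.ROOT);
        this.args = Collections.unmodifiableList(Arrays.asList(args.clone()));
    }

    /** The arity is simply the number of arguments. */
    int arity() {
        return args.size();
    }

    /** Assumption: quoted strings and plain numbers are constants; anything else is a variable. */
    static boolean isConstant(String term) {
        return term.startsWith("\"") || term.matches("-?\\d+(\\.\\d+)?");
    }

    /** Distinct variables of this atom, lower-cased, in order of first appearance. */
    List<String> variables() {
        List<String> vars = new ArrayList<>();
        for (String term : args) {
            String v = term.toLowerCase(Locale.ROOT);
            if (!isConstant(term) && !vars.contains(v)) {
                vars.add(v);
            }
        }
        return vars;
    }

    /** Two atoms refer to the same relation if name and arity agree. */
    boolean sameRelation(Atom other) {
        return name.equals(other.name) && arity() == other.arity();
    }
}
```

With this representation, the fact shown above and a pattern such as actor-in(Movie,Role,Actor) compare as the same relation regardless of how the name is capitalized.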

Rules are the sentences of datalog. Each rule has three parts: head, connector, and body. A rule can be thought of as an operational definition of the head in terms of the body. For example, in the following rule

    InternetMovieDB(Movie,Actor) => actor-in(Movie,Role,Actor); year-of(Movie,Year); >(Year,1969).

we have "InternetMovieDB(Movie,Actor)" as the head. In general the head is always a relation. The connector is one of ("=>", "<=", or "<=>"). The body is a conjunctive expression. We use ";" to stand for "AND". [In the definition of plans below, we will see "," used for "AND" as well, whenever ordering must be maintained.] Expressions are defined recursively, using parentheses, and end with a period. Spacing and line breaks don't matter.

Safe rules. We require that all rules be "safe," i.e. they obey the constraint that all variables occurring in the head occur in the body as well.
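
The safety constraint can be checked mechanically. The following is a minimal sketch, not a prescribed implementation: the class SafetyCheck and its method names are hypothetical, a rule is passed in as plain lists of argument strings, and it is assumed that quoted strings and numbers are constants, that a leading $ annotation is ignored, and that variable names are case-insensitive.

```java
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

/** Hypothetical helper that tests the "safe rule" constraint. */
final class SafetyCheck {

    /** Assumption: quoted strings and plain numbers are constants; anything else is a variable. */
    static boolean isVariable(String term) {
        return !(term.startsWith("\"") || term.matches("-?\\d+(\\.\\d+)?"));
    }

    /** Drops a leading $ annotation and lower-cases the variable name. */
    static String normalize(String term) {
        String s = term.startsWith("$") ? term.substring(1) : term;
        return s.toLowerCase(Locale.ROOT);
    }

    /**
     * Returns true if every variable in the head also occurs somewhere in the body.
     * headArgs holds the head's arguments; bodyArgs holds one argument list per body relation.
     */
    static boolean isSafe(List<String> headArgs, List<List<String>> bodyArgs) {
        Set<String> bodyVars = new HashSet<>();
        for (List<String> atomArgs : bodyArgs) {
            for (String term : atomArgs) {
                if (isVariable(term)) {
                    bodyVars.add(normalize(term));
                }
            }
        }
        for (String term : headArgs) {
            if (isVariable(term) && !bodyVars.contains(normalize(term))) {
                return false; // a head variable that the body never mentions
            }
        }
        return true;
    }
}
```

Applied to the InternetMovieDB rule above, the head variables Movie and Actor both occur in actor-in(Movie,Role,Actor), so the rule is safe. A head such as InternetMovieDB(Movie,Director) over the same body would be rejected.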
 

Input: Declarations. The first section of the input to the information agent should declare all the data types. Data types are categories of strings or numbers, such as first names, zip codes, or phone numbers. They cannot be compound types. (If you wish to extend the required project, you can allow for general hierarchical types.) Each declaration names one or more types. The declarations should be preceded by the keyword types:

The first section should also declare all the IDB relations (also called world relations or ontology relations). Intuitively, these are the relations that define how you think the data are related, rather than those that define where the data reside on the web. We have provided a set of these relations for the movie domain; they can be found later in this assignment. If you would like us to add others, please email the instructor or TA. Here are examples of the IDB relations:

Input: Information Source Descriptions: A source description defines an information source in terms of the world relations. There are three aspects to this definition: the kind (number, names, and types) of data in the site, the method of access, and the world relations that pertain to that data. For instance,

In other words, the Internet Movie Database has a form that, when given an actor, returns a table, each row of which contains a movie that actor was in, the role he or she played, the year, and the actor's name. The number and names of the arguments are in the head. The world relations pertaining to the data in the site are in the body. We almost always use the => connector for source descriptions, which indicates that any tuples returned by IMDB satisfy the relations on the right hand side, but that the source does not necessarily return all such tuples. If IMDB were complete (i.e. returned all records satisfying the actor-in and year-of constraints), then we would use <=> in its definition. (Reasoning about such locally complete definitions is optional in this project.)
Note that the Actor input argument is annotated with a $; this signifies that the Actor field must be bound before initiating a query to this information source. Reasoning about this kind of binding pattern is a required part of the project.
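
One simple way to reason about binding patterns is to track which variables are already bound and to call a source only when all of its $-annotated arguments are in that set. The sketch below illustrates the idea under stated assumptions: the class BindingPatterns and its methods are hypothetical, each source is given only as its list of argument names, every argument of a source becomes bound once the source has been called, and variable names are case-insensitive. Because the bound set only grows, a greedy loop is enough to find an executable order if one exists.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

/** Hypothetical helper for ordering source accesses so that $-arguments are bound in time. */
final class BindingPatterns {

    /** Drops a leading $ annotation and lower-cases the variable name. */
    static String varName(String arg) {
        String s = arg.startsWith("$") ? arg.substring(1) : arg;
        return s.toLowerCase(Locale.ROOT);
    }

    /** A source can be called only if every $-annotated argument is already bound. */
    static boolean canCall(List<String> sourceArgs, Set<String> bound) {
        for (String arg : sourceArgs) {
            if (arg.startsWith("$") && !bound.contains(varName(arg))) {
                return false;
            }
        }
        return true;
    }

    /**
     * Greedily picks callable sources until all are used.
     * Returns the indices of the sources in an executable order, or null if none exists.
     */
    static List<Integer> order(List<List<String>> sources, Set<String> initiallyBound) {
        Set<String> bound = new HashSet<>();
        for (String v : initiallyBound) {
            bound.add(varName(v));
        }
        List<Integer> plan = new ArrayList<>();
        boolean[] used = new boolean[sources.size()];
        boolean progress = true;
        while (plan.size() < sources.size() && progress) {
            progress = false;
            for (int i = 0; i < sources.size(); i++) {
                if (!used[i] && canCall(sources.get(i), bound)) {
                    used[i] = true;
                    plan.add(i);
                    for (String arg : sources.get(i)) {
                        bound.add(varName(arg)); // outputs of this call are now available
                    }
                    progress = true;
                }
            }
        }
        return plan.size() == sources.size() ? plan : null;
    }
}
```

For example, given a hypothetical review source with arguments ($Movie, Review) and a movie database source with arguments (Movie, Role, Year, $Actor), a query that binds Actor must call the database first, since only that call makes Movie available to the review source. If nothing is bound initially, neither source can be called and no order exists.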

You will be writing the rules for information source descriptions and will need to find and describe sources on the web which supply information about the world relations we provide. We will make available rules written by other groups so they can be used by everyone.

Input: Query: The query is a definition of the results of an interaction with the system, which will be in table format. So it is a new relation, defined in terms of other relations. We will only be supporting conjunctive queries over correct but incomplete data sources.

The query is just a special rule defining query relations. It doesn't matter which connector you use; the meaning is the same. This one says that we are collecting into a relation "hugh" all movies that star Hugh Grant in any role and that are showing in Seattle, as well as all reviews of those movies.

Output: Recursive Plans. The planning component of the system will output a recursive plan, which is an executable datalog program for solving the query. The plan contains all reasonable sequences of source accesses that can result in an answer to the query. Reasonable sequences respect the types given above, and limit themselves to data in the sources and the query.
A recursive plan only uses the <= connector. There are no annotations ($). There are no declarations.

The following example illustrates sample contents of the declarations, source descriptions, and query. We use C++-style comments, i.e. lines starting with //