Collector Project Specifications

To help with collector-related projects, this document describes some of the basic inputs your program should expect and may wish to handle. The list is extensible and is intended only as a starting point.

Queries

Your program will take as input conjunctive queries written in a variant of Datalog. Rules look like:

q1(X, Y) :- s1(X), {s2, s3, s4, s5} (X,Y)

where {source1, source2, ...} describes a set of alternative sources for a given EDB predicate.
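As an illustration of the rule syntax above, here is a minimal parsing sketch. It is not the course parser: the names Atom, Rule, and parse_rule are hypothetical, and the grammar is an assumption inferred from the single example rule (identifiers made of word characters, comma-separated atoms, no constants, no trailing period).

```python
import re
from dataclasses import dataclass


@dataclass
class Atom:
    sources: list  # alternative source names; one element for a plain predicate
    args: list     # variable names, in order


@dataclass
class Rule:
    head: Atom
    body: list     # list of Atom


# One atom: either name(args) or {name, name, ...} (args), optionally
# followed by the comma that separates it from the next atom.
_ATOM = re.compile(r"\s*(?:\{([^}]*)\}|(\w+))\s*\(([^)]*)\)\s*,?")


def _parse_atoms(text):
    atoms, pos = [], 0
    while pos < len(text):
        m = _ATOM.match(text, pos)
        if m is None:
            raise ValueError(f"cannot parse atom at: {text[pos:]!r}")
        names = m.group(1) if m.group(1) is not None else m.group(2)
        sources = [s.strip() for s in names.split(",") if s.strip()]
        args = [a.strip() for a in m.group(3).split(",") if a.strip()]
        atoms.append(Atom(sources, args))
        pos = m.end()
    return atoms


def parse_rule(line):
    head_text, sep, body_text = line.partition(":-")
    if not sep:
        raise ValueError("rule must contain ':-'")
    head = _parse_atoms(head_text.strip())
    if len(head) != 1:
        raise ValueError("rule head must be exactly one atom")
    return Rule(head[0], _parse_atoms(body_text.strip()))


rule = parse_rule("q1(X, Y) :- s1(X), {s2, s3, s4, s5} (X,Y)")
```

Representing a plain predicate as a one-element source list keeps the later planning code uniform: every body atom is "pick among these sources", and only the set size differs.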

We are attempting to put together a very simple parser which takes this extended datalog and outputs a simple logical query plan. The goal is to have this available by Thursday; stay tuned. You are of course free to use your own parser.

Data Source Catalog Information

The data source catalog is assumed to include the following information. We do not actually have a data source catalog implemented, so you are free to specify this information in any way you choose.

Data Source Coverage and Overlap

For a detailed paper on using probabilistic information for data integration, see: D. Florescu, D. Koller, A. Levy, Using Probabilistic Information in Data Integration, VLDB 1997.

For the purposes of your project, we can assume a somewhat simpler model. Rather than a complete probability distribution, you may choose to assume that there is only one topic. Data source coverage is then expressed simply as the probability that any given tuple in the domain can be found in that data source.

To further simplify things, we can assume that overlap information is expressed as follows:

In many cases, the overlap information will actually not be specified. For these cases, assume independence of data sources.
Otherwise, assume that one source is always a subset of the other. Then the overlap information will be specified in a record as follows:
<largerSource, smallerSource, Pr(smallerSource | largerSource)>
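The two overlap cases above can be sketched as a small coverage calculation. This is one possible reading, not a required design: the names Overlap, implied_coverage, and union_coverage are hypothetical, and the sketch only looks at containment records whose larger source is itself among the selected sources (it does not follow chains of containment).

```python
from typing import NamedTuple


class Overlap(NamedTuple):
    # Mirrors the record <largerSource, smallerSource, Pr(smaller | larger)>.
    larger: str
    smaller: str
    p_smaller_given_larger: float


def implied_coverage(record, larger_coverage):
    # If smaller is a subset of larger, then
    # Pr(smaller) = Pr(smaller | larger) * Pr(larger).
    return record.p_smaller_given_larger * larger_coverage


def union_coverage(coverage, overlaps=()):
    """Probability that a tuple is found in at least one selected source.

    coverage: dict mapping source name -> Pr(tuple is in that source)
    overlaps: iterable of Overlap records; sources with no record are
              treated as independent.
    """
    contained = {(o.smaller, o.larger) for o in overlaps}
    miss = 1.0
    for name, c in coverage.items():
        # A source contained in another selected source adds no new tuples.
        if any((name, other) in contained for other in coverage):
            continue
        miss *= 1.0 - c
    return 1.0 - miss


selected = {"s2": 0.5, "s3": 0.4}
independent = union_coverage(selected)                         # 1 - 0.5 * 0.6
nested = union_coverage(selected, [Overlap("s2", "s3", 0.8)])  # s3 inside s2
```

Under independence the second source raises the combined coverage; when s3 is a subset of s2, querying s3 as well adds nothing, which is the kind of information a collector could use to skip a source.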

Other Potentially Useful Information

The following types of information may also be of interest. Feel free to support any or all of these:

Data age/freshness
Time-to-live (for caching purposes)
Expected/average round-trip time to data source
Other costs (e.g. dollar costs) related to communicating with data source
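Since the catalog format is left open, one way to carry the optional items above is a record with optional fields. The sketch below is only an assumption about how such an entry might look: SourceEntry, its field names, the chosen units, and is_cache_fresh are all hypothetical.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class SourceEntry:
    name: str
    coverage: float                        # Pr(tuple is in this source)
    age_seconds: Optional[float] = None    # data age/freshness
    ttl_seconds: Optional[float] = None    # time-to-live, for caching
    round_trip_ms: Optional[float] = None  # expected round-trip time
    dollar_cost: Optional[float] = None    # other cost of contacting the source


def is_cache_fresh(entry, cached_at, now):
    """Return True if an answer cached at `cached_at` is still within its TTL.

    Times are in seconds on any common clock. A source with no TTL is
    treated as never cacheable, which is an assumption of this sketch.
    """
    if entry.ttl_seconds is None:
        return False
    return (now - cached_at) <= entry.ttl_seconds


s2 = SourceEntry("s2", coverage=0.5, ttl_seconds=60, round_trip_ms=120.0)
```

Leaving unsupported fields as None lets a project implement any subset of the optional information without changing the record layout.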

The Data

We will attempt to assist you in this area (but will probably ask for contributions from one or more of your group members). The goal is to "grab" a large amount of data (the proposed source at this point is Amazon.com) and extract it to a database. From this we will create a test suite generator program that will create numerous smaller databases with desired characteristics (including "completeness" and overlap). More information will be forthcoming.