From: Steven Balensiefer (alaska@cs.washington.edu)
Date: Mon May 03 2004 - 00:18:57 PDT
This paper discusses the theory behind answering queries by composing
information from distinct sources presenting logical views. In addition to
the theory there's also a discussion of two different systems that seek to
provide this kind of integration.
The first section was an overview of containment for conjunctive
queries, as well as how conjunctive queries are written in datalog.
I thought the examples were very helpful, and appreciated the treatment of
the material, but it was all results from prior work so I'm not going to
spend time talking about it.
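For readers who want the containment test in concrete form, here is a minimal sketch of the standard containment-mapping check for positive conjunctive queries: Q1 is contained in Q2 exactly when some mapping of Q2's variables sends Q2's head to Q1's head and every subgoal of Q2 to a subgoal of Q1. The tuple encoding, the uppercase-means-variable convention, and the names `find_mapping` and `contained_in` are assumptions made for illustration, not notation from the paper.

```python
# Atom: (predicate, (term, ...)). Query: (head_atom, [body_atom, ...]).
# Assumed convention: terms starting with an uppercase letter are variables.

def is_var(term):
    return term[:1].isupper()

def unify(atom_from, atom_to, mapping):
    """Extend mapping so atom_from maps onto atom_to, or return None."""
    pred_f, args_f = atom_from
    pred_t, args_t = atom_to
    if pred_f != pred_t or len(args_f) != len(args_t):
        return None
    mapping = dict(mapping)  # copy so backtracking stays clean
    for x, y in zip(args_f, args_t):
        if is_var(x):
            if mapping.setdefault(x, y) != y:
                return None  # variable already mapped elsewhere
        elif x != y:
            return None      # constants must match exactly
    return mapping

def find_mapping(q_from, q_to):
    """Backtracking search for a containment mapping from q_from to q_to."""
    head_f, body_f = q_from
    head_t, body_t = q_to
    start = unify(head_f, head_t, {})
    if start is None:
        return None

    def extend(i, mapping):
        if i == len(body_f):
            return mapping
        for target in body_t:
            candidate = unify(body_f[i], target, mapping)
            if candidate is not None:
                result = extend(i + 1, candidate)
                if result is not None:
                    return result
        return None

    return extend(0, start)

def contained_in(q1, q2):
    """True if q1 is contained in q2 (no negation, no comparisons)."""
    return find_mapping(q2, q1) is not None

# q1: p(X) :- e(X,Y), e(Y,X).    q2: p(X) :- e(X,Y).
q1 = (("p", ("X",)), [("e", ("X", "Y")), ("e", ("Y", "X"))])
q2 = (("p", ("X",)), [("e", ("X", "Y"))])
print(contained_in(q1, q2), contained_in(q2, q1))
```

The search is exponential in the worst case, which is expected since the problem is NP-complete; the sketch deliberately ignores arithmetic comparisons and negation.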
I thought the key idea of the paper was that the subgoals in any datalog
query must be covered by logical views, and that only views needed to
cover the subgoals should be used. I was disappointed that the theorems
about minimal size excluded arithmetic comparisons and negations. From
the material in the first part, I gathered that including those things
greatly complicates the containment calculations. Even so, limiting the
queries to only positive conjunctions seems to be a major constraint. I
can't claim to know its impact on the descriptive capability of the
language, but from a programmer's standpoint, restating everything would
appear to either require additional global predicates, or a number of
unintuitive rewrites.
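To make the "cover the subgoals" idea a little more concrete, the following is a rough sketch of a relevance pre-filter: it keeps only views whose definitions mention a global predicate the query uses, and reports query predicates that no view mentions at all. This is only a coarse necessary condition that I am assuming for illustration; it is not the construction behind the paper's theorems, and the names `relevant_views` and `uncovered_predicates` and the sample views are hypothetical.

```python
# Query body and view definitions are lists of (predicate, args) atoms
# over the global predicates. Views are given as {view_name: body}.

def relevant_views(query_body, views):
    """Keep views that mention at least one predicate used by the query."""
    needed = {pred for pred, _ in query_body}
    return {
        name: body
        for name, body in views.items()
        if any(pred in needed for pred, _ in body)
    }

def uncovered_predicates(query_body, views):
    """Query predicates that appear in no view definition at all."""
    available = {pred for body in views.values() for pred, _ in body}
    return {pred for pred, _ in query_body} - available

# Hypothetical example: grandparent(X,Y) :- parent(X,Z), parent(Z,Y).
query_body = [("parent", ("X", "Z")), ("parent", ("Z", "Y"))]
views = {
    "v1": [("parent", ("A", "B"))],
    "v2": [("employs", ("A", "B"))],
}
candidates = relevant_views(query_body, views)
missing = uncovered_predicates(query_body, views)
```

A non-empty result from `uncovered_predicates` means no rewriting over these views can exist; an empty result says nothing either way, so a real system would still need the containment test to confirm a candidate rewriting.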
When it came time to actually describe the process of integrating this
information and querying the "mediators" that coalesce the various views,
the earlier material on containment all made sense. The Information
Manifold seems like the more direct approach, and Ullman says that
it relies on the basic minimization technique presented in the minimal
solution theorems.
In contrast, the Tsimmis approach has the mediator export objects that
provide access to data contained in the views from the data sources.
Though Ullman showed an example where removing all access to the
underlying views was a mistake, he was quick to note that it was a
contrived example. The key to this approach, in my mind, was its ability to
handle semi-structured data (XML anyone?) and the way it dealt with the
presence or absence of subobjects.
One of the drawbacks to the whole Tsimmis approach was the requirement
that the mediator export the right data, something that could easily
change with the workload. Commercial database systems already have a
vast array of tuning "knobs", and this would simply add to that
number.
I'd expect that both of these methods for dealing with this problem
provided good ideas and even starting points for current research on the
information integration process. I think it's fair to say that this is a
hugely important area and that a major breakthrough in integrating
a variety of data sources will have major applications in all
fields from military operations to financial-planning to internet rumor
mills.