1


End of Query Optimization
Data Integration

May 24, 2004
2
Agenda
  • Questions?
  • Finish last bits of query optimization
  • Data integration: the last frontier
3
                   Query Execution
4
Query Execution Plans
5
We’ve Seen So Far
  • Transformation rules
  • The cost module:
    • Given a candidate plan: what is its expected cost and size of the result?



  • Now: putting it all together.
6
Plans for Single-Relation Queries
(Prep for Join ordering)
  • Task: create a query execution plan for a single Select-project-group-by block.
  • Key idea: consider each possible access path to the relevant tuples of the relation. Choose the cheapest one.
  • The different operations are essentially carried out together (e.g., if an index is used for a selection, projection is done for each retrieved tuple, and the resulting tuples are pipelined into the aggregate computation).
7
Example
  • If we have an Index on rating:
    • (1/NKeys(I)) * NTuples(R) = (1/10) * 40000 tuples retrieved.
    • Clustered index: (1/NKeys(I)) * (NPages(I)+NPages(R)) = (1/10) * (50+500) pages are retrieved (= 55).
    • Unclustered index: (1/NKeys(I)) * (NPages(I)+NTuples(R)) = (1/10) * (50+40000) pages are retrieved.
  • If we have an index on sid:
    • Would have to retrieve all tuples/pages.  With a clustered index, the cost is 50+500.
  • Doing a file scan: we retrieve all file pages (500).
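The arithmetic above can be checked with a minimal sketch (using the stats assumed in the example: NKeys(I) = 10, NPages(I) = 50, NPages(R) = 500, NTuples(R) = 40,000):

```python
def clustered_index_cost(nkeys, npages_index, npages_rel):
    """Clustered index: matching tuples are stored together, so we read
    a fraction of the index pages plus a fraction of the data pages."""
    return (1 / nkeys) * (npages_index + npages_rel)

def unclustered_index_cost(nkeys, npages_index, ntuples_rel):
    """Unclustered index: each matching tuple may live on a different
    page, so in the worst case it is one page I/O per matching tuple."""
    return (1 / nkeys) * (npages_index + ntuples_rel)

print(clustered_index_cost(10, 50, 500))      # 55.0 pages
print(unclustered_index_cost(10, 50, 40000))  # 4005.0 pages
```

The 55 vs. 4005 gap is why clustering matters so much for low-selectivity predicates.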
8
Determining Join Ordering
  • R1 ⋈ R2 ⋈ … ⋈ Rn
  • Join tree:







  • A join tree represents a plan. An optimizer needs to inspect many (all?) join trees.
9
Types of Join Trees
  • Left deep:
10
Types of Join Trees
  • Bushy:


11
Types of Join Trees
  • Right deep:


12
Problem
  • Given: a query R1 ⋈ R2 ⋈ … ⋈ Rn
  • Assume we have a function cost() that gives us the cost of every join tree
  • Find the best join tree for the query
13
Join Ordering by Dynamic Programming
  • Idea: for each subset of {R1, …, Rn}, compute the best plan for that subset
  • In increasing order of set cardinality:
    • Step 1: for {R1}, {R2}, …, {Rn}
    • Step 2: for {R1,R2}, {R1,R3}, …, {Rn-1, Rn}
    • …
    • Step n: for {R1, …, Rn}
  • A subset of {R1, …, Rn} is also called a subquery
14
Dynamic Programming: step 1
  • Step 1: For each {Ri} do:
    • Size({Ri}) = B(Ri)
    • Plan({Ri}) = Ri
    • Cost({Ri}) = (cost of scanning Ri)
15
Dynamic Programming: Step i
  • Step i: For each subset Q of {R1, …, Rn} of cardinality i do:
    • Compute Size(Q)
    • For every pair of subqueries Q’, Q’’
      s.t. Q = Q’ U Q’’
      compute cost(Plan(Q’) ⋈ Plan(Q’’))
    • Cost(Q) = the smallest such cost
    • Plan(Q) = the corresponding plan
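The two steps above can be sketched as a small dynamic program. This is an illustrative sketch only: `scan_cost` and `join_cost` are placeholder inputs standing in for the real cost module, and size estimation is omitted.

```python
from itertools import combinations

def best_join_plan(relations, scan_cost, join_cost):
    """Dynamic-programming join ordering, in increasing set cardinality.

    relations : list of relation names
    scan_cost : dict relation -> cost of scanning it (assumed input)
    join_cost : function(plan_left, plan_right) -> join cost (assumed input)
    """
    # Step 1: singletons — plan is the relation itself, cost is a scan.
    plan = {frozenset([r]): r for r in relations}
    cost = {frozenset([r]): scan_cost[r] for r in relations}
    n = len(relations)
    # Steps 2..n: every subquery Q, in increasing cardinality.
    for i in range(2, n + 1):
        for q in map(frozenset, combinations(relations, i)):
            best, best_plan = None, None
            # Every split Q = Q' U Q'' into two non-empty disjoint subqueries.
            for k in range(1, i):
                for q1 in map(frozenset, combinations(q, k)):
                    q2 = q - q1
                    c = cost[q1] + cost[q2] + join_cost(plan[q1], plan[q2])
                    if best is None or c < best:
                        best, best_plan = c, (plan[q1], plan[q2])
            cost[q], plan[q] = best, best_plan
    full = frozenset(relations)
    return plan[full], cost[full]
```

Smaller subqueries are always filled in before larger ones, so `cost[q1]` and `cost[q2]` are available when a split is considered — exactly the ordering of steps 1 through n on the slides.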
16
A few practical considerations
  • Heuristics for reducing the search space
    • Restrict to left linear trees
    • Restrict to trees “without cartesian product”
  • Need more than just one plan for each subquery:
    • “interesting orders”: save a single plan for every possible ordering of the result.
    • Why?
17
Query Optimization Summary
  • Create initial (naïve) query execution plan.
  • Apply transformation rules:
    • Try to un-nest blocks
    • Move predicates and grouping operators.
  • Consider each block at a time:
    • Determine join order
    • Push selections, projections if possible.
18
Data Integration
19
What is Data Integration
  • Providing
    • Uniform (same query interface to all sources)
    • Access to (queries; eventually updates too)
    • Multiple  (we want many, but 2 is hard too)
    • Autonomous (DBA doesn’t report to you)
    • Heterogeneous (data models are different)
    • Structured (or at least semi-structured)
    • Data Sources (not only databases).
20
The Problem: Data Integration
  • Uniform query capability across autonomous, heterogeneous data sources on LAN, WAN, or Internet
21
Motivation(s)
  • Enterprise data integration; web-site construction.
  • WWW:
    • Comparison shopping
    • Portals integrating data from multiple sources
    • B2B, electronic marketplaces
  • Science and culture:
    • Medical genetics: integrating genomic data
    • Astrophysics: monitoring events in the sky.
    • Environment: Puget Sound Regional Synthesis Model
    • Culture: uniform access to all cultural databases produced by countries in Europe.
22
Discussion
  • Why is it hard?


  • How will we solve it?
23
Current Solutions
  • Mostly ad-hoc programming: create a special solution for every case; pay consultants a lot of money.
  • Data warehousing: load all the data periodically into a warehouse.
    • 6-18 months lead time
    • Separates operational DBMS from decision support DBMS. (not only a solution to data integration).
    • Performance is good; data may not be fresh.
    • Need to clean and scrub your data.
24
Data Warehouse Architecture
25
The Virtual Integration Architecture
  • Leave the data in the sources.
  • When a query comes in:
    • Determine the relevant sources to the query
    • Break down the query into sub-queries for the sources.
    • Get the answers from the sources, and combine them appropriately.
  • Data is fresh.
  • Challenge: performance.
26
Virtual Integration Architecture
27
Data Integration:
Higher-level Abstraction
28
 
29
Research Projects
  • Garlic (IBM),
  • Information Manifold (AT&T)
  • Tsimmis, InfoMaster (Stanford)
  • The Internet Softbot/Razor/Tukwila (UW)
  • Hermes (Maryland)
  • DISCO, Agora (INRIA, France)
  • SIMS/Ariadne (USC/ISI)
  • Many, many more!


30
Semantic Mappings
  • Differences in:
    • Names in schema
    • Attribute grouping









    • Coverage of databases
    • Granularity and format of attributes
31
Issues for Semantic Mappings
32
Beyond Data Integration
  • Mediated schema is a bottleneck for large-scale data sharing




  • It’s hard to create, maintain, and agree upon.
33
Peer Data Management Systems
34
PDMS-Related Projects
  • Hyperion (Toronto)
  • PeerDB (Singapore)
  • Local relational models (Trento)
  • Edutella (Hannover, Germany)
  • Semantic Gossiping (EPFL)
  • Raccoon (UC Irvine)
  • Orchestra (Ives, U. Penn)
35
A Few Comments about Commerce
  • Until 5 years ago:
    • Data integration = Data warehousing.
  • Since then:
    • A wave of startups:
      • Nimble, MetaMatrix, Calixa, Composite, Enosys
    • Big guys made announcements (IBM, BEA).
    • After a delay, big guys released products.
  • Success: analysts have new buzzword – EII
    • New addition to acronym soup (with EAI).
  • Lessons:
    • Performance was fine. Need management tools.

36
Data Integration: Before
37
Data Integration: After
38
Sound Business Models
  • Explosion of intranet and extranet information
  • 80% of corporate information is unmanaged
  • By 2004, 30X more enterprise data than in 1999
  • The average company:
    • maintains 49 distinct enterprise applications
    • spends 35% of total IT budget on integration-related efforts

40
Dimensions to Consider
  • How many sources are we accessing?
  • How autonomous are they?
  • Meta-data about sources?
  • Is the data structured?
  • Queries or also updates?
  • Requirements: accuracy, completeness, performance, handling inconsistencies.
  • Closed world assumption vs. open world?
41
Outline
  • Wrappers
  • Semantic integration and source descriptions:
    • Modeling source completeness
    • Modeling source capabilities
  • Query optimization
  • Query execution
  • Peer-data management systems
  • Creating schema mappings
42
Wrapper Programs
  • Task: to communicate with the data sources and do format translations.
  • They are built w.r.t. a specific source.
  • They can sit either at the source or at the mediator.
  • Often hard to build (very little science).
  • Can be “intelligent”: perform source-specific optimizations.
43
Example
44
Data Source Catalog
  • Contains all meta-information about the sources:
    • Logical source contents (books, new cars).
    • Source capabilities (can answer SQL queries)
    • Source completeness (has all books).
    • Physical properties of source and network.
    • Statistics about the data (like in an RDBMS)
    • Source reliability
    • Mirror sources
    • Update frequency.
45
Content Descriptions
  • User queries refer to the mediated schema.
  • Data is stored in the sources in a local schema.
  • Content descriptions provide the semantic mappings between the different schemas.
  • Data integration system uses the descriptions to translate user queries into queries on the sources.
46
Desiderata from Source Descriptions
  • Expressive power: distinguish between sources with closely related data. Hence, be able to prune access to irrelevant sources.
  • Easy addition: make it easy to add new data sources.
  • Reformulation: be able to reformulate a user query into a query on the sources efficiently and effectively.
47
Reformulation Problem
  • Given:
    • A query Q posed over the mediated schema
    • Descriptions of the data sources
  • Find:
    • A query Q’ over the data source relations, such that:
      • Q’ provides only correct answers to Q, and
      • Q’ provides all possible answers to Q given the sources.
48
Languages for Schema Mapping
49
Global-as-View
  • Mediated schema:
  •     Movie(title, dir, year, genre),
  •     Schedule(cinema, title, time).
  • Create View Movie AS
  •    select *  from S1     [S1(title,dir,year,genre)]
  •    union
  •    select  * from S2     [S2(title, dir,year,genre)]
  •    union                       [S3(title,dir), S4(title,year,genre)]
  •    select S3.title, S3.dir, S4.year, S4.genre
  •    from  S3, S4
  •    where S3.title=S4.title



50
Global-as-View: Example 2
  • Mediated schema:
  •     Movie(title, dir, year, genre),
  •     Schedule(cinema, title, time).


  • Create View Movie AS [S1(title,dir,year)]
  •    select title, dir, year, NULL
  •    from S1
  •    union                          [S2(title, dir,genre)]
  •    select  title, dir, NULL, genre
  •    from S2


51
Global-as-View: Example 3
  • Mediated schema:
  •     Movie(title, dir, year, genre),
  •     Schedule(cinema, title, time).
  • Source S4:   S4(cinema, genre)
  • Create View Movie AS
  •    select NULL, NULL, NULL, genre
  •    from S4
  •  Create View Schedule AS
  •     select cinema, NULL, NULL
  •     from S4
  • But what if we want to find which cinemas are playing comedies?


52
Global-as-View Summary
  • Query reformulation boils down to view unfolding.
  • Very easy conceptually.
  • Can build hierarchies of mediated schemas.
  • You sometimes lose information. Not always natural.
  • Adding sources is hard. Need to consider all other sources that are available.
53
Local-as-View (LAV)
54
Query Reformulation
55
Query Reformulation
56
Local-as-View: example 1
  • Mediated schema:
  •     Movie(title, dir, year, genre),
  •     Schedule(cinema, title, time).
  • Create Source S1 AS
  •    select * from Movie
  • Create Source S3 AS     [S3(title, dir)]
  •    select title, dir from Movie
  • Create Source S5 AS
  •    select title, dir, year
  •    from Movie
  •    where year > 1960 AND genre=“Comedy”
57
Local-as-View: Example 2
  • Mediated schema:
  •     Movie(title, dir, year, genre),
  •     Schedule(cinema, title, time).
  • Source S4:   S4(cinema, genre)
  • Create Source S4
  •    select cinema, genre
  •    from Movie m, Schedule s
  •    where m.title=s.title
  • Now if we want to find which cinemas are playing comedies, there is hope!


58
Local-as-View Summary
  • Very flexible. You have the power of the entire query language to define the contents of the source.
  • Hence, can easily distinguish between contents of closely related sources.
  • Adding sources is easy: they’re independent of each other.
  • Query reformulation: answering queries using views!
59
The General Problem
  • Given a set of views V1,…,Vn, and a query Q, can we answer Q using only the answers to V1,…,Vn?
  • Many, many papers on this problem.
  • The best performing algorithm: the MiniCon algorithm (Pottinger & Levy, 2000).
  • Great survey on the topic: (Halevy, 2001).
60
Local Completeness Information
  • If sources are incomplete, we need to look at each one of them.
  • Often, sources are locally complete.
  • Movie(title, director, year) complete for years after 1960, or for American directors.
  • Question: given a set of local completeness statements, is a query Q’ a complete answer to Q?
61
Example
  • Movie(title, director, year) (complete after 1960).
  • Show(title, theater, city, hour)
  • Query: find movies (and directors) playing in Seattle:
  •     Select m.title, m.director
  •     From Movie m, Show s
  •     Where m.title=s.title AND city=“Seattle”
  • Complete or not?
62
Example #2
  • Movie(title, director, year), Oscar(title, year)
  • Query: find directors whose movies won Oscars after 1965:
  •     select m.director
  •     from Movie m, Oscar o
  •     where m.title=o.title AND m.year=o.year AND  o.year > 1965.
  • Complete or not?
63
Query Optimization
  • Very related to query reformulation!
  • Goal of the optimizer: find a physical plan with minimal cost.
  • Key components in optimization:
    • Search space of plans
    • Search strategy
    • Cost model
64
Optimization in Distributed DBMS
  • A distributed database (2-minute tutorial):
    • Data is distributed over multiple nodes, but is uniform.
    • Query execution can be distributed to sites.
    • Communication costs are significant.
  • Consequences for optimization:
    • Optimizer needs to decide locality
    • Need to exploit independent parallelism.
    • Need operators that reduce communication costs (semi-joins).
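A semi-join reduction can be sketched as follows (illustrative Python, not a real DDBMS operator API; tuples are indexed positionally, which is an assumption of this sketch):

```python
def semi_join_reduce(r_tuples, s_tuples, r_key, s_key):
    """Semi-join reduction: instead of shipping all of S to R's site,
    ship only R's join-column values to S's site, and return only the
    S tuples that will actually join."""
    # Site of R projects its join column (small) and ships it across.
    r_keys = {t[r_key] for t in r_tuples}
    # Site of S sends back only matching tuples (often far fewer).
    return [t for t in s_tuples if t[s_key] in r_keys]

r = [("a", 1), ("b", 2)]
s = [(1, "x"), (2, "y"), (3, "z"), (4, "w")]
reduced = semi_join_reduce(r, s, r_key=1, s_key=0)  # [(1, 'x'), (2, 'y')]
```

Only a column of keys and the matching half of S cross the network, which is the point when communication dominates the cost model.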


65
DDBMS vs. Data Integration
  • In a DDBMS, data is distributed over a set of uniform sites with precise rules.
  • In a data integration context:
    • Data sources may provide only limited access patterns to the data.
    • Data sources may have additional query capabilities.
    • Cost of answering queries at sources unknown.
    • Statistics about data unknown.
    • Transfer rates unpredictable.
66
Modeling Source Capabilities
  • Negative capabilities:
    • A web site may require certain inputs (in an HTML form).
    • Need to consider only valid query execution plans.
  • Positive capabilities:
    • A source may be an ODBC compliant system.
    • Need to decide placement of operations according to capabilities.
  • Problem: how to describe and exploit source capabilities.
67
Example #1: Access Patterns
  • Mediated schema relation: Cites(paper1, paper2)


  • Create Source S1 as
  •    select *
  •    from Cites
  •    given paper1
  • Create Source S2 as
  •     select paper1
  •     from Cites
  • Query: select paper1 from Cites where paper2=“Hal00”
68
Example #1: Continued
  • Create Source S1 as
  •    select *
  •    from Cites
  •    given paper1
  • Create Source S2 as
  •     select paper1
  •     from Cites
  •  Select p1
  •  From S1, S2
  •  Where S2.paper1=S1.paper1 AND S1.paper2=“Hal00”
69
Example #2: Access Patterns
  • Create Source S1 as
  •    select *
  •    from Cites
  •    given paper1
  • Create Source S2 as
  •     select paperID
  •     from UW-Papers
  • Create Source S3 as
  •      select paperID
  •      from  AwardPapers
  •      given paperID
  • Query: select * from AwardPapers
70
Example #2: Solutions
  • Can’t go directly to S3 because it requires a binding.
  • Can go to S1, get UW papers, and check if they’re in S3.
  • Can go to S1, get UW papers, feed them into S2, and feed the results into S3.
  • Can go to S1, feed results into S2, feed results into S2 again, and then feed results into S3.
  • Strictly speaking, we can't decide a priori when to stop.
  • Need recursive query processing.
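The "feed the results back in" strategy is a fixpoint computation. A sketch, where `s1_cites`, `s2_uw_papers`, and `s3_is_award` are hypothetical stand-ins for wrapper calls to S1, S2, and S3:

```python
def find_award_papers(s1_cites, s2_uw_papers, s3_is_award):
    """Fixpoint evaluation under access-pattern restrictions.

    s1_cites(p)    -> papers cited by p        (S1: needs paper1 bound)
    s2_uw_papers   -> iterable of UW paper IDs (S2: no binding needed)
    s3_is_award(p) -> True if p won an award   (S3: needs paperID bound)
    """
    known = set(s2_uw_papers)      # seed bindings from the unrestricted source
    frontier = set(known)
    while frontier:                # keep feeding newly found IDs back into S1
        new = set()
        for p in frontier:
            new |= set(s1_cites(p)) - known
        known |= new
        frontier = new             # stop only when no new papers appear
    # Now every reachable paper is bound, so we may legally probe S3.
    return {p for p in known if s3_is_award(p)}
```

The loop terminates when an iteration discovers nothing new — exactly the stopping condition that a fixed, non-recursive plan cannot express.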
71
Handling Positive Capabilities
  • Characterizing positive capabilities:
    • Schema independent (e.g., can always perform joins, selections).
    • Schema dependent: can join R and S, but not T.
    • Given a query, tells you whether it can be handled.
  • Key issue: how do you search for plans?
  • Garlic approach (IBM): Given a query, STAR rules determine which subqueries are executable by the sources. Then proceed bottom-up as in System-R.
72
Matching Objects Across Sources
  • How do I know that A. Halevy in source 1 is the same as Alon Halevy in source 2?
  • If there are uniform keys across sources, no problem.
  • If not:
    • Domain specific solutions (e.g., maybe look at the address, ssn).
    • Use Information retrieval techniques (Cohen, 98). Judge similarity as you would between documents.
    • Use concordance tables. These are time-consuming to build, but you can then sell them for lots of money.
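The IR-style approach can be sketched with a simple token-overlap score (Jaccard similarity here, as a cheap stand-in for the TF-IDF/cosine measures used in (Cohen, 98)):

```python
def token_similarity(a, b):
    """Treat each record as a bag of lowercase tokens and score their
    overlap; 1.0 means identical token sets, 0.0 means disjoint."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    return len(ta & tb) / len(ta | tb)

# "A. Halevy" vs. "Alon Halevy": the shared surname token gives a
# nonzero score even though the strings are not equal.
score = token_similarity("A. Halevy", "Alon Halevy")  # 1/3
```

Real matchers weight rare tokens (like a surname) more heavily than common ones, which is where TF-IDF improves on plain overlap.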
73
Optimization and Execution
  • Problem:
    • Few and unreliable statistics about the data.
    • Unexpected (possibly bursty) network transfer rates.
    • Generally, unpredictable environment.
  • General solution: (research area)
    • Adaptive query processing.
    • Interleave optimization and execution. As you get to know more about your data, you can improve your plan.
74
Tukwila Data Integration System
  • Novel components:
    • Event handler
    • Optimization-execution loop
75
Double Pipelined Join (Tukwila)
  • Hash Join
    • Partially pipelined: no output until inner read
    • Asymmetric (inner vs. outer) — optimization requires source behavior knowledge
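A sketch of the symmetric variant Tukwila uses: build a hash table on both inputs, probe the opposite table as each tuple arrives, and emit matches immediately, so neither input plays a fixed inner/outer role. The round-robin interleaving below is a simplifying assumption; a real executor consumes whichever source has data ready.

```python
def double_pipelined_join(stream_a, stream_b, key_a, key_b):
    """Symmetric (double-pipelined) hash join over two equal-length
    tuple streams; keys are positional indexes (a sketch convention)."""
    table_a, table_b = {}, {}
    out = []
    for ta, tb in zip(stream_a, stream_b):
        # Insert ta into A's table, then probe B's table for matches.
        table_a.setdefault(ta[key_a], []).append(ta)
        out += [(ta, m) for m in table_b.get(ta[key_a], [])]
        # Insert tb into B's table, then probe A's table for matches.
        table_b.setdefault(tb[key_b], []).append(tb)
        out += [(m, tb) for m in table_a.get(tb[key_b], [])]
    return out
```

Because results stream out as soon as either side produces a matching tuple, the optimizer no longer has to guess which source is faster in order to pick an inner relation.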
76
Semantic Mappings
  • Need mappings in every data sharing architecture











  • “Standards are great, but there are too many.”
77
Why is it so Hard?
  • Schemas never fully capture their intended meaning:
    • We need to leverage any additional information we may have.
  • A human will always be in the loop.
    • Goal is to improve designer’s productivity.
    • Solution must be extensible.
  • Two cases for schema matching:
    • Find a map to a common mediated schema.
    • Find a direct mapping between two schemas.
78
Typical Matching Heuristics
  • We build a model for every element from multiple sources of evidence in the schemas
    • Schema element names
      • BooksAndCDs/Categories ~ BookCategories/Category
    • Descriptions and documentation
      • ItemID: unique identifier for a book or a CD
      • ISBN: unique identifier for any book
    • Data types, data instances
      • DateTime ≠ Integer,
      • addresses have similar formats
    • Schema structure
      • All books have similar attributes


79
Using Past Experience
  • Matching tasks are often repetitive
  • Humans improve over time at matching.
    • A matching system should improve too!




  • LSD:
    • Learns to recognize elements of mediated schema.
    • [Doan, Domingos, H., SIGMOD-01, MLJ-03]
      • Doan: 2003 ACM Distinguished Dissertation Award.


80
Example: Matching Real-Estate Sources
81
Learning Source Descriptions
  • We learn a classifier for each element of the mediated schema.
  • Training examples are provided by the given mappings.
  • Multi-strategy learning:
    • Base learners: name, instance, description
    • Combine using stacking.
  • Accuracy of 70-90% in experiments.
  • Learning about the mediated schema.
82
Multi-Strategy Learning


  • Use a set of base learners:
    • Name learner, Naïve Bayes, Whirl, XML learner
  • And a set of recognizers:
    • County name, zip code, phone numbers.
  • Each base learner produces a prediction weighted by confidence score.
  • Combine base learners with a meta-learner, using stacking.


83
The Semantic Web
  • A web of structured data:
    • The five-year-old vision of Tim Berners-Lee
  • How does it relate to data integration?
  • How are we going to do it?
  • Why should we do it? Do we need a killer app or is the semantic web a killer app?
84
The End