CSE 444 Homework 3

Objectives:: To understand the role of the relational algebra; learn to explain why queries are slow or fast.
Reading assignments:: Chapters 5.1, 5.2 (relational algebra), and Chapters 15.1, 15.2, 15.3, 15.4, 15.5, 15.6 (query execution). Read the lecture notes.
Number of points:: 100
Due date:: February 25, 2011

Questions:

[25 points] Which of the following identities are correct? In each case, either answer YES, or answer NO and give a counterexample:
1. δ(δ(R)) = δ(R)
2. Π_L(Π_L(R)) = Π_L(R), where L = a set of attributes
3. Π_K(Π_L(R)) = Π_K,L(R), where K, L = sets of attributes
4. σ_C(σ_C(R)) = σ_C(R), where C = a condition
5. γ_{L, agg(A)}(γ_{L, agg(A)}(R)) = γ_{L, agg(A)}(R), where L = a set of group-by attributes, agg is an aggregate operator, and A is an attribute
[25 points] Consider the following relational schema, representing a database of documents and a set of keywords:

Doc(did, docTitle)
Occurs(did, wid)
Word(wid, word)
Keywords(word)

For each of the queries below, show an equivalent relational algebra plan:
1. SELECT z.word, count(*)
  FROM Doc x, Occurs y, Word z
  WHERE x.did = y.did and y.wid = z.wid
  GROUP BY z.wid, z.word
2. SELECT x.did, x.docTitle
  FROM Doc x
  WHERE not exists (SELECT *
                                 FROM Keywords u
                                 WHERE not exists (SELECT *
                                                                FROM Occurs y, Word z
                                                                WHERE x.did=y.did and y.wid=z.wid
                                                                and z.word = u.word))
  
  Hint: Try to express what this query does in plain English before constructing the query plan.
[25 points] (based on Exercise 15.6.2, pp. 745 in the textbook) Suppose B(R) = 10,000 and T(R) = 500,000. The table is clustered. Let there be an index on R.a, and let V(R,a) = k, for some number k. Give the cost of σ_a=0(R) as a function of k, under the following circumstances. You may neglect the disk I/O's needed to access the index itself.
1. The index is clustered (the book says the index is clustering; it is the same thing).
2. The index is not clustered.
3. No index is used.
[25 points] Consider the two queries below:
-- q1
select c.mid, COUNT(*)
from Movie m, Casts c
where m.id = c.mid
group by c.mid

-- q2
select m.name, COUNT(*)
from Movie m, Casts c
where m.id = c.mid
group by m.name
1. Run these two queries on the IMDB database on IISQLSRV; one is about twice as fast as the other. Examine the query plans for these two queries, and explain the difference in running time.
2. The IMDB_FOREIGNKEYS database is identical to IMDB but has foreign keys declared in Casts. Examine the plans for the two queries on the IMDB_FOREIGNKEYS database, and explain which one changed from the IMDB database and why.