Homework 3

Total points: 100

Due date: Wednesday, May 19 2010. Hardcopy in Class.

What to read: Chapters 5.1, 5.2 (Relational Algebra), and Chapters 15.1, 15.2, 15.3, 15.4, 15.5, 15.6 (Query execution). Read the lecture notes.

What you will learn: understand the role of the relational algebra; learn to explain why queries are slow or fast.

1. Which of the following identities are correct? In each case, either answer YES, or answer NO and give a counterexample:

a. δ(δ(R)) = δ(R)

b. Π_L(Π_L(R))= Π_L(R), where L = a set of attributes

c. Π_K(Π_L(R))= Π_K,L(R), where K, L = sets of attributes

d. σ_C(σ_C(R)) = σ_C(R), where C = a condition

e. Υ_{L, agg(A)}(Υ_L,
agg(A)(R)) = Υ_{L, agg(A)}(R), where L = a set of group-by attributes, agg is an aggregate operator, and A is an attribute

2. Consider the following relational schema, representing a database of documents and a set of keywords:

Doc(did, docTitle)
Occurs(did, wid)
Word(wid, word)
Keywords(word)

For each of the queries below, show an equivalent relational algebra plan:

a.       SELECT z.word, count(*)
      FROM Doc x, Occurs y, Word z
      WHERE x.did = y.did and y.wid = z.wid
      GROUP BY z.wid, z.word

b.       SELECT x.did, x.docTitle
      FROM Doc x
      WHERE not exists (     SELECT *
                                        FROM Keywords u
                                        WHERE not exists (     SELECT *
                                                                          FROM Occurs y, Word z
                                                                          WHERE x.did=y.did and y.wid=z.wid
                                                                         and z.word = u.word))

3. (based on Exercise 15.6.2, pp. 745 in the textbook) Suppose B(R) = 10,000 and T(R) = 500,000. The table is clustered. Let there be an index on R.a, and let V(R,a) = k, for some number k. Give the cost of σa=0(R) as a function of k, under the following circumstances. You may neglect the disk I/O’s needed to access the index itself.

a. The index is clustered (the book says the index is clustering; it is the same thing)

b. The index is not clustered

c. No index is used.

4. Consider the two queries below:
-- q1
select c.mid, COUNT(*)
from Movie m, Casts c
where m.id = c.mid
group by c.mid

-- q2
select m.name, COUNT(*)
from Movie m, Casts c
where m.id = c.mid
group by m.name

a. Run these two queries on the IMDB database on SQL Server: one is about twice faster than the other. Examine the query plans for these two queries, and explain the difference in running time.

b. The IMDB_FOREIGNKEYS database is identical to IMDB but has foreign keys declared in Casts. Examine the plans for the two queries on the IMDB_FOREIGNKEYS database, and explain which one changed from the IMDB database and why.