Homework 6

Total points: 100

Due date: June 2nd, 6:30pm (in class or use the dropbox)

What you will learn: understand the role of the relational algebra; learn to explain why queries are slow or fast.

1. Which of the following identities are correct? In each case, either answer YES, or answer NO and give a counterexample:

a. δ(δ(R)) = δ(R)

b. Π_L(Π_L(R))= Π_L(R), where L = a set of attributes

c. Π_K(Π_L(R))= Π_K,L(R), where K, L = sets of attributes

d. σ_C1(σ_C2(R)) = σ_{C1 and C2}(R), where C1, C2 are conditions

e. Υ_{L, agg(A)}(Υ_{L, agg(A)}(R)) = Υ_{L, agg(A)}(R), where L = a set of group-by attributes, agg is an aggregate operator, and A is an attribute

2. Consider the following relational schema, representing a database of documents:

Doc(did, docTitle)
Occurs(did, wid)
Word(wid, word)

For the following query, show an equivalent relational algebra plan:

    SELECT z.word, count(*)
    FROM Doc x, Occurs y, Word z
    WHERE x.did = y.did and y.wid = z.wid
    GROUP BY z.wid, z.word

3. Suppose B(R) = 10,000 and T(R) = 500,000. The table is clustered. Let there be an index on R.a, and let V(R,a) = k, for some number k. Give the cost of σ_a=3(R) as a function of k, under the following circumstances. You may neglect the disk I/O’s needed to access the index itself.

a. The index is clustered

b. The index is not clustered

c. No index is used.

4. Consider the two queries below:
-- q1
select c.mid, COUNT(*)
from Movie m, Casts c
where m.id = c.mid
group by c.mid

-- q2
select m.name, COUNT(*)
from Movie m, Casts c
where m.id = c.mid
group by m.name

a. Run these two queries on the IMDB database on SQL Server: one is about twice faster than the other. Examine the query plans for these two queries, and explain the difference in running time.

b. The Casts_FK table is identical to Casts but has foreign keys declared to the Movie and Actor tables. Examine the plans for the two queries before and after replacing Casts with Casts_FK. Which one changed and why.