Homework  3

 

Total points: 100

Due date: Wednesday, May 19 2010. Hardcopy in Class.

What to read: Chapters 5.1, 5.2 (Relational Algebra), and Chapters 15.1,  15.2, 15.3, 15.4, 15.5, 15.6 (Query execution).  Read the lecture notes.

What you will learn: understand the role of the relational algebra; learn to explain why queries are slow or fast.

 

 

1. Which of the following identities are correct? In each case, either answer YES, or answer NO and give a counterexample:

a.  δ(δ(R)) = δ(R)

b.   ΠLL(R))= ΠL(R),   where L = a set of attributes

c.  ΠKL(R))= ΠK,L(R),   where K, L = sets of attributes

d.   σCC(R)) = σC(R),   where C = a condition

e.   ΥL, agg(A)L, agg(A)(R)) = ΥL, agg(A)(R),   where L = a set of group-by attributes, agg is an aggregate operator, and A is an attribute


2. Consider the following relational schema, representing a database of documents and a set of keywords:

Doc(did, docTitle)
Occurs(did, wid)
Word(wid, word)
Keywords(word)

For each of the queries below, show an equivalent relational algebra plan:

a.       SELECT z.word, count(*)
      FROM Doc x, Occurs y, Word z
      WHERE x.did = y.did and y.wid = z.wid
      GROUP BY z.wid, z.word


b.       SELECT x.did, x.docTitle
      FROM Doc x
      WHERE not exists (     SELECT *
                                        FROM Keywords u
                                        WHERE not exists (     SELECT *
                                                                          FROM Occurs y, Word z
                                                                          WHERE x.did=y.did and y.wid=z.wid
                                                                           and z.word = u.word))

3. (based on Exercise 15.6.2, pp. 745 in the textbook) Suppose B(R) = 10,000 and T(R) = 500,000. The table is clustered.  Let there be an index on R.a, and let V(R,a) = k, for some number k. Give the cost of σa=0(R) as a function of k, under the following circumstances. You may neglect the disk I/OÕs needed to access the index itself.

a.  The index is clustered (the book says the index is clustering; it is the same thing)

b.  The index is not clustered

c.  No index is used.


4. Consider the two queries below:
-- q1
select c.mid, COUNT(*)
from Movie m, Casts c
where m.id = c.mid
group by c.mid

-- q2
select m.name, COUNT(*)
from Movie m, Casts c
where m.id = c.mid
group by m.name

a. Run these two queries on the IMDB database on SQL Server: one is about twice faster than the other.  Examine the query plans for these two queries, and explain the difference in running time.

b. The IMDB_FOREIGNKEYS database is identical to IMDB but has foreign keys declared in Casts. Examine the plans for the two queries on the IMDB_FOREIGNKEYS database, and explain which one changed from the IMDB database and why.