# CSE 444 Homework 3

Objectives:
To understand the role of the relational algebra; learn to explain why queries are slow or fast.
Chapters 5.1, 5.2 (relational algebra), and Chapters 15.1, 15.2, 15.3, 15.4, 15.5, 15.6 (query execution). Read the lecture notes.
Number of points:
100
Due date:
February 25, 2011

Questions:

1. [25 points] Which of the following identities are correct? In each case, either answer YES, or answer NO and give a counterexample:

1. δ(δ(R)) = δ(R)

2. ΠLL(R)) = ΠL(R), where L = a set of attributes

3. ΠKL(R)) = ΠK,L(R), where K, L = sets of attributes

4. σCC(R)) = σC(R), where C = a condition

5. γL, agg(A)L, agg(A)(R)) = γL, agg(A)(R), where L = a set of group-by attributes, agg is an aggregate operator, and A is an attribute

2. [25 points] Consider the following relational schema, representing a database of documents and a set of keywords:

Doc(did, docTitle)
Occurs(did, wid)
Word(wid, word)
Keywords(word)

For each of the queries below, show an equivalent relational algebra plan:

1. SELECT z.word, count(*)
FROM Doc x, Occurs y, Word z
WHERE x.did = y.did and y.wid = z.wid
GROUP BY z.wid, z.word

2. SELECT x.did, x.docTitle
FROM Doc x
WHERE not exists (SELECT *
FROM Keywords u
WHERE not exists (SELECT *
FROM Occurs y, Word z
WHERE x.did=y.did and y.wid=z.wid
and z.word = u.word))

Hint: Try to express what this query does in plain English before constructing the query plan.

3. [25 points] (based on Exercise 15.6.2, pp. 745 in the textbook) Suppose B(R) = 10,000 and T(R) = 500,000. The table is clustered. Let there be an index on R.a, and let V(R,a) = k, for some number k. Give the cost of σa=0(R) as a function of k, under the following circumstances. You may neglect the disk I/O's needed to access the index itself.
1. The index is clustered (the book says the index is clustering; it is the same thing).
2. The index is not clustered.
3. No index is used.

4. [25 points] Consider the two queries below:
-- q1
select c.mid, COUNT(*)
from Movie m, Casts c
where m.id = c.mid
group by c.mid

-- q2
select m.name, COUNT(*)
from Movie m, Casts c
where m.id = c.mid
group by m.name

1. Run these two queries on the IMDB database on IISQLSRV; one is about twice as fast as the other. Examine the query plans for these two queries, and explain the difference in running time.
2. The IMDB_FOREIGNKEYS database is identical to IMDB but has foreign keys declared in Casts. Examine the plans for the two queries on the IMDB_FOREIGNKEYS database, and explain which one changed from the IMDB database and why.