CSE 444 Homework 3

Total points: 100

Due date: Wednesday, Dec. 2 at 11 pm.

What to review: lectures 15 through 20

What you will learn: understand the role of relational algebra; learn to explain why queries are slow or fast.

(We apologize for the crummy formatting of this writeup. Someone edited the html page using Microsoft Word and we don't have the time to remove all the garbage formatting codes that wound up in the file.)

1. Consider the following B+ Tree:

b+ tree example

Show the state of the B+ tree after inserting elements with key values: 51, 52, and 16.
Starting from the original tree above, show the state of the B+ tree after removing 65 and then 60. If a node has too few keys, you can only move keys between siblings (another node with the same parent). If no such node has enough keys, you need to merge nodes.

2. Which of the following identities are correct? In each case, either answer YES, or answer NO and give a counterexample:

a. δ(δ(R)) = δ(R)

b. Π_L(Π_L(R))= Π_L(R), where L = a set of attributes

c. Π_K(Π_L(R))= Π_K,L(R), where K, L = sets of attributes

d. σ_C1(σ_C2(R)) = σ_{C1 and C2}(R), where C1 = a condition and C2 = a condition

e. Υ_{L, agg(A)}(Υ_L,
agg(A)(R)) = Υ_{L, agg(A)}(R), where L = a set of group-by attributes, agg is an aggregate operator, and A is an attribute

3. Consider the following relational schema, representing a database of documents:

Doc(did, docTitle)
Occurs(did, wid)
Word(wid, word)

For the following query, show an equivalent relational algebra plan:

SELECT z.word, count(*)
FROM Doc x, Occurs y, Word z
WHERE x.did = y.did and y.wid = z.wid and wid > 100
GROUP BY z.wid, z.word

4. (based on Exercise 15.6.2, pp. 745 in the textbook) Suppose B(R) = 10,000 and T(R) = 500,000. The table is clustered. Let there be an index on R.a, and let V(R,a) = k, for some number k. Give the cost of σa=3(R) as a function of k, under the following circumstances. You may neglect the disk I/O’s needed to access the index itself.

a. The index is clustered (the book says the index is clustering; it is the same thing)

b. The index is not clustered

c. No index is used.

5. Consider the two queries below:
-- q1
select c.mid, COUNT(*)
from Movie m, Casts c
where m.id = c.mid
group by c.mid

-- q2
select m.name, COUNT(*)
from Movie m, Casts c
where m.id = c.mid
group by m.name

Run these two queries on the IMDB database on SQL Server: one is about twice faster than the other. Examine the query plans for these two queries, and explain the difference in running time.