CSE 444 Homework 3

Objectives:
To understand the role of the relational algebra; learn to explain why queries are slow or fast.
Reading assignments:
Chapters 5.1, 5.2 (relational algebra), and Chapters 15.1, 15.2, 15.3, 15.4, 15.5, 15.6 (query execution). Read the lecture notes.
Number of points:
100
Due date:
February 25, 2011

Questions:

  1. [25 points] Which of the following identities are correct? In each case, either answer YES, or answer NO and give a counterexample:

    1. δ(δ(R)) = δ(R)

    2. ΠLL(R)) = ΠL(R), where L = a set of attributes

    3. ΠKL(R)) = ΠK,L(R), where K, L = sets of attributes

    4. σCC(R)) = σC(R), where C = a condition

    5. γL, agg(A)L, agg(A)(R)) = γL, agg(A)(R), where L = a set of group-by attributes, agg is an aggregate operator, and A is an attribute

  2. [25 points] Consider the following relational schema, representing a database of documents and a set of keywords:

    Doc(did, docTitle)
    Occurs(did, wid)
    Word(wid, word)
    Keywords(word)

    For each of the queries below, show an equivalent relational algebra plan:

    1. SELECT z.word, count(*)
      FROM Doc x, Occurs y, Word z
      WHERE x.did = y.did and y.wid = z.wid
      GROUP BY z.wid, z.word

    2. SELECT x.did, x.docTitle
      FROM Doc x
      WHERE not exists (SELECT *
                                     FROM Keywords u
                                     WHERE not exists (SELECT *
                                                                    FROM Occurs y, Word z
                                                                    WHERE x.did=y.did and y.wid=z.wid
                                                                    and z.word = u.word))

      Hint: Try to express what this query does in plain English before constructing the query plan.

  3. [25 points] (based on Exercise 15.6.2, pp. 745 in the textbook) Suppose B(R) = 10,000 and T(R) = 500,000. The table is clustered. Let there be an index on R.a, and let V(R,a) = k, for some number k. Give the cost of σa=0(R) as a function of k, under the following circumstances. You may neglect the disk I/O's needed to access the index itself.
    1. The index is clustered (the book says the index is clustering; it is the same thing).
    2. The index is not clustered.
    3. No index is used.

  4. [25 points] Consider the two queries below:
    -- q1
    select c.mid, COUNT(*)
    from Movie m, Casts c
    where m.id = c.mid
    group by c.mid

    -- q2
    select m.name, COUNT(*)
    from Movie m, Casts c
    where m.id = c.mid
    group by m.name

    1. Run these two queries on the IMDB database on IISQLSRV; one is about twice as fast as the other. Examine the query plans for these two queries, and explain the difference in running time.
    2. The IMDB_FOREIGNKEYS database is identical to IMDB but has foreign keys declared in Casts. Examine the plans for the two queries on the IMDB_FOREIGNKEYS database, and explain which one changed from the IMDB database and why.