CSE 444 Final
December
13, 2002
8:30-10:20
a.m.
Name___________________________________
Total
number of points: 100
Total
time: 1h 50’
1.[30 points] A Web search engine maintains a database with
the following tables:
Webpage(url, word)
Dictionary(word, language)
A webpage is uniquely identified by its URL. The relation Webpage
represents a many-to-many relationship from the set of urls to the set
of words: notice that one url may contain multiple words,
and conversely, one word may occur in multiple urls. The table Dictionary consists of a
many-to-many relationship between words and languages. Here too, one word may belong to
multiple languages. We say that a webpage is written in language L if all
its words are in the language L (a page may happen to be in multiple
languages). Write SQL queries to answer the questions below:
a. [10 points] Compute for each url the total number of words in that webpage.
b. [10 points] Retrieve all pairs of distinct urls which have more than 1000 words in common.
c. [10 points] Retrieve the urls of all pages written in “French”.
2. [20 points] Consider two tables R(A,B) and S(B,C), where
all attributes are integers. Let Q1, Q2 be the following SQL queries:
Q1: select R.A, S.C
from R, S
where R.B=S.B
Q2: select R.A, sum(S.C)
from R, S
where R.B = S.B
group by R.A
Assume M = 102, B(R) = B(S) = 10000,
V(R,B) = V(S, B) = 100. Give two optimal physical query execution plans for Q1
and Q2, and indicate their cost. Your plans may contain one or more instances
of the physical operators from the list below. You do not need to consider any
other physical operators.
Nested loop join or block nested loop join
Main memory hash join
Partitioned hash join
Main memory hash group-by
Partitioned hash group-by
3.[10 points] Consider three tables R(A,B,C), S(A,B,C), and T(D, E). A is a key in R. Answer the following questions. If you answer “yes” you don’t have to justify your answer; if you answer “no” you need to give an example that justifies your answer. All relational algebra operators below are on sets, i.e. they eliminate duplicates.
Is A a key in PAB(R) ?
Is A a key in sC=55(R)
?
Is A a key in R È S ?
Is A a key in R – S ?
Is A a key in R⋈ B=D T
4. [20 points] Compute the optimal plan for R ⋈S⋈ T ⋈ U using the dynamic
programming algorithm, assuming the following:
B(R) = 500, B(S) = 200, B(T) = 600,
B(U) = 400
The size of a join is estimated as: B(A⋈ B) = 0.01 * B(A) * B(B)
The cost of a join is estimated to be the cost of the subplans plus the size of
the intermediate result(s).
5. [20 points] The SuperSQL database system performs undo
recovery without checkpointing. It stores its undo log file in a table, with
the following schema:
Log(N, T, E, V)
where N is the entry number (0, 1, 2, 3, …), T is the transaction
id, E is the element id, and V is the old value. A log entry of the form <T, E, V> is
simply represented by the tuple (N, T, E, V), where N is the entry number; here
E > 0. The log entries <START T>, <COMMIT T>, and <ABORT
T> are represented by a tuple (N, T, E, null), where E=-1 for START, E=-2
for COMMIT, and E=-3 for ABORT. For example, the log:
<START T1> |
<T1 X1 55> |
<START T2> |
<T2 X2 99> |
<COMMIT T1> |
. . . . |
Is represented by the table:
N |
T |
E |
V |
0 |
T1 |
-1 |
|
1 |
T1 |
X1 |
55 |
2 |
T2 |
-1 |
|
3 |
T2 |
X2 |
99 |
4 |
T1 |
-2 |
|
. . . |
|
|
|
Recall that each transaction starts and ends at most once,
i.e. a sequence <START T> … <COMMIT T> … <START T> … will not
occur in the log. Moreover, any update by the transaction T will occur between
its <START T> and <COMMIT T>, or between <START T> and
<ABORT T> respectively.
Write a SQL query that can be run during database recovery, after a system
crash. The answer to your query should return a table with two attributes, E,
and V, indicating which elements have to be updated with what values.
You should include each element E at most once in your answer: otherwise
it would be ambiguous how to update it.