June 10, 2002
2:30pm – 4:20pm
Time: 1h50’
Points: 100
This is an open book exam (solution)
1. [25points] A document d is a set of words, d = {w1, ..., wn}. We ignore the
word order in the document. Consider a database of French
documents, English documents, and a French/English dictionary. The
dictionary associates a confidence factor between 0 and 1 to a
pair consisting of a French and an English word. The schema is:
FrenchDocs(fID,
fWord)
EnglishDocs(eID,
eWord)
Dictionary(fWord,
eWord, confidence)
Each document has a unique identifier (either an eID or fID).
Thus, given a document identifier, the corresponding document is
the collection of words associated with that identifier.
Write SQL queries to answer each of the following:
a. [5points] For
each French document, count how many words it contains.
b. [5points] Find all other English documents that have at least 100
words in common with
the English document eID = "e424".
c. [5points] Given
two words, fWord, eWord, define their correlation,
c(fWord, eWord) to be the unqiue number c s.t. (fWord,
eWord, c) is in
Dictionary, or 0 if they don't occur in the dictionary. Given two
documents d1 = {w1, ..., wm} and d2 = {v1, ..., vn}, define their
correlation to be: Si,j(c(wi,vj))/Ö(m*n). We say that fID is a
translation of eID if their correlation is >= 0.9. Write a SQL
query that retrieves all French documents that are translations of
"e424". Assume that the DBMS supports sqrt for computing the square root.
d. [5points] Given a French word w, define its translation, w' = t(w) to be
the English word s.t. (w, w', 1.0) is in Dictionary: we assume that
for each w there exists at most one such w'. Given a document D =
{w1, ..., wn}, define its translation to be t(D) = {t(w1), ...,
t(wn)} (there may be fewer than n words, if some of the t(wi)
values are undefined). Write a SQL query that inserts an English
translation of the French document "f822" into the database. Call
the new English document "e916".
e. [5points] The following conjecture has been suggested by an expert
familiar with the application domain: "all English documents that
contain the words 'vacation' and 'summer' also contain the words
'hiking' and 'biking'. Write a SQL query to prove or disprove this
conjecture. Explain how the result of the SQL query can be used to
prove or disprove the conjecture.
2. [15points] We have the following DTD:
<!ELEMENT vacation (region*)>
<!ELEMENT region (name, package*)>
<!ELEMENT package (hotel, airline, month, price)>
(all other elements are assumed to be #PCDATA)
a. [5points] write XPath expressions to return the following:
- return all package elements that fly "Northwest Airlines" and
are under $2000. Assume that prices are expressed in
integers, i.e. a test "<= 2000" should suffice.
- return all hotel elements in the "North West" region that are
available in "December"
It is OK to include duplicates in the answers (XPath does not
eliminate duplicates).
b. [10points] write an XQuery expression that translates the document into the
following structure:
<!ELEMENT vacation
(hotel*)>
<!ELEMENT hotel (name,
package*)>
<!ELEMENT package (region,
airline, month, price)>
Your result should have a single entry for each hotel, even if that
hotel occurs under several packages, or under several regions in the input.
3. [15points] Consider a graph with colored edges, G(x,c,y), where x,y are nodes
and c is a color.
a. [5points] We say that x is a "red" node if all its outgoing edges are
"red". Write a FO query that returns all red nodes.
b. [5points] If (x,c,y) is in G, then we say that y is a c-neighbor of x.
Write an FO query that returns all nodes where all "blue"-neighbors
have at least one "green"-neighbor that is a "red" node.
c. [5points] Write a datalog program to compute the pairs of nodes (x,y)
connected by a path in which all edges have the same color. Your program should
compute a binary predicate Answer(x,y) consisting of all pairs (x,y) in the answer.
4. [25points]
a. [10points] Assume a B+ tree with d=1, i.e. each node has between 1 and 2
keys. Starting
from an empty tree, insert the keys 1, 2, 3, ..., 16, in this order.
Show at least the following four
trees:
after inserting 5, after
inserting 7, after inserting 14, after inserting 16.
b. [5points] We sort a file using multi-way merge-join. The main memory can
store M = 101 pages. A record is as large as one block (=page).
There are 10000 records, with the following keys (in this order):
1 2 ... 99 100000 101 102
...199 200000 201 202 ...299 300000 ... 9901 9902 ... 9999 10000000
These are all keys 1, 2, ..., 10000 in sequence, with every 100th
key multiplied by 1000. The sorted file will be:
1 2 ... 99 101 102 ...199 201
202 ...299... 9901 9902 ... 9999 100000 200000 300000 … 10000000
How many times do we need to read and
write each record
from/to disk if
(i) we use quicksort during run formation:
(ii)
we use replacement selection during run formation:
c. [10points] Consider two tables, R(A,B) and S(C,D), and the selection-join operation
sA=v(R)
´B=C
S. Assume that a record is a large
as a block and that
T(R)=B(R)=B(S)=10000. There is a
clustered index on S.C, and we need four disk accesses to lookup a
value in that index (including the disk access to get the record in
S). We consider the following methods for the join:
- block nested loop join
- partitioned hash-join
- index join
In each of the
four cases below, indicate the best access method its cost.
|
V(R,A) = 1 |
V(R,A) = 10000 |
M = 101 |
|
|
M = 5002 |
|
|
5. [20points] Consider the following relational schema:
Product(name, manufacturer, city)
The following XQuery specification defines an XML view over the
relational schema. Here $db denotes a canonical relational view of
the relational schema:
<sales-by-city>
let $c-all := distinct-values($db/Product/tuple/city/text())
for $c in $c-all
return
<city>
<name> { $c } </name>
{ let $m-all := distinct-values($db/Product/tuple[city/text()=$c]/manufacturer/text())
for $m in $m-all
return
<manufacturer>
<name> $m </name>
{ for $p in $db/Product/tuple[city/text()=$c][manufacturer/text()=$m]
return <product> $p/name/text() </product>
}
</manufacturer>
}
</city>
</sales-by-city>
a. [5points] write the DTD of the XML view.
b. [15points]
translate the following XPath queries into SQL.
Your SQL queries have to return the
same number of occurrences,
i.e. if "Gizmo Plus" occurs three times in the result of the
second XPath expression, then it must also occur three times in
the SQL's answer. The order in your answer is irrelevant.
(i) [5points] /sales-by-city/city
(ii)
[5points]
/sales-by-city/city[name/text()="Seattle"]//product/text()
(iii) [5points]
/sales-by-city/city/manufacturer/name/text()