Name:__________________

CSE 544 --- Final Exam

June 10, 2002

2:30pm – 4:20pm

Time: 1h50’

Points: 100

This is an open book exam (solution)

1. [25points] A document d is a set of words, d = {w₁, ..., w_n}. We ignore the

word order in the document. Consider a database of French

documents, English documents, and a French/English dictionary. The

dictionary associates a confidence factor between 0 and 1 to a

pair consisting of a French and an English word. The schema is:

FrenchDocs(fID, fWord)

EnglishDocs(eID, eWord)

Dictionary(fWord, eWord, confidence)

Each document has a unique identifier (either an eID or fID).

Thus, given a document identifier, the corresponding document is

the collection of words associated with that identifier.

Write SQL queries to answer each of the following:

a. [5points] For each French document, count how many words it contains.

b. [5points] Find all other English documents that have at least 100

words in common with the English document eID = "e424".

c. [5points] Given two words, fWord, eWord, define their correlation,
c(fWord, eWord) to be the unqiue number c s.t. (fWord, eWord, c) is in

Dictionary, or 0 if they don't occur in the dictionary. Given two

documents d₁ = {w₁, ..., w_m} and d₂ = {v₁, ..., v_n}, define their

correlation to be: S_i,j(c(w_i,v_j))/Ö(m*n). We say that fID is a

translation of eID if their correlation is >= 0.9. Write a SQL

query that retrieves all French documents that are translations of

"e424". Assume that the DBMS supports sqrt for computing the square root.

d. [5points] Given a French word w, define its translation, w' = t(w) to be

the English word s.t. (w, w', 1.0) is in Dictionary: we assume that

for each w there exists at most one such w'. Given a document D =

{w₁, ..., w_n}, define its translation to be t(D) = {t(w₁), ...,

t(w_n)} (there may be fewer than n words, if some of the t(w_i)

values are undefined). Write a SQL query that inserts an English

translation of the French document "f822" into the database. Call

the new English document "e916".

e. [5points] The following conjecture has been suggested by an expert

familiar with the application domain: "all English documents that

contain the words 'vacation' and 'summer' also contain the words

'hiking' and 'biking'. Write a SQL query to prove or disprove this

conjecture. Explain how the result of the SQL query can be used to

prove or disprove the conjecture.

2. [15points] We have the following DTD:

<!ELEMENT vacation (region*)>

<!ELEMENT region (name, package*)>

<!ELEMENT package (hotel, airline, month, price)>

(all other elements are assumed to be #PCDATA)

a. [5points] write XPath expressions to return the following:

- return all package elements that fly "Northwest Airlines" and

are under $2000. Assume that prices are expressed in

integers, i.e. a test "<= 2000" should suffice.

- return all hotel elements in the "North West" region that are

available in "December"

It is OK to include duplicates in the answers (XPath does not

eliminate duplicates).

b. [10points] write an XQuery expression that translates the document into the

following structure:

<!ELEMENT vacation (hotel*)>

<!ELEMENT hotel (name, package*)>

<!ELEMENT package (region, airline, month, price)>

Your result should have a single entry for each hotel, even if that

hotel occurs under several packages, or under several regions in the input.

3. [15points] Consider a graph with colored edges, G(x,c,y), where x,y are nodes

and c is a color.

a. [5points] We say that x is a "red" node if all its outgoing edges are

"red". Write a FO query that returns all red nodes.

b. [5points] If (x,c,y) is in G, then we say that y is a c-neighbor of x.

Write an FO query that returns all nodes where all "blue"-neighbors

have at least one "green"-neighbor that is a "red" node.

c. [5points] Write a datalog program to compute the pairs of nodes (x,y)

connected by a path in which all edges have the same color. Your program should

compute a binary predicate Answer(x,y) consisting of all pairs (x,y) in the answer.

4. [25points]
a. [10points] Assume a B+ tree with d=1, i.e. each node has between 1 and 2

      keys. Starting from an empty tree, insert the keys 1, 2, 3, ..., 16, in this order.
      Show at least the following four trees:
              after inserting 5, after inserting 7, after inserting 14, after inserting 16.

b. [5points] We sort a file using multi-way merge-join. The main memory can

store M = 101 pages. A record is as large as one block (=page).

There are 10000 records, with the following keys (in this order):

1 2 ... 99 100000 101 102 ...199 200000 201 202 ...299 300000 ... 9901 9902 ... 9999 10000000

These are all keys 1, 2, ..., 10000 in sequence, with every 100th

key multiplied by 1000. The sorted file will be:

1 2 ... 99 101 102 ...199 201 202 ...299... 9901 9902 ... 9999 100000 200000 300000 … 10000000

How many times do we need to read and

write each record from/to disk if

(i) we use quicksort during run formation:

(ii) we use replacement selection during run formation:

c. [10points] Consider two tables, R(A,B) and S(C,D), and the selection-join operation

s_A=v(R) ´_B=C S. Assume that a record is a large as a block and that
T(R)=B(R)=B(S)=10000. There is a

clustered index on S.C, and we need four disk accesses to lookup a

value in that index (including the disk access to get the record in

S). We consider the following methods for the join:

- block nested loop join

- partitioned hash-join

- index join

In each of the four cases below, indicate the best access method its cost.

	V(R,A) = 1	V(R,A) = 10000
M = 101
M = 5002

5. [20points] Consider the following relational schema:

Product(name, manufacturer, city)

The following XQuery specification defines an XML view over the

relational schema. Here $db denotes a canonical relational view of

the relational schema:

<sales-by-city>

let $c-all := distinct-values($db/Product/tuple/city/text())

for $c in $c-all

return

<city>

{ let $m-all := distinct-values($db/Product/tuple[city/text()=$c]/manufacturer/text())

for $m in $m-all

return

{ for $p in $db/Product/tuple[city/text()=$c][manufacturer/text()=$m]

return <product> $p/name/text() </product>

}

</manufacturer>

}

</city>

</sales-by-city>

a. [5points] write the DTD of the XML view.

b. [15points] translate the following XPath queries into SQL.
Your SQL queries have to return the same number of occurrences,

i.e. if "Gizmo Plus" occurs three times in the result of the

second XPath expression, then it must also occur three times in

the SQL's answer. The order in your answer is irrelevant.

(i) [5points] /sales-by-city/city

(ii) [5points] /sales-by-city/city[name/text()="Seattle"]//product/text()
(iii) [5points] /sales-by-city/city/manufacturer/name/text()