CSE 544 Homework 4

CSE 544 Homework 2

Due: Friday, June 4th

Points: 100

1.    [15 points] We have n+1 relations, with the following schemas:

       S(C₁,C₂,..., C_n), R₁(A₁,B₁),..., R_n(A_n,B_n)

Consider two join expressions:

     E = R₁ Join_B1=A2 R₂ Join_B2=A3 . . . R_n-1 Join_Bn-1=An   R_n

     F = S Join_C1=A1 R₁ Join_C2=A2 R₂ . . . R_n-1 Join_Cn=An R_n

E is a chain join, and F is a star join. Assume that we restrict the joins plans only to join trees without cartesian products. We distinguish between a plan A Join B   and a plan B Join A. Compute how many plans exists for E and for F, and how many subplans will be inspected by the dynamic programming algorithm in the case where only left-linear join trees are allowed, and in the case where arbitrary bushy trees are allowed. Your answers should consists of eight functions in n, such as n!, or nⁿ, or 5ⁿ. You may want to consult the Catalan numbers, e.g. in Concrete Mathematics, by Graham, Knuth, and Patashnik, Example 4 on page 357.

2. [15 points] Consider three relations: R(x,y), S(x), U(x,y). Write a query qa(x) that computes the active domain of a database with that schema. Then, for each of the FO queries below indicate whether they are safe or not (i.e. domain independent), and whether they return a finite answer or not.

a. {x | S(x) Ù " y.(not R(x,y))}

b. { x | S(x) Ù ("y. (R(x,y) Þ $ z.(S(z) or U(y,z)))) }

c. { x | $y.(S(y) Þ "z. (R(x,y) and U(y,z))) }

d. { x | S(x) Ù"y.(S(y) ÙR(x,y)) }

e. { x | S(x) Ù"y.(U(x,y) Ú"z. (not R(y,z))) }

f. { (x,y) | $z. (R(x,z) ÚU(z,y)) }

3. [15 points] Let T(x,y,z) and L(x) be two tables representing a tree: a triple (x,y,z) in T says that x is the parent of y and z, while a node x in L indicates that x is a leaf.

a. We say that two nodes u, v are on the same level in the tree if either (a) u and v have the same parent, or (b) their parents are on the same level. Write a datalog query that returns all nodes that are on the same level as node a (here a is a constant).

b. Alice and Bob play the following pebble game on the tree T. Alice places the pebble on the root node, represented by the constant r. Next, Bob moves the pebble to one of the children of r, call it x₁. Next, Alice moves the pebble to one of the children of x₁ call it x₂. The game continues until the pebble reaches a leaf, x. If A(x) is true then Alice wins, otherwise Bob wins. Here we assume that A(x) is a predicate on the nodes meaning “x is a leaf and Alice wins at x”. Write a datalog query that checks if Alice has a winning strategy.

4.    [20 points] Hash functions Mr. Frumble decided to implement his favorite hash function as follows:

            hash(s₁s₂...s_k, n) = (∑_i=1,n s_i ) mod n

An implementation in Visual C++ is given here: hash.cpp
Most major database systems use this hash function, so Mr. Frumble thought he won't be fired for implementing it this way. In this application he has to match a large number of financial transactions with a set of 10000 bank accounts, and wants to use the hash function to partition the accounts into 100 buckets. "That should give me about 100 accounts in each bucket, give or take a few", thinks Mr. Fumble, "and for each new transaction I need to search the matching bank account only in one bucket, so I should be able to find the account after only 100 comparisons, on average". Mr. Frumble has a point, since the accounts are really very uniformly distributed: namely, they are the following strings::

    A0000
    A0001
    A0002
    . . . .
    A9998
    A9999

That is, each account starts with the letter 'A', followed by four digits. Each of the ten thousand possible combinations of digits occurs exactly once. But, to Mr. Frumble's surprise, the application runs much slower than he expects. Each new transaction seems to require significantly more than 100 comparisons.

a. Help Mr. Frumble understand what is going on, by drawing the histogram of the bucket sizes. This is a graph whose Ox axis has values 0, 1, 2, ..., 99 corresponding to the buckets, and the Oy axis has values between 0 and 10000, showing the number of accounts in that bucket. You have a turn in a graph generated by any tool (say Excel, or gnuplot, or mathlab, etc). To generate the data for this graph you may either write a C++ program (you may use the hash function hash.cpp) or you may derive a mathematical formula that computes the size of each bucket then run it on Excel or on any other tool. Please turn in any auxiliary material that you used to generate the graph.

b. Using the data you collected at point (a), compute how much slower Mr. Frumble's application runs compared to what he expected. The running time for the application is quadratic in the number of items in a bucket: that is, if the buckets have 4, 52, 9, 125, 58, ... items each, then the running time is 4²+52²+9²+125²+58²+... Mr. Frumble expects a running time of 100²+100²+…+100² since he expected all buckets to have the same size. Compute how much larger the real running time is compared to what Mr. Frumble expects.

c. Design a better hash function that would partition Mr. Frumble's accounts more uniformly, and show the new histogram. Compute the running time at point (b) for your new hash function. Notice that there is no best solution here, feel free to use your creativity. You have to turn in both a program implementing the new hash function, a histogram, and a number indicating the running time.

5. [20 points] Query containment.

a. Indicate for each pair of queries q, q' below, whether q Í q'. If the answer is yes, provide a proof; if the answer is no, give a database instance I on which q(I) Ë q'(I).

i. q(x) :- R(x,y), R(y,z), R(z,x)
q'(x) :- R(x,y), R(y,z), R(z,u), R(u,v), R(v,z)

ii. q(x,y) :- R(x,u,u), R(u,v,w), R(w,w,y)
q'(x,y) :- R(x,u,v), R(v,v,v), R(v,w,y)

iii. q() :- R(u,u,x,y),R(x,y,v,w),v¹w
q'() :- R(u,u,x,y), x¹y

iv. q(x) :- R(x,y), R(y,z),R(z,v)
q'(x) :- R(x,y), R(y,z), y¹z

b. Let q₁(x) :- R(x,y),R(y,z),R(z,u), q₂(x) :- R(x,y),R(y,z). Notice that q₁ Í q₂. Give an example of a conjunctive query q such that q₁ Ì q and q Ì q₂. Here q₁ Ì q means q₁ Í q and not q Í q₁.

c. Consider the following two queries:
q₁(x) :- R(x,y),R(y,z),R(a,z)
q₂(x) :- R(x,y), R(y,z),R(z,u), R(y,b)
Find two queries q and q' such that the following four conditions hold simultaneously:
q Í q₁, q Í q₂, q₁ Í q', q₂ Í q'.
You should choose q and q' as "tight" as possible.

6. [15 points] For each statement below indicate whether it is true or false. You do not have to provide any proof. (Note: the answer to one of the statements below has a difficult proof. You don't have to find the proof in the literature or understand it, but rely on your intuition to provide a true/false answer).

a. Every query in FO has a data complexity which is in PTIME

b. All queries in FO are monotone.

c. The query complexity of conjunctive queries is NP complete.

d. Every query in datalog^Ø is also in FO.

e. There exists a query in FO that is not expressible in datalog.

f. If a query can be expressed in FO and also in datalog, then it can be expressed in CQ (= conjunctive queries).