**CSE
544 Homework 2**

Due: Friday, June 4th

Points: 100

1.
[15
points] We have n+1 relations, with the following schemas:

S(C_{1},C_{2},...,
C_{n}), R_{1}(A_{1},B_{1}),..., R_{n}(A_{n},B_{n})

Consider two join expressions:

E = R_{1}
Join_{B1=A2} R_{2} Join_{B2=A3} . . . R_{n-1} Join_{Bn-1=An} R_{n}

F = S Join_{C1=A1} R_{1} Join_{C2=A2} R_{2} . . . R_{n-1} Join_{Cn=An} R_{n}

E is a chain join, and F is a star join. Assume that we restrict the joins
plans only to join trees without cartesian products. We distinguish between a
plan A Join B and a plan B Join
A. Compute how many plans exists for E
and for F, and how many subplans will be inspected by the dynamic programming
algorithm in the case where only left-linear join trees are allowed, and in the
case where arbitrary bushy trees are allowed. Your answers should consists of
eight functions in n, such as n!, or n^{n},
or 5^{n}. You may want to consult the Catalan numbers, e.g. in *Concrete Mathematics*, by Graham, Knuth, and Patashnik, Example 4 on page 357.

2.
[15
points] Consider three relations: R(x,y), S(x),
U(x,y). Write a query qa(x) that computes the active domain of a database with that schema. Then, for each of the FO queries below
indicate whether they are safe or not (i.e. domain independent), and whether
they return a finite answer or not.

a.
{x
| S(x) Ù
"
y.(not R(x,y))}

b.
{
x | S(x) Ù
("y.
(R(x,y) Þ
$ z.(S(z) or U(y,z)))) }

c.
{
x | $y.(S(y)
Þ "z.
(R(x,y) and U(y,z))) }

d.
{
x | S(x) Ù"y.(S(y)
ÙR(x,y))
}

e.
{
x | S(x) Ù"y.(U(x,y)
Ú"z.
(not R(y,z))) }

f.
{
(x,y) | $z.
(R(x,z) ÚU(z,y))
}

3.
[15
points] Let T(x,y,z) and L(x) be two tables representing a tree: a triple
(x,y,z) in T says that x is the parent of y and z, while a node x in L
indicates that x is a leaf.

a.
We
say that two nodes u, v are *on the same
level* in the tree if either (a) u and v have the same parent, or (b) their
parents are on the same level. Write a
datalog query that returns all nodes that are on the same level as node **a** (here **a** is a constant).

b.
Alice
and Bob play the following *pebble *game
on the tree T. Alice places the pebble
on the root node, represented by the constant **r**. Next, Bob moves the pebble to one of the children of **r**, call it x_{1}. Next, Alice
moves the pebble to one of the children of x_{1} call it x_{2}. The game continues until the pebble reaches a
leaf, x. If A(x) is true then Alice
wins, otherwise Bob wins. Here we assume that A(x) is a predicate on the nodes
meaning “x is a leaf and Alice wins at
x”. Write a datalog query that checks if Alice has a winning strategy.

4. [20
points] Hash functions Mr. Frumble decided to
implement his favorite hash function as follows:

hash(s_{1}s_{2}...s_{k},
n) = (∑_{i=1,n} s_{i} ) mod n

An implementation in Visual C++ is given here: hash.cpp

Most major database systems use this hash function, so Mr. Frumble thought he
won't be fired for implementing it this way. In this application he has to
match a large number of financial transactions with a set of 10000 bank
accounts, and wants to use the hash function to partition the accounts into 100
buckets. "That should give me about 100 accounts in each bucket, give or
take a few", thinks Mr. Fumble, "and for each new transaction I need
to search the matching bank account only in one bucket, so I should be able to
find the account after only 100 comparisons, on average". Mr. Frumble has
a point, since the accounts are really very uniformly distributed: namely, they
are the following strings::

A0000

A0001

A0002

. . . .

A9998

A9999

That is, each account starts with the letter 'A', followed by four digits. Each
of the ten thousand possible combinations of digits occurs exactly once. But,
to Mr. Frumble's surprise, the application runs much slower than he
expects. Each new transaction seems to
require significantly more than 100 comparisons.

a. Help
Mr. Frumble understand what is going on, by drawing the histogram of the bucket
sizes. This is a graph whose Ox axis has values 0, 1, 2, ..., 99 corresponding
to the buckets, and the Oy axis has values between 0 and 10000, showing the
number of accounts in that bucket. You have a turn in a graph generated by any
tool (say Excel, or gnuplot, or mathlab, etc). To generate the data for this
graph you may either write a C++ program (you may use the hash function hash.cpp) or you may derive a mathematical formula that
computes the size of each bucket then run it on Excel or on any other tool.
Please turn in any auxiliary material that you used to generate the graph.

b. Using
the data you collected at point (a), compute how much slower Mr. Frumble's
application runs compared to what he expected. The running time for the
application is quadratic in the number of items in a bucket: that is, if the
buckets have 4, 52, 9, 125, 58, ... items each, then the running time is 4^{2}+52^{2}+9^{2}+125^{2}+58^{2}+...
Mr. Frumble expects a running time of 100^{2}+100^{2}+…+100^{2
} since he expected all buckets to
have the same size. Compute how much
larger the real running time is compared to what Mr. Frumble expects.

c.
Design
a better hash function that would partition Mr. Frumble's accounts more
uniformly, and show the new histogram. Compute the running time at point (b)
for your new hash function. Notice that there is no best solution here, feel
free to use your creativity. You have to turn in both a program implementing
the new hash function, a histogram, and a number indicating the running time.

5.
[20
points] Query containment.

a.
Indicate
for each pair of queries q, q' below, whether q Í q'. If the answer is yes, provide a proof; if the
answer is no, give a database instance I on which q(I) Ë
q'(I).

i.
q(x)
:- R(x,y), R(y,z), R(z,x)

q'(x) :- R(x,y), R(y,z), R(z,u), R(u,v), R(v,z)

ii.
q(x,y)
:- R(x,u,u), R(u,v,w), R(w,w,y)

q'(x,y) :- R(x,u,v), R(v,v,v), R(v,w,y)

iii.
q()
:- R(u,u,x,y),R(x,y,v,w),v¹w

q'() :- R(u,u,x,y), x¹y

iv.
q(x)
:- R(x,y), R(y,z),R(z,v)

q'(x) :- R(x,y), R(y,z), y¹z

b.
Let
q_{1}(x) :- R(x,y),R(y,z),R(z,u), q_{2}(x) :- R(x,y),R(y,z).
Notice that q_{1} Í
q_{2}. Give an example of a
conjunctive query q such that q_{1} Ì q and q Ì q_{2}. Here q_{1} Ì
q means q_{1} Í q
and not q Í
q_{1}.

c.
Consider
the following two queries:

q_{1}(x) :- R(x,y),R(y,z),R(a,z)

q_{2}(x) :- R(x,y), R(y,z),R(z,u), R(y,b)

Find two queries q and q' such that the following four conditions hold
simultaneously:

q Í q_{1},
q Í q_{2},
q_{1} Í
q', q_{2} Í
q'.

You should choose q and q' as "tight" as possible.

6.
[15
points] For each statement below indicate whether it is true or false. You do not have to provide any proof. (Note:
the answer to one of the statements below has a difficult proof. You don't have
to find the proof in the literature or understand it, but rely on your
intuition to provide a true/false answer).

a.
Every
query in FO has a data complexity which is in PTIME

b.
All
queries in FO are monotone.

c.
The
query complexity of conjunctive queries is NP complete.

d.
Every
query in datalog^{Ø}
is also in FO.

e.
There
exists a query in FO that is not expressible in datalog.

f.
If
a query can be expressed in FO and also in datalog, then it can be expressed in
CQ (= conjunctive queries).