CSE
544 Homework 2
Due: Friday, June 4th
Points: 100
1.
[15
points] We have n+1 relations, with the following schemas:
S(C1,C2,...,
Cn), R1(A1,B1),..., Rn(An,Bn)
Consider two join expressions:
E = R1
JoinB1=A2 R2 JoinB2=A3 . . . Rn-1 JoinBn-1=An Rn
F = S JoinC1=A1 R1 JoinC2=A2 R2 . . . Rn-1 JoinCn=An Rn
E is a chain join, and F is a star join. Assume that we restrict the joins
plans only to join trees without cartesian products. We distinguish between a
plan A Join B and a plan B Join
A. Compute how many plans exists for E
and for F, and how many subplans will be inspected by the dynamic programming
algorithm in the case where only left-linear join trees are allowed, and in the
case where arbitrary bushy trees are allowed. Your answers should consists of
eight functions in n, such as n!, or nn,
or 5n. You may want to consult the Catalan numbers, e.g. in Concrete Mathematics, by Graham, Knuth, and Patashnik, Example 4 on page 357.
2.
[15
points] Consider three relations: R(x,y), S(x),
U(x,y). Write a query qa(x) that computes the active domain of a database with that schema. Then, for each of the FO queries below
indicate whether they are safe or not (i.e. domain independent), and whether
they return a finite answer or not.
a.
{x
| S(x) Ù
"
y.(not R(x,y))}
b.
{
x | S(x) Ù
("y.
(R(x,y) Þ
$ z.(S(z) or U(y,z)))) }
c.
{
x | $y.(S(y)
Þ "z.
(R(x,y) and U(y,z))) }
d.
{
x | S(x) Ù"y.(S(y)
ÙR(x,y))
}
e.
{
x | S(x) Ù"y.(U(x,y)
Ú"z.
(not R(y,z))) }
f.
{
(x,y) | $z.
(R(x,z) ÚU(z,y))
}
3.
[15
points] Let T(x,y,z) and L(x) be two tables representing a tree: a triple
(x,y,z) in T says that x is the parent of y and z, while a node x in L
indicates that x is a leaf.
a.
We
say that two nodes u, v are on the same
level in the tree if either (a) u and v have the same parent, or (b) their
parents are on the same level. Write a
datalog query that returns all nodes that are on the same level as node a (here a is a constant).
b.
Alice
and Bob play the following pebble game
on the tree T. Alice places the pebble
on the root node, represented by the constant r. Next, Bob moves the pebble to one of the children of r, call it x1. Next, Alice
moves the pebble to one of the children of x1 call it x2. The game continues until the pebble reaches a
leaf, x. If A(x) is true then Alice
wins, otherwise Bob wins. Here we assume that A(x) is a predicate on the nodes
meaning “x is a leaf and Alice wins at
x”. Write a datalog query that checks if Alice has a winning strategy.
4. [20
points] Hash functions Mr. Frumble decided to
implement his favorite hash function as follows:
hash(s1s2...sk,
n) = (∑i=1,n si ) mod n
An implementation in Visual C++ is given here: hash.cpp
Most major database systems use this hash function, so Mr. Frumble thought he
won't be fired for implementing it this way. In this application he has to
match a large number of financial transactions with a set of 10000 bank
accounts, and wants to use the hash function to partition the accounts into 100
buckets. "That should give me about 100 accounts in each bucket, give or
take a few", thinks Mr. Fumble, "and for each new transaction I need
to search the matching bank account only in one bucket, so I should be able to
find the account after only 100 comparisons, on average". Mr. Frumble has
a point, since the accounts are really very uniformly distributed: namely, they
are the following strings::
A0000
A0001
A0002
. . . .
A9998
A9999
That is, each account starts with the letter 'A', followed by four digits. Each
of the ten thousand possible combinations of digits occurs exactly once. But,
to Mr. Frumble's surprise, the application runs much slower than he
expects. Each new transaction seems to
require significantly more than 100 comparisons.
a. Help
Mr. Frumble understand what is going on, by drawing the histogram of the bucket
sizes. This is a graph whose Ox axis has values 0, 1, 2, ..., 99 corresponding
to the buckets, and the Oy axis has values between 0 and 10000, showing the
number of accounts in that bucket. You have a turn in a graph generated by any
tool (say Excel, or gnuplot, or mathlab, etc). To generate the data for this
graph you may either write a C++ program (you may use the hash function hash.cpp) or you may derive a mathematical formula that
computes the size of each bucket then run it on Excel or on any other tool.
Please turn in any auxiliary material that you used to generate the graph.
b. Using
the data you collected at point (a), compute how much slower Mr. Frumble's
application runs compared to what he expected. The running time for the
application is quadratic in the number of items in a bucket: that is, if the
buckets have 4, 52, 9, 125, 58, ... items each, then the running time is 42+522+92+1252+582+...
Mr. Frumble expects a running time of 1002+1002+…+1002
since he expected all buckets to
have the same size. Compute how much
larger the real running time is compared to what Mr. Frumble expects.
c.
Design
a better hash function that would partition Mr. Frumble's accounts more
uniformly, and show the new histogram. Compute the running time at point (b)
for your new hash function. Notice that there is no best solution here, feel
free to use your creativity. You have to turn in both a program implementing
the new hash function, a histogram, and a number indicating the running time.
5.
[20
points] Query containment.
a.
Indicate
for each pair of queries q, q' below, whether q Í q'. If the answer is yes, provide a proof; if the
answer is no, give a database instance I on which q(I) Ë
q'(I).
i.
q(x)
:- R(x,y), R(y,z), R(z,x)
q'(x) :- R(x,y), R(y,z), R(z,u), R(u,v), R(v,z)
ii.
q(x,y)
:- R(x,u,u), R(u,v,w), R(w,w,y)
q'(x,y) :- R(x,u,v), R(v,v,v), R(v,w,y)
iii.
q()
:- R(u,u,x,y),R(x,y,v,w),v¹w
q'() :- R(u,u,x,y), x¹y
iv.
q(x)
:- R(x,y), R(y,z),R(z,v)
q'(x) :- R(x,y), R(y,z), y¹z
b.
Let
q1(x) :- R(x,y),R(y,z),R(z,u), q2(x) :- R(x,y),R(y,z).
Notice that q1 Í
q2. Give an example of a
conjunctive query q such that q1 Ì q and q Ì q2. Here q1 Ì
q means q1 Í q
and not q Í
q1.
c.
Consider
the following two queries:
q1(x) :- R(x,y),R(y,z),R(a,z)
q2(x) :- R(x,y), R(y,z),R(z,u), R(y,b)
Find two queries q and q' such that the following four conditions hold
simultaneously:
q Í q1,
q Í q2,
q1 Í
q', q2 Í
q'.
You should choose q and q' as "tight" as possible.
6.
[15
points] For each statement below indicate whether it is true or false. You do not have to provide any proof. (Note:
the answer to one of the statements below has a difficult proof. You don't have
to find the proof in the literature or understand it, but rely on your
intuition to provide a true/false answer).
a.
Every
query in FO has a data complexity which is in PTIME
b.
All
queries in FO are monotone.
c.
The
query complexity of conjunctive queries is NP complete.
d.
Every
query in datalogØ
is also in FO.
e.
There
exists a query in FO that is not expressible in datalog.
f.
If
a query can be expressed in FO and also in datalog, then it can be expressed in
CQ (= conjunctive queries).