CSEP 544 Homework 5
Objectives
To understand query execution, recovery and concurrency control, datalog, and provenance.
Reading Assignments
Required: lecture notes, Chapters 16,17,18 from Ramakrishan&Gehrke (R&G). Optional: Chapter 18 from Database Systems: the Complete Book (DS:CB). Lecture notes on datalog and on provenance.
Number of points:
100
Due Date
March 10, 2014, at 11:00 pm in the dropbox
- For each of the following
schedules:
a) r1(A); r2(A); r3(B);
w1(A); r2(C); r2(B);
w2(B); w1(C);
b) r1(A); w1(B); r2(B);
w2(C); r3(C); w3(A);
d) r1(A); r2(A); w1(B);
w2(B); r1(B); r2(B);
w2(C); w1(D);
Answer the following questions:
i. Draw the precedence graph for the schedule.
ii. Indicate whether the schedule conflict-serializable. If so, then indicate all the equivalent serial schedules.
- Fix a schedule S. Say that a transaction T precedes
a transaction U every action of T
precedes every action of U. Note that if T and
U are the only transactions in S, then S is the serial
schedule (T, U). However, if S involves transactions other
than T and U, then S might not be a serial schedule, and
in fact, may not even be conflict-serializable, because of the effect of other transactions. Give an example of a schedule S
such that:
i. In S, T1 precedes T2, and
ii. S is conflict-serializable, but
iii. In every serial schedule conflict-equivalent to S,
T2 precedes T1.
- Consider a concurrency-control system based on time stamps. Below are several sequences of events,
including start events, where st1; means that transaction
T1 starts. These sequences represent real time, and
the timestamp scheduler will allocate timestamps to transactions
in the order of their starts. Tell what happens as each executes. You need to indicate the actions below that case the transaction to be rolled back or to be delayed.
a) st1; st2; r1(A); r2(B);
w2(A); w1(B);
b) st1; r1(A); st2; w2(B);
r2(A); w1(B);
c) st1; st2; st3;
r1(A); r2(B); w1(C);
r3(B); r3(C); w2(B);
w3(A);
d) st1; st3; st2;
r1(A); r2(B); w1(C);
r3(B); r3(C); w2(B);
w3(A);
- Consider a concurrency-control system based on validation. In the following sequence of events, we use
Ri(X); to mean "transaction Ti starts, and its read
set is the list of database elements X." Also, Vi
means "Ti attempts to validate," and Wi(X); means
that "Ti finishes, and its write set was X." Tell
what happens when each sequence is processed by a validation-based
scheduler.
a) R1(A, B); R2(B, C); V1; R3(C, D);
V3; W1(A); V2; W2(A);
W3(B);
b) R1(A, B); R2(B, C); V1; R3(C, D);
V3; W1(A); V2; W2(A);
W3(D);
c) R1(A, B); R2(B, C); V1; R3(C, D);
V3; W1(C); V2; W2(A);
W3(D);
- (Exercise 17.2. R&G pp. 574) Consider the following classes of schedules: serializable (S), conflict-serializable (CS), and view-serializable (VS). Recall that their definitions consider only the committed transactions (i.e. they ignore aborted transactions) and that the implications CS-->VS-->S hold.
Consider also the following classes of schedules: recoverable (R), avoid-cascading-aborts (ACA), and strict (S). They do depend on commit/abort actions and the implications S-->ACA-->R hold. For each of the schedules below, indicate the strictness class they belong to,. For each schedule you need to indicate two classes, e.g. S, ACA, or VS, R, or NONE,NONE, or S,n/a (if the commit/abort operations are missing).
1. T1:R(X), T2:R(X), T1:W(X), T2:W(X)
2. T1:W(X), T2:R(Y), T1:R(Y), T2:R(X)
3. T1:R(X), T2:R(Y), T3:W(X), T2:R(X), T1:R(Y)
4. T1:R(X), T1:R(Y), T1:W(X), T2:R(Y), T3:W(Y), T1:W(X), T2:R(Y)
5. T1:R(X), T2:W(X), T1:W(X), T2:Abort, T1:Commit
6. T1:R(X), T2:W(X), T1:W(X), T2:Commit, T1:Commit
7. T1:W(X), T2:R(X), T1:W(X), T2:Abort, T1:Commit
8. T1:W(X), T2:R(X), T1:W(X), T2:Commit, T1:Commit
9.T1:W(X), T2:R(X), T1:W(X), T2:Commit, T1:Abort
10. T2:R(X), T3:W(X), T3:Commit, T1:W(Y), T1:Commit, T2:R(Y), T2:W(Z), T2:Commit
11. T1:R(X), T2:W(X), T2:Commit, T1:W(X), T1:Commit, T3:R(X), T3:Commit
12. T1:R(X), T2:W(X), T1:W(X), T3:R(X), T1:Commit, T2:Commit, T3:Commit
- R&G (pp. 598)
- Exercise 18.4 Consider the execution shown in the following table (same as Fig. 18.7 in the book):
LSN |
Action |
pLSN |
uLSN |
00 |
update: T1 writes P2 |
|
|
10 |
update: T1 writes P1 |
|
|
20 |
update: T2 writes P5 |
|
|
30 |
update: T3 writes P3 |
|
|
40 |
T3 commit |
|
|
50 |
update: T2 writes P5 |
|
|
60 |
update: T2 writes P3 |
|
|
70 |
T2 abort |
|
|
|
|
|
|
|
|
|
|
|
|
|
|
Expand the figure to show prevLSN and undonextLSN values. Then show the actions taken to rollback transaction T2. Finally, show the log after T2 is rolled back, including all prevLSN and undonextLSN values in the log records.
- Exercise 18.5 Consider the execution shown in in the table below (same as Figure 18.8 in the book):
LSN |
Action |
pLSN |
uLSN |
00 |
begin_checkpoint |
|
|
10 |
end_checkpoint |
|
|
20 |
update: T1 writes P1 |
|
|
30 |
update: T2 writes P2 |
|
|
40 |
update: T3 writes P3 |
|
|
50 |
T2: commit |
|
|
60 |
update: T3 writes P2 |
|
|
70 |
T2: end |
|
|
80 |
update: T1 writes P5 |
|
|
90 |
T3: abort |
|
|
|
CRASH, RESTART |
|
|
In addition, the system crashes during recovery after writing two log records to stable storage and again after writing another two log records.
1. What is the value of the LSN stored in the master log record?
2. What is done during Analysis?
3. What is done during Redo?
4. What is done during Undo?
5. Show the log when recovery is complete, including all non-null prevLSN and undonextLSN values in the log records.
- Consider the following EDB predicates:
Female(x)
Male(x) -- everybody in the database is either Female or Male
Parent(x,y) -- x is a parent of y
Write stratified-datalog¬ programs to compute each of the queries below:
- Find the "Mitochondrial Eve", also known as "ancestral mother". This is a female x that is the great-great-... grand-mother of everyone in the database. You may assume that the database contains exactly one mitochondrial Eve. The real mitochondrial eve lived about 150,000 years ago http://en.wikipedia.org/wiki/Mitochondrial_Eve
- Two individuals x1, x2 are of the "same generation" if they have a common parent, or if they have parents in the same generation. Compute all individuals in the same generation with Alice.
- Now assume that the database represents a royal dynasty. In addition to the EDBs above, we also have:
FirstKing(x) -- the first king of the dynasty
Yob(x,year) -- year of birth of x
Alive(x)
Write a program to compute the current king (assume the dynasty follows the salic law http://en.wikipedia.org/wiki/Salic_law under which only males can become kings). The inheritance rule is: the succession goes to the oldest surviving son of the king, or, if there is none, then the rule applies to the male brother of the king, and so on. For example, if the ancestral King had two sons y and z, where y was born before z, and if no male descendants of y is alive, then the next king is a male descendant of z. In general, if x is an older brother of y, then x and all his living successors take precedence in the succession to y and his living successors.
- Incremental view maintenance. For each of the queries Q below, compute Delta(Q) assuming that we insert the tuples Delta(R) in the relation R. Assume that only the relation R is updated.
Q1(x,y) = S(x,z), R(z,v), T(v,y)
Q2(x,y) = R(x,z), R(z,y)
Q3(x,y) = S(x,z), R(z,v), T(v,w), R(w,s), K(s,y)
- Suppose B(R) = 10,000 and T(R) = 500,000. There is an index on R.a, and let V(R,a) = k, for some number k. Give the cost of σa=0(R) as a function of k, under the following circumstances. You may neglect the disk I/O's needed to access the index itself.
- The index is clustered
- The index is not clustered
- No index is used
- Find the values k that separate the cases above. Your answer should be something like this: "for k < 44 an unclustered index is better, while for k > 44 a clustered index is better".
- Consider the following query:
SELECT *
FROM R, S, T, U
WHERE R.A='abc' and R.B=S.B and S.C=T.C and T.D=U.D and U.E='efg';
Estimate the number of tuples returned by the query, assuming the following statistics on the database:
T(R) = 1M, V(R,A) = 100, V(R,B) = 1000
T(S) = 1M, V(S,B) = 100, V(S,C) = 1000
T(T) = 1M, V(T,C) = 100, V(T,D) = 1000
T(U) = 1M, V(U,D) = 100, V(U,E) = 1000
- Provenance. Consider a relational database schema R(A); S(B; C). All queries mentioned below are assumed to be monotone queries.
- One of the answers a of a relational query Q has the provenance
polynomial x1y12 + 2x2y1y2, where x1, x2 are annotations of two tuples in R and y1, y2 are annotations of two tuples in S.
- Assume that the input relations have no duplicates. If we evaluate the query under bag semantics, how many copies of a will be in the query's
answer ?
- Assume that each tuple in the input relations occurs exactly twice. If we evaluate Q under bag semantics, how many copies of a are there in
the query's answer ?
- What is the smallest number of tuples that need to be removed from the database instance in order to remove a from the answer to Q ?
- Here we evaluate the query under set semantics. Suppose that the tuple
annotated with y2 was incorrect (it contains some incorrect value),
and should not be in the database. Is the answer a correct, or did it
become incorrect ?
- Consider the following database instance:
R =
S=
A |
B |
|
a1 |
b1 |
y1 |
a1 |
b2 |
y2 |
a2 |
b2 |
y3 |
For each
polynomial below, write a query that has an answer whose provenance is that polynomial. Your query may include inequality operators, <, >, !=.
x1y1+x1y2+x2y3
x1y12+x1y22+x1y2y3+x2y2y3+x2y32
x12y12+x12y22+2x1x2y2y3+x22y32