CSEP 544 Homework 5

Objectives
To understand query execution, recovery and concurrency control, datalog, and provenance.

Reading Assignments
Required: lecture notes, Chapters 16,17,18 from Ramakrishan&Gehrke (R&G). Optional: Chapter 18 from Database Systems: the Complete Book (DS:CB). Lecture notes on datalog and on provenance.

Number of points:
100

Due Date
March 10, 2014, at 11:00 pm in the dropbox

For each of the following schedules:
Answer the following questions:
Fix a schedule S. Say that a transaction T precedes a transaction U every action of T precedes every action of U. Note that if T and U are the only transactions in S, then S is the serial schedule (T, U). However, if S involves transactions other than T and U, then S might not be a serial schedule, and in fact, may not even be conflict-serializable, because of the effect of other transactions. Give an example of a schedule S such that:
Consider a concurrency-control system based on time stamps. Below are several sequences of events, including start events, where st₁; means that transaction T₁ starts. These sequences represent real time, and the timestamp scheduler will allocate timestamps to transactions in the order of their starts. Tell what happens as each executes. You need to indicate the actions below that case the transaction to be rolled back or to be delayed.
Consider a concurrency-control system based on validation. In the following sequence of events, we use R_i(X); to mean "transaction T_i starts, and its read set is the list of database elements X." Also, V_i means "T_i attempts to validate," and W_i(X); means that "T_i finishes, and its write set was X." Tell what happens when each sequence is processed by a validation-based scheduler.
(Exercise 17.2. R&G pp. 574) Consider the following classes of schedules: serializable (S), conflict-serializable (CS), and view-serializable (VS). Recall that their definitions consider only the committed transactions (i.e. they ignore aborted transactions) and that the implications CS-->VS-->S hold.
Consider also the following classes of schedules: recoverable (R), avoid-cascading-aborts (ACA), and strict (S). They do depend on commit/abort actions and the implications S-->ACA-->R hold. For each of the schedules below, indicate the strictness class they belong to,. For each schedule you need to indicate two classes, e.g. S, ACA, or VS, R, or NONE,NONE, or S,n/a (if the commit/abort operations are missing).

R&G (pp. 598)

Exercise 18.4 Consider the execution shown in the following table (same as Fig. 18.7 in the book):

LSN	Action	pLSN	uLSN
00	update: T1 writes P2
10	update: T1 writes P1
20	update: T2 writes P5
30	update: T3 writes P3
40	T3 commit
50	update: T2 writes P5
60	update: T2 writes P3
70	T2 abort

Expand the figure to show prevLSN and undonextLSN values. Then show the actions taken to rollback transaction T2. Finally, show the log after T2 is rolled back, including all prevLSN and undonextLSN values in the log records.

Exercise 18.5 Consider the execution shown in in the table below (same as Figure 18.8 in the book):

LSN	Action	pLSN	uLSN
00	begin_checkpoint
10	end_checkpoint
20	update: T1 writes P1
30	update: T2 writes P2
40	update: T3 writes P3
50	T2: commit
60	update: T3 writes P2
70	T2: end
80	update: T1 writes P5
90	T3: abort
	CRASH, RESTART

In addition, the system crashes during recovery after writing two log records to stable storage and again after writing another two log records.
1. What is the value of the LSN stored in the master log record?
2. What is done during Analysis?
3. What is done during Redo?
4. What is done during Undo?
5. Show the log when recovery is complete, including all non-null prevLSN and undonextLSN values in the log records.

Consider the following EDB predicates:

Female(x)
Male(x) -- everybody in the database is either Female or Male
Parent(x,y) -- x is a parent of y

Write stratified-datalog¬ programs to compute each of the queries below:
1. Find the "Mitochondrial Eve", also known as "ancestral mother". This is a female x that is the great-great-... grand-mother of everyone in the database. You may assume that the database contains exactly one mitochondrial Eve. The real mitochondrial eve lived about 150,000 years ago http://en.wikipedia.org/wiki/Mitochondrial_Eve
2. Two individuals x1, x2 are of the "same generation" if they have a common parent, or if they have parents in the same generation. Compute all individuals in the same generation with Alice.
3. Now assume that the database represents a royal dynasty. In addition to the EDBs above, we also have:
  
  FirstKing(x) -- the first king of the dynasty
  Yob(x,year) -- year of birth of x
  Alive(x)
  
  Write a program to compute the current king (assume the dynasty follows the salic law http://en.wikipedia.org/wiki/Salic_law under which only males can become kings). The inheritance rule is: the succession goes to the oldest surviving son of the king, or, if there is none, then the rule applies to the male brother of the king, and so on. For example, if the ancestral King had two sons y and z, where y was born before z, and if no male descendants of y is alive, then the next king is a male descendant of z. In general, if x is an older brother of y, then x and all his living successors take precedence in the succession to y and his living successors.
Incremental view maintenance. For each of the queries Q below, compute Delta(Q) assuming that we insert the tuples Delta(R) in the relation R. Assume that only the relation R is updated.

Q1(x,y) = S(x,z), R(z,v), T(v,y)
Q2(x,y) = R(x,z), R(z,y)
Q3(x,y) = S(x,z), R(z,v), T(v,w), R(w,s), K(s,y)
Suppose B(R) = 10,000 and T(R) = 500,000. There is an index on R.a, and let V(R,a) = k, for some number k. Give the cost of σ_a=0(R) as a function of k, under the following circumstances. You may neglect the disk I/O's needed to access the index itself.
1. The index is clustered
2. The index is not clustered
3. No index is used
4. Find the values k that separate the cases above. Your answer should be something like this: "for k < 44 an unclustered index is better, while for k > 44 a clustered index is better".
Consider the following query:

SELECT *
FROM R, S, T, U
WHERE R.A='abc' and R.B=S.B and S.C=T.C and T.D=U.D and U.E='efg';

Estimate the number of tuples returned by the query, assuming the following statistics on the database:
T(R) = 1M, V(R,A) = 100, V(R,B) = 1000
T(S) = 1M, V(S,B) = 100, V(S,C) = 1000
T(T) = 1M, V(T,C) = 100, V(T,D) = 1000
T(U) = 1M, V(U,D) = 100, V(U,E) = 1000

Provenance. Consider a relational database schema R(A); S(B; C). All queries mentioned below are assumed to be monotone queries.

One of the answers a of a relational query Q has the provenance polynomial x₁y₁² + 2x₂y₁y₂, where x₁, x₂ are annotations of two tuples in R and y₁, y₂ are annotations of two tuples in S.
1. Assume that the input relations have no duplicates. If we evaluate the query under bag semantics, how many copies of a will be in the query's
  answer ?
2. Assume that each tuple in the input relations occurs exactly twice. If we evaluate Q under bag semantics, how many copies of a are there in
  the query's answer ?
3. What is the smallest number of tuples that need to be removed from the database instance in order to remove a from the answer to Q ?
4. Here we evaluate the query under set semantics. Suppose that the tuple annotated with y2 was incorrect (it contains some incorrect value), and should not be in the database. Is the answer a correct, or did it become incorrect ?

Consider the following database instance:

R =

A
a₁	x₁
a₂	x₂

A	B
a₁	b₁	y₁
a₁	b₂	y₂
a₂	b₂	y₃

For each polynomial below, write a query that has an answer whose provenance is that polynomial. Your query may include inequality operators, <, >, !=.

x₁y₁+x₁y₂+x₂y₃

x₁y₁²+x₁y₂²+x₁y₂y₃+x₂y₂y₃+x₂y₃²

x₁²y₁²+x₁²y₂²+2x₁x₂y₂y₃+x₂²y₃²