CSEP 544 Homework 5
Objectives
To understand query execution, recovery and concurrency control, datalog, and provenance.
Reading Assignments
Required: lecture notes, Chapters 16,17,18 from Ramakrishan&Gehrke (R&G). Optional: Chapter 18 from Database Systems: the Complete Book (DS:CB). Lecture notes on datalog and on provenance.
Number of points:
100
Due Date
March 10, 2014, at 11:00 pm in the dropbox
  1. For each of the following schedules:

      a) r1(A); r2(A); r3(B); w1(A); r2(C); r2(B); w2(B); w1(C);
      b) r1(A); w1(B); r2(B); w2(C); r3(C); w3(A);
      d) r1(A); r2(A); w1(B); w2(B); r1(B); r2(B); w2(C); w1(D);

    Answer the following questions:

      i. Draw the precedence graph for the schedule.
      ii. Indicate whether the schedule conflict-serializable. If so, then indicate all the equivalent serial schedules.
  2. Fix a schedule S. Say that a transaction T precedes a transaction U every action of T precedes every action of U. Note that if T and U are the only transactions in S, then S is the serial schedule (T, U). However, if S involves transactions other than T and U, then S might not be a serial schedule, and in fact, may not even be conflict-serializable, because of the effect of other transactions. Give an example of a schedule S such that:

      i. In S, T1 precedes T2, and
      ii. S is conflict-serializable, but
      iii. In every serial schedule conflict-equivalent to S, T2 precedes T1.
  3. Consider a concurrency-control system based on time stamps. Below are several sequences of events, including start events, where st1; means that transaction T1 starts. These sequences represent real time, and the timestamp scheduler will allocate timestamps to transactions in the order of their starts. Tell what happens as each executes. You need to indicate the actions below that case the transaction to be rolled back or to be delayed.

      a) st1; st2; r1(A); r2(B); w2(A); w1(B);
      b) st1; r1(A); st2; w2(B); r2(A); w1(B);
      c) st1; st2; st3; r1(A); r2(B); w1(C); r3(B); r3(C); w2(B); w3(A);
      d) st1; st3; st2; r1(A); r2(B); w1(C); r3(B); r3(C); w2(B); w3(A);
  4. Consider a concurrency-control system based on validation. In the following sequence of events, we use Ri(X); to mean "transaction Ti starts, and its read set is the list of database elements X." Also, Vi means "Ti attempts to validate," and Wi(X); means that "Ti finishes, and its write set was X." Tell what happens when each sequence is processed by a validation-based scheduler.

      a) R1(A, B); R2(B, C); V1; R3(C, D); V3; W1(A); V2; W2(A); W3(B);
      b) R1(A, B); R2(B, C); V1; R3(C, D); V3; W1(A); V2; W2(A); W3(D);
      c) R1(A, B); R2(B, C); V1; R3(C, D); V3; W1(C); V2; W2(A); W3(D);
  5. (Exercise 17.2. R&G pp. 574) Consider the following classes of schedules: serializable (S), conflict-serializable (CS), and view-serializable (VS). Recall that their definitions consider only the committed transactions (i.e. they ignore aborted transactions) and that the implications CS-->VS-->S hold.
    Consider also the following classes of schedules: recoverable (R), avoid-cascading-aborts (ACA), and strict (S). They do depend on commit/abort actions and the implications S-->ACA-->R hold. For each of the schedules below, indicate the strictness class they belong to,. For each schedule you need to indicate two classes, e.g. S, ACA, or VS, R, or NONE,NONE, or S,n/a (if the commit/abort operations are missing).

  6. 1. T1:R(X), T2:R(X), T1:W(X), T2:W(X)
    2. T1:W(X), T2:R(Y), T1:R(Y), T2:R(X)
    3. T1:R(X), T2:R(Y), T3:W(X), T2:R(X), T1:R(Y)
    4. T1:R(X), T1:R(Y), T1:W(X), T2:R(Y), T3:W(Y), T1:W(X), T2:R(Y)
    5. T1:R(X), T2:W(X), T1:W(X), T2:Abort, T1:Commit
    6. T1:R(X), T2:W(X), T1:W(X), T2:Commit, T1:Commit
    7. T1:W(X), T2:R(X), T1:W(X), T2:Abort, T1:Commit
    8. T1:W(X), T2:R(X), T1:W(X), T2:Commit, T1:Commit
    9.T1:W(X), T2:R(X), T1:W(X), T2:Commit, T1:Abort
    10. T2:R(X), T3:W(X), T3:Commit, T1:W(Y), T1:Commit, T2:R(Y), T2:W(Z), T2:Commit
    11. T1:R(X), T2:W(X), T2:Commit, T1:W(X), T1:Commit, T3:R(X), T3:Commit
    12. T1:R(X), T2:W(X), T1:W(X), T3:R(X), T1:Commit, T2:Commit, T3:Commit

  7. R&G (pp. 598)
    1. Exercise 18.4 Consider the execution shown in the following table (same as Fig. 18.7 in the book):


      LSN Action pLSN uLSN
      00 update: T1 writes P2    
      10 update: T1 writes P1    
      20 update: T2 writes P5    
      30 update: T3 writes P3    
      40 T3 commit    
      50 update: T2 writes P5    
      60 update: T2 writes P3    
      70 T2 abort    
             
             
             

      Expand the figure to show prevLSN and undonextLSN values. Then show the actions taken to rollback transaction T2. Finally, show the log after T2 is rolled back, including all prevLSN and undonextLSN values in the log records.
    2. Exercise 18.5 Consider the execution shown in in the table below (same as Figure 18.8 in the book):
      LSN Action pLSN uLSN
      00 begin_checkpoint    
      10 end_checkpoint    
      20 update: T1 writes P1    
      30 update: T2 writes P2    
      40 update: T3 writes P3    
      50 T2: commit    
      60 update: T3 writes P2    
      70 T2: end    
      80 update: T1 writes P5    
      90 T3: abort    
        CRASH, RESTART    

      In addition, the system crashes during recovery after writing two log records to stable storage and again after writing another two log records.
      1. What is the value of the LSN stored in the master log record?
      2. What is done during Analysis?
      3. What is done during Redo?
      4. What is done during Undo?
      5. Show the log when recovery is complete, including all non-null prevLSN and undonextLSN values in the log records.
  8. Consider the following EDB predicates:

    Female(x)
    Male(x) -- everybody in the database is either Female or Male
    Parent(x,y) -- x is a parent of y

    Write stratified-datalog¬ programs to compute each of the queries below:

    1. Find the "Mitochondrial Eve", also known as "ancestral mother". This is a female x that is the great-great-... grand-mother of everyone in the database. You may assume that the database contains exactly one mitochondrial Eve. The real mitochondrial eve lived about 150,000 years ago http://en.wikipedia.org/wiki/Mitochondrial_Eve

    2. Two individuals x1, x2 are of the "same generation" if they have a common parent, or if they have parents in the same generation. Compute all individuals in the same generation with Alice.

    3. Now assume that the database represents a royal dynasty. In addition to the EDBs above, we also have:

      FirstKing(x) -- the first king of the dynasty
      Yob(x,year) -- year of birth of x
      Alive(x)

      Write a program to compute the current king (assume the dynasty follows the salic law http://en.wikipedia.org/wiki/Salic_law under which only males can become kings). The inheritance rule is: the succession goes to the oldest surviving son of the king, or, if there is none, then the rule applies to the male brother of the king, and so on. For example, if the ancestral King had two sons y and z, where y was born before z, and if no male descendants of y is alive, then the next king is a male descendant of z. In general, if x is an older brother of y, then x and all his living successors take precedence in the succession to y and his living successors.

  9. Incremental view maintenance. For each of the queries Q below, compute Delta(Q) assuming that we insert the tuples Delta(R) in the relation R. Assume that only the relation R is updated.


    Q1(x,y) = S(x,z), R(z,v), T(v,y)
    Q2(x,y) = R(x,z), R(z,y)
    Q3(x,y) = S(x,z), R(z,v), T(v,w), R(w,s), K(s,y)


  10. Suppose B(R) = 10,000 and T(R) = 500,000. There is an index on R.a, and let V(R,a) = k, for some number k. Give the cost of σa=0(R) as a function of k, under the following circumstances. You may neglect the disk I/O's needed to access the index itself.
    1. The index is clustered
    2. The index is not clustered
    3. No index is used
    4. Find the values k that separate the cases above. Your answer should be something like this: "for k < 44 an unclustered index is better, while for k > 44 a clustered index is better".

  11. Consider the following query:

    SELECT *
    FROM R, S, T, U
    WHERE R.A='abc' and R.B=S.B and S.C=T.C and T.D=U.D and U.E='efg';

    Estimate the number of tuples returned by the query, assuming the following statistics on the database:
    T(R) = 1M, V(R,A) = 100, V(R,B) = 1000
    T(S) = 1M, V(S,B) = 100, V(S,C) = 1000
    T(T) = 1M, V(T,C) = 100, V(T,D) = 1000
    T(U) = 1M, V(U,D) = 100, V(U,E) = 1000

  12. Provenance. Consider a relational database schema R(A); S(B; C). All queries mentioned below are assumed to be monotone queries.
    1. One of the answers a of a relational query Q has the provenance polynomial x1y12 + 2x2y1y2, where x1, x2 are annotations of two tuples in R and y1, y2 are annotations of two tuples in S.

      1. Assume that the input relations have no duplicates. If we evaluate the query under bag semantics, how many copies of a will be in the query's
        answer ?
      2. Assume that each tuple in the input relations occurs exactly twice. If we evaluate Q under bag semantics, how many copies of a are there in
        the query's answer ?
      3. What is the smallest number of tuples that need to be removed from the database instance in order to remove a from the answer to Q ?
      4. Here we evaluate the query under set semantics. Suppose that the tuple annotated with y2 was incorrect (it contains some incorrect value), and should not be in the database. Is the answer a correct, or did it become incorrect ?

    2. Consider the following database instance:

      R =
      A  
      a1 x1
      a2 x2

      S=
      A B  
      a1 b1 y1
      a1 b2 y2
      a2 b2 y3


      For each polynomial below, write a query that has an answer whose provenance is that polynomial. Your query may include inequality operators, <, >, !=.

      x1y1+x1y2+x2y3

      x1y12+x1y22+x1y2y3+x2y2y3+x2y32

      x12y12+x12y22+2x1x2y2y3+x22y32