CSE 444 Homework 4

Objectives:: To understand query execution, recovery and concurrency control.
Reading Assignments:: Chapters: 15.1, 15.2, 15.3, 15.4, 15.5, 15.6, 17.2, 18
Number of points:: 100
Due date:: December 6th
Assignment Tools:: SQL Server, and a programming language (C, C#, or Java)

[15 points] After a system’s crash, the undo-log using non-quiescent checkpointing contains the following data:



      a.      What are the correct values of the three <START CKPT ????> records ? You have to provide three correct values for the three “????”s.

      b.      Assuming that the three <START CKPT ???> records are correctly stored in the log, according to your answer at a., show which elements are recovered by the undo recovery manager and compute their values after recovery.

        c.      Indicate what fragment of the log the recovery manager needs to read.

[15 points] The SuperSQL database system stores its undo log file in a table, with the following schema:

Log(N, T, E, V)

where N is the entry number (0, 1, 2, 3, …), T is the transaction id, E is the element id, and V is the old value. A log entry of the form <T, E, V> is simply represented by the tuple (N, T, E, V), where N is the entry number; here E > 0. The log entries <START T>, <COMMIT T>, and <ABORT T> are represented by a tuple (N, T, E, null), where E=-1 for START, E=-2 for COMMIT, and E=-3 for ABORT. For example, the log:

. . . .

Is represented by the table:

N	T	E	V
0	T1	-1
1	T1	X1	55
2	T2	-1
3	T2	X2	99
4	T1	-2
. . .

Recall that each transaction starts and ends at most once, i.e. a sequence <START T> … <COMMIT T> … <START T> … will not occur in the log. Moreover, any update by the transaction T will occur between its <START T> and <COMMIT T>, or between <START T> and <ABORT T> respectively.

Write a SQL query that can be run during database recovery, after a system crash. The answer to your query should return a table with two attributes, E, and V, indicating which elements have to be updated with what values. You should include each element E at most once in your answer: otherwise it would be ambiguous how to update it.

[20 points] Concurrency Control Solve the following problems from the textbook:
1. Excercise 18.2.4 (a,b,d) and (i, ii, iii) on page 931
2. Excercise 18.8.1 (a,b) on page 978
3. Excercise 18.9.1 (a,b) on page 984
[25 points] Hash functions Mr. Frumble decided to implement his favorite hash function as follows:

            hash(s₁s₂...s_k, n) = (∑_i=1,n s_i ) mod n

An implementation in Visual C++ is given here: hash.cpp

Most major database systems use this hash function, so Mr. Frumble thought he won't be fired for implementing it this way... He is running an application where he maintains about 10,000 different bank accounts. In one task, he has to check daily if all account numbers are unique, i.e. detect any duplicate accounts. To avoid a brute-force quadratic search algorithm (which would take 10000*9999/2 comparisons) Mr. Frumble uses the hash function to partition the accounts into 100 buckets. He hopes that this will put about 100 accounts in each bucket, give or take a few. Then, he will look for duplicates in each bucket separately, which requires only 100*99/2+100*99/2+...+100*99/2 comparisons (100 times). Mr. Frumble has a point, since the accounts are really very uniformly distributed. On a typical day, his 10,000 account numbers are the following strings::

    A0000
    A0001
    A0002
    . . . .
    A9998
    A9999

That is, each account starts with the letter 'A', followed by four digits. But, to Mr. Frumble's surprise, the application runs much slower than he expects: each new transaction seems to require significantly more comparisons than he expected.

Help Mr. Frumble understand what is going on, by drawing the histogram of the bucket sizes, assuming exactly the 10,000 accounts A0000...A9999 shown above. This is a graph whose Ox axis has values 0, 1, 2, ..., 99 corresponding to the buckets, and the Oy axis has values between 0 and 10000, showing the number of accounts in that bucket. You have a turn in a graph generated by some tool such as Excel, or gnuplot, or mathlab. To generate the data for this graph you may either write a C++ program (you may use the hash function provided above) or you may derive a mathematical formula that computes the size of each bucket, then run it on Excel or on any other tool. Please turn in any auxiliary material that you used to generate the graph.
Using the data you collected at point (a), compute how much slower Mr. Frumble's application runs compared to what he expected, assuming the 10000 accounts above. Recall that the number of comparisons for duplicate detection is quadratic in the number of items in each bucket: that is, if the buckets have 4, 52, 9, 125, 58, ... items each, then the running time is 4*3/2 + 52*51/2 + 9*8/2 + 125*124/2 + 58*57/2 + ... Show how much larger the number of comparisons is, conpared to the uniform bucket distribution.
Design a better hash function that would partition Mr. Frumble's accounts more uniformly, and show the new histogram. Compute the running time at point (b) for your new hash function. For this question there is no best solution: feel free to use your creativity. However, some hash functions are not considered good solutions, if they work well only on the 10,000 accounts shown above, or make other strict assumptions about the specific representation of accounts, or if they do not use all the characters in each account number. You have to turn in both a program implementing the new hash function, a histogram, and a number indicating the running time.

[25 points] Query execution.

Note: This problem is designed to make you appreciate how a relational query engine processes a large data instance. You will be asked to play with SQL Server, but the answers you are asked to turn in are rather trivial. The goal is to determine you to try things out, not to solve a quiz. You are strongly encouraged to try out more than what you are asked to do. Finally: be aware that these SQL queries will load the SQL Server significantly: don't wait until the last day to run your queries, or you may find the server swamped by your colleagues.

Consider the three files depts.txt, depts_prods.txt, prods.txt . Each text file contains a table with fixed-length fields, with the first line containing the attribute names. The attribute names and their length are:
```
    depts(DID(20), NAME(40), REGION(30)))
    depts_prods(DID(20), PID(20))
    prods(NAME(40), PID(20), PRICE(20))
  
```
All fields are CHAR except PRICE which is a FLOAT (which is represented using 20 characters).
1. Load this data in SQL server in your own database. Declare all fields of type CHAR except for PRICE which should be FLOAT. All fields should be NOT NULL. Do not declare UNIQUE or KEYS for any attribute (normally they should be unique, but the data is not clean and they are not unqiue). Do not create any indexes yet. Instruct SQL server to compute statistics for each field: go to Tools->manage statistics, and figure out what you need to do.
2. Run the following two SQL queries. The first computes for each region how many departments are in that that region. (More precisely: for each region how many records there are in depts with that region.):
```
     select depts.region, count(*)
     from depts
     group by depts.region
```
  The second computes for each region how many products have been sold by departments in that region:
```
     select depts.region, count(*)
     from depts, depts_prods
     where depts.did = depts_prods.did 
      group by depts.region
```
  Turn in the following information for each of the two queries: the estimated running time and the actual running time. In addition, indicate what join method the query optimizer selected for the second query (i.e. you will say 'nested loop' or 'hash join' or ...).
  
  For this and the following question you need to inspect the query plan. Click on the corresponding button at the top of the Query Analyzer, or hit CTRL-L. You also need to inspect the estimated cost: click on the query plan's root node and read the 'subtree cost' (in seconds). Finally, you need to measure the actual running time: this is shown by the Query Analyzer at the bottom right; times under 1second are shown as 0.
3. Consider the following query, which computes for each region, the average price of all products sold in that region:
```
    SELECT depts.region,  avg(prods.price)
    FROM   prods, depts_prods, depts
    WHERE  depts.did = depts_prods.did and
           depts_prods.pid =  prods.pid
    GROUP BY depts.region
    
```
  Turn in the following: the query's answer, the estimated and actual running times, and the join methods selected by the query optimizer for the two joins.
4. Inspect the query plan at point (c) again: there are two aggregate operators instead of one ! We will refer to them as the 'left' and the 'right' aggregate operator. Indicate which of the two operators is redundant, i.e. could be removed from the query plan and the query would still be correct (you have to say 'the left aggregate operator is redundant' or 'the right aggregate operator is redundant'). Next, explain why the query optimizer introduced that redundant aggregate operator. Finally, give the algebraic equation from the class notes that justifies the introduction of this redundant aggregate operator.
5. Consider the following two join queries:
```
    SELECT depts.region, depts_prods.did
    FROM depts, depts_prods
    WHERE depts.did =  depts_prods.did
    
```
  and
```
    SELECT count(*)
    FROM depts, depts_prods
    WHERE depts.did =  depts_prods.did
    
```
  Report the estimated and actual running times for both queries. The first query may take significantly longer than the second: abort it if it runs over 3-4 minutes (it should terminate before that, however). Explain briefly why it takes so much longer, even though the queries are similar.. Report the number of estimated tuples in the first query, and the actual number of tuples returned by the first query (you can find out the latter even if you have to abort it).
  
  At this point you may want to play with additional queries, check their plans and their running times.
6. Now create four indexes, on the DID and PID attributes. Check the books-online pages for the syntax of the CREATE INDEX command. Create clustered indexes on each of the tables depts and prods, and unclustered indexes on detps_prods. Remember that neither DID nor PID are unique attributes, in any of the tables. Answer questions (b) and (c) again, on the database with indexes. Indicate where you see significant changes.
  
  You may want to, but are not required to, play and design different types of indexes (e.g. clustered/unclustered) and check their effectiveness on the query execution time.