CSE 444 Homework 4

Objectives:
To understand query execution, recovery and concurrency control.
Reading Assignments:
Chapters: 14.2, 15.1, 15.2, 15.3, 15.4, 15.5, 15.6, 16.4
Number of points:
100
Due date:
December 7th
Assignment Tools:
SQL Server, and a programming language (C, C#, or Java)
Turnin Instructions:
TBA

Questions:
  1. [35 points] Hash functions. Mr. Frumble decided to implement his favorite hash function as follows:

                hash(s1s2...sk, n) = (s1 + s2 + ... + sk) mod n

    An implementation in Visual C++ is given here: hash.cpp
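
    For reference, here is a minimal sketch of such a function. It is only a sketch and may differ in details from the hash.cpp handed out; it assumes each character si is hashed as its integer (ASCII) code.

        #include <string>

        // Sum the character codes of the account string and reduce the total
        // modulo the number of buckets n (assumption: each character si is
        // taken to be its ASCII code, as in the formula above).
        int hash_account(const std::string &s, int n) {
            int sum = 0;
            for (std::string::size_type i = 0; i < s.size(); ++i)
                sum += (unsigned char) s[i];
            return sum % n;
        }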

    Most major database systems use this hash function, so Mr. Frumble thought he would not be fired for implementing it this way... He is running an application in which he maintains about 10,000 different bank accounts. One daily task is to check that all account numbers are unique, i.e. to detect any duplicate accounts. To avoid a brute-force quadratic search (which would take 10000*9999/2 = 49,995,000 comparisons), Mr. Frumble uses the hash function to partition the accounts into 100 buckets. He hopes that this will put about 100 accounts in each bucket, give or take a few. Then he looks for duplicates in each bucket separately, which requires only 100*99/2 + 100*99/2 + ... + 100*99/2 comparisons (100 times, i.e. 495,000 comparisons in total). Mr. Frumble has a point, since the account numbers really are very uniformly distributed. On a typical day, his 10,000 account numbers are the following strings:

        A0000
        A0001
        A0002
        . . . .
        A9998
        A9999

    That is, each account starts with the letter 'A', followed by four digits. But, to Mr. Frumble's surprise, the application runs much slower than he expected: each new transaction seems to require significantly more comparisons than anticipated.
    1. Help Mr. Frumble understand what is going on by drawing the histogram of the bucket sizes, assuming exactly the 10,000 accounts A0000...A9999 shown above. This is a graph whose x-axis has the values 0, 1, 2, ..., 99 corresponding to the buckets, and whose y-axis has values between 0 and 10000, showing the number of accounts in each bucket. You have to turn in a graph generated by some tool such as Excel, gnuplot, or Matlab. To generate the data for this graph you may either write a C++ program (you may use the hash function provided above; a sketch of such a program appears after this list), or derive a mathematical formula that computes the size of each bucket and evaluate it in Excel or any other tool. Please turn in any auxiliary material that you used to generate the graph.
    2. Using the data you collected in part (a), compute how much slower Mr. Frumble's application runs compared to what he expected, assuming the 10,000 accounts above. Recall that the number of comparisons for duplicate detection is quadratic in the number of items in each bucket: that is, if the buckets contain 4, 52, 9, 125, 58, ... items each, then the running time is 4*3/2 + 52*51/2 + 9*8/2 + 125*124/2 + 58*57/2 + ... Show how much larger the number of comparisons is compared to the uniform bucket distribution.
    3. Design a better hash function that would partition Mr. Frumble's accounts more uniformly, and show the new histogram. Compute the running time as in part (b) for your new hash function. For this question there is no single best solution: feel free to use your creativity. However, a hash function is not considered a good solution if it works well only on the 10,000 accounts shown above, makes other strict assumptions about the specific representation of account numbers, or does not use all the characters in each account number. You have to turn in a program implementing the new hash function, a histogram, and a number indicating the running time.
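
    As mentioned in part (1), a minimal sketch of a data-generation program is given below. It is only a sketch under the assumptions above (100 buckets, accounts A0000...A9999, each character hashed as its ASCII code); any equivalent program, or a mathematical formula evaluated in Excel, is equally acceptable.

        #include <cstdio>
        #include <string>

        // Same hash as sketched earlier: sum of character codes mod n.
        static int hash_account(const std::string &s, int n) {
            int sum = 0;
            for (std::string::size_type i = 0; i < s.size(); ++i)
                sum += (unsigned char) s[i];
            return sum % n;
        }

        int main() {
            const int NBUCKETS = 100;               // Mr. Frumble's 100 buckets
            int count[NBUCKETS] = {0};
            char acct[6];
            for (int i = 0; i < 10000; ++i) {
                std::sprintf(acct, "A%04d", i);     // accounts A0000 ... A9999
                ++count[hash_account(acct, NBUCKETS)];
            }
            for (int b = 0; b < NBUCKETS; ++b)      // one "bucket size" pair per line
                std::printf("%d %d\n", b, count[b]);
            return 0;
        }

    Redirecting the program's output to a file gives two columns (bucket number, bucket size) that can be plotted directly in Excel or gnuplot.
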
  2. [30 points] Query execution.

    Note: This problem is designed to make you appreciate how a relational query engine processes a large data instance. You will be asked to play with SQL Server, but the answers you are asked to turn in are rather trivial. The goal is to get you to try things out, not to solve a quiz. You are strongly encouraged to try out more than what you are asked to do. Finally, be aware that these SQL queries will put a significant load on the SQL Server: don't wait until the last day to run your queries, or you may find the server swamped by your colleagues.

    Consider the three files depts.txt, depts_prods.txt, prods.txt. Each text file contains a table with fixed-length fields, and the first line contains the attribute names. The attribute names and their lengths are:

        depts(DID(20), NAME(40), REGION(30))
        depts_prods(DID(20), PID(20))
        prods(NAME(40), PID(20), PRICE(20))

    All fields are CHAR except PRICE, which is a FLOAT (represented in the file using 20 characters).

    1. This data is already loaded into 3 tables in a database named HW5. Create a database of your own and create a copy of these tables in your database. In the subsequent parts of this problem, you will work on your copy of the tables and not the ones in the database HW5. Do not create any indexes yet. The tables in HW5 already have some statistics computed on them. If you need to compute more, you can click the + sign next to the name of the table. Under the table name, you will see "Statistic". Right click on that to manage the statistics. 
    2. Run the following two SQL queries. The first computes, for each region, how many departments are in that region (more precisely: for each region, how many records there are in depts with that region):
           select depts.region, count(*)
           from depts
           group by depts.region

      The second computes for each region how many products have been sold by departments in that region:

           select depts.region, count(*)
           from depts, depts_prods
           where depts.did = depts_prods.did
           group by depts.region

      Turn in the following information for each of the two queries: the estimated running time and the actual running time. In addition, indicate what join method the query optimizer selected for the second query (i.e. you will say 'nested loop' or 'hash join' or ...).

      For this and the following question you need to inspect the query plan. Click on the corresponding button at the top of the Query Analyzer, or hit CTRL-L. You also need to inspect the estimated cost: click on the query plan's root node and read the 'subtree cost' (in seconds). Finally, you need to measure the actual running time: this is shown by the Query Analyzer at the bottom right; times under 1 second are shown as 0.

    3. Consider the following query, which computes, for each region, the average price of all products sold in that region:

          SELECT depts.region, avg(prods.price)
          FROM prods, depts_prods, depts
          WHERE depts.did = depts_prods.did and
                depts_prods.pid = prods.pid
          GROUP BY depts.region

      Turn in the following: the query's answer, the estimated and actual running times, and the join methods selected by the query optimizer for the two joins.

    4. Inspect the query plan from part (c) again: there are two aggregate operators instead of one! We will refer to them as the 'left' and the 'right' aggregate operator. Indicate which of the two operators is redundant, i.e. could be removed from the query plan and the query would still be correct (you have to say 'the left aggregate operator is redundant' or 'the right aggregate operator is redundant'). Next, explain why the query optimizer introduced that redundant aggregate operator. Finally, give the algebraic equation from the class notes that justifies the introduction of this redundant aggregate operator.
    5. Consider the following two join queries:

          SELECT depts.region, depts_prods.did
          FROM depts, depts_prods
          WHERE depts.did = depts_prods.did

      and

          SELECT count(*)
          FROM depts, depts_prods
          WHERE depts.did = depts_prods.did

      Report the estimated and actual running times for both queries. The first query may take significantly longer than the second: abort it if it runs for more than 3-4 minutes (it should terminate before that, however). Explain briefly why it takes so much longer, even though the two queries are similar. Report the estimated number of tuples for the first query, and the actual number of tuples returned by the first query (you can find out the latter even if you have to abort it).

      At this point you may want to play with additional queries, check their plans and their running times.

    6. Now create four indexes on the DID and PID attributes. Check the Books Online pages for the syntax of the CREATE INDEX command. Create clustered indexes on each of the tables depts and prods, and unclustered indexes on depts_prods. Remember that neither DID nor PID is a unique attribute in any of the tables. Answer questions (b) and (c) again, on the database with indexes. Indicate where you see significant changes.

      You may want to (but are not required to) experiment with different types of indexes (e.g. clustered vs. unclustered) and check their effect on query execution time.
  3. [35 points] Selectivity estimation. Solve Exercise 16.4.1 (a) to (i) on page 834 (ten questions).