CSE 444 Homework 5

Objectives:

To understand issues involved in processing large data instances, query execution, query optimization, and recovery.

Reading Assignments:

Chapters 15.1, 15.2, 15.3, 15.4, 15.5, 15.6, 17.2

Number of points:

100

Due date:

December 9, 2002

Software tools:

SQL Server, a general-purpose programming language of your choice (e.g. C++, Java, C#), and a graph drawing tool of your choice (e.g. Excel, gnuplot, Mathematica, Maple)

Questions:

  1. [25 points] Hash functions. Mr. Frumble decided to implement his favorite hash function as follows:

                hash(s1 s2 ... sk, n) = (s1 + s2 + ... + sk) mod n

    An implementation in Visual C++ is given here: hs5-hash.cpp

    Most major database systems use this hash function, so Mr. Frumble thought he wouldn't be fired for implementing it this way. In his application he has to match a large number of transactions against a set of 10000 bank accounts, and he wants to use the hash function to partition the accounts into 100 buckets. "That should give me about 100 accounts in each bucket, give or take a few," thinks Mr. Frumble, "so I should be able to find the corresponding account for each transaction with only 100 comparisons, on average." Mr. Frumble has a point, since the accounts are really very uniformly distributed: namely, they are the following strings:

        A0000
        A0001
        A0002
        . . . .
        A9998
        A9999

    That is, each account starts with the letter 'A', followed by four digits. Each of the ten thousand possible combinations of digits occurs exactly once. But, to Mr. Frumble's surprise, the application runs much slower than he expects, as if he needs to compare each transaction with far more than 100 accounts.
    1. [10 points] Help Mr. Frumble understand what is going on by drawing the histogram of the bucket sizes. This is a graph whose Ox axis has values 0, 1, 2, ..., 99, corresponding to the buckets, and whose Oy axis has values between 0 and 10000, showing the number of accounts in each bucket. You have to turn in a graph generated by any tool (say Excel, or gnuplot, or Matlab, etc.). To generate the data for this graph you may either write a C++ program (see the sketch after this problem), or you may derive a mathematical formula that computes the size of each bucket and evaluate it in Excel or any other tool. Please turn in any auxiliary material that you used to generate the graph.
    2. [5 points] Using the data you collected at point (a), compute how much slower Mr. Frumble's application runs compared to what he expected. The running time of the application is quadratic in the number of items in each bucket: that is, if the buckets have 4, 52, 9, 125, 58, ... items each, then the running time is 4^2 + 52^2 + 9^2 + 125^2 + 58^2 + ... For this you need to compute the running time Mr. Frumble expects, assuming all buckets have the same size, and compare it to the running time on the real hash table.
    3. [10 points] Design a better hash function that would partition Mr. Frumble's accounts more uniformly, and show the new histogram. Compute the running time from point (b) for your new hash function. You have to turn in a program implementing the new hash function, a histogram, and a number indicating the running time.
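
    For reference, below is a minimal sketch of the kind of program mentioned in question (a). It is not the posted hs5-hash.cpp; it simply assumes the hash is exactly the character-sum formula shown above, and prints the size of each of the 100 buckets for the accounts A0000 ... A9999.

        // Minimal sketch (not the posted hs5-hash.cpp): assumes the hash is
        // exactly the character-sum formula above, and prints the size of
        // each of the 100 buckets for the accounts A0000 ... A9999.
        #include <cstdio>
        #include <string>

        int hash_fn(const std::string &s, int n) {
            int sum = 0;
            for (char c : s) sum += c;   // sum of the character codes ...
            return sum % n;              // ... taken modulo the number of buckets
        }

        int main() {
            const int NBUCKETS = 100;
            int count[NBUCKETS] = {0};
            char account[6];
            for (int i = 0; i < 10000; i++) {
                std::snprintf(account, sizeof(account), "A%04d", i);  // A0000 ... A9999
                count[hash_fn(account, NBUCKETS)]++;
            }
            for (int b = 0; b < NBUCKETS; b++)
                std::printf("%d\t%d\n", b, count[b]);  // bucket number, bucket size
            return 0;
        }

    The two columns it prints (bucket number, bucket size) can be pasted into Excel or gnuplot to produce the histogram.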


  2. [40 points] Query execution.

    Note: This problem is designed to make you appreciate the issues that come up in dealing with large data instances. Some of the questions ask you to play with SQL Server, and the answers you have to turn in are sometimes quite trivial. The goal is to get you to try things out, not to solve a quiz. You are strongly encouraged to try out more than what you are asked to do. Finally: be aware that these SQL queries will load SQL Server significantly: don't wait until the last day to run your queries, or you will find that SQL Server is swamped by your colleagues.

    Consider the three files depts.txt, detps_prods.txt, prods.txt. Each text file contains a table with fixed-length fields, with the first line containing the attribute names. The attribute names and their lengths are:

            depts(DID(20),  NAME(40), REGION(30))
            depts_prods(DID(20), PID(20))
            prods(PID(20),  NAME(40), PRICE(20))


    All fields are CHAR, except PRICE, which is a FLOAT (represented in the text file using 20 characters).
    1. [5 points] Download the files to your local disk. Write a program in your favorite language (C++ or Java or C# or any other language) to count the number of occurrences of each REGION in depts, and measure its running time.  Your program should print something like:

              region SOUTH occurs 495034 times
              region EAST    occurs 495382 times
              . . . . .

      There are very few distinct REGIONs in the database (under a dozen or so), and you may use this information when writing your program. (A minimal sketch of such a program appears after this problem's last question.)
    2. [5 points] Suppose you had to compute the same count of occurrences of each REGION in the join of depts and depts_prods on depts.DID = depts_prods.DID. You use the same programming language, and you decide to use a block nested loop join to implement the join. Assume you have 10MB of main memory. Based on the running time you measured in question (a), on the file sizes, and on your knowledge of the inner workings of the block nested loop join, estimate how long your program would take to run; explain your calculations. Notice: you are not required to implement this program, but you are welcome to do so.
    3. [5 points] Now load this data into SQL Server, in your own database. Declare all fields of type CHAR, except for PRICE, which should be FLOAT. All fields should be NOT NULL. Do not declare UNIQUE constraints or KEYs for any attribute (they are not unique). Do not create any indexes yet. Make sure that SQL Server computes statistics for each field (go to Tools -> Manage Statistics, and figure out what you need to do). Write SQL queries for questions (a) and (b), and turn in the following information for each of the two queries: the estimated running time and the actual running time. In addition, indicate what join method the query optimizer selected for the second query (i.e. you will say 'nested loop' or 'hash join' or ...).

      For this question and the following ones you need to inspect the query plan: click on the corresponding button at the top of the Query Analyzer, or hit CTRL-L. You also need to inspect the estimated cost: click on the query plan's root node and read the 'subtree cost' (in seconds). Finally, you need to measure the actual running time: this is shown by the Query Analyzer at the bottom right; times under 1 second are shown as 0.
    4. [5 points] Compute, for each region, the average price of all products sold in that region:

          SELECT depts.region, avg(prods.price)
          FROM    prods, depts_prods, depts
          WHERE depts.did = depts_prods.did and
                        depts_prods.pid = prods.pid
          GROUP BY depts.region

      Turn in the following: the query's answer, the estimated and actual running times, and the join methods selected by the query optimizer for the two joins.
    5. [5 points] Inspect the query plan at point (d) again: there are two aggregate operators instead of one! We will refer to them as the 'left' and the 'right' aggregate operator. Indicate which of the two operators is redundant, i.e. could be removed from the query plan and the query would still be correct (you have to say 'the left aggregate operator is redundant' or 'the right aggregate operator is redundant'). Next, explain why the query optimizer introduced that redundant aggregate operator. Finally, give the algebraic equation from the class notes that justifies the introduction of this redundant aggregate operator.
    6. [5 points] Consider the following two join queries:

          SELECT depts.region, depts_prods.did
          FROM depts, depts_prods
          WHERE depts.did = depts_prods.did

      and

          SELECT count(*)
          FROM depts, depts_prods
          WHERE depts.did = depts_prods.did

      Report the estimated and actual running times for both queries. The first query may take significantly longer than the second: abort it if it runs over 3-4 minutes (it may terminate before that, however). Explain briefly why it takes so much longer, even though the query plans are almost identical. Report the estimated number of tuples for the first query, and the actual number of tuples returned by the first query (you can find out the latter even if you have to abort it).

      At this point you may want to play with additional queries, check their plans and their running times.
    7. [10 points] Now create four indexes, on the DID and PID attributes. Check the Books Online pages for the syntax of the CREATE INDEX command. Create clustered indexes on each of the tables depts and prods, and unclustered indexes on depts_prods. Remember that neither DID nor PID is a unique attribute, in any of the tables. Answer questions (d) and (f) again, on the database with indexes. Indicate where you see significant changes.

      You may want to experiment here with different types of indexes (e.g. clustered/unclustered) and check their effect on query execution time.
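
    For question (a) of this problem, the following is a minimal sketch of one possible program (referenced above). It assumes the field widths given in the problem statement, DID(20), NAME(40), REGION(30), with the first line of depts.txt holding the attribute names; the use of std::map and std::clock is just one possible choice, not a required design.

        // Minimal sketch for question (a): count the occurrences of each REGION
        // in depts.txt, assuming fixed-length fields DID(20), NAME(40), REGION(30)
        // and a first line containing the attribute names.
        #include <ctime>
        #include <fstream>
        #include <iostream>
        #include <map>
        #include <string>

        int main() {
            std::ifstream in("depts.txt");
            std::string line;
            std::getline(in, line);                 // skip the header line
            std::map<std::string, long> counts;     // few distinct REGIONs, so a small map suffices
            std::clock_t start = std::clock();
            while (std::getline(in, line)) {
                if (line.size() < 90) continue;                    // 20 + 40 + 30 characters per record
                std::string region = line.substr(60, 30);          // the REGION field
                region.erase(region.find_last_not_of(' ') + 1);    // trim trailing blanks
                counts[region]++;
            }
            for (const auto &p : counts)
                std::cout << "region " << p.first << " occurs " << p.second << " times\n";
            std::cout << "elapsed CPU time: "
                      << double(std::clock() - start) / CLOCKS_PER_SEC << " s\n";
            return 0;
        }

    Note that std::clock measures CPU time; for the running time you report you may prefer a wall-clock timer.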
  3. [15 points] Consider two tables, R(A,B) and S(C,D), where C is a key in S, and the selection-join operation σ_{A=v}(R) ⋈_{B=C} S. Assume that a record is as large as a block and that T(R) = B(R) = B(S) = 10000. There is a clustered index on S.C, and we need four disk accesses to look up a value in that index (including the disk access to get the record). We consider the following methods for the join:

                - block nested loop join

                - partitioned hash-join

                - index join
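
    As a reminder (these are the standard cost formulas from the assigned reading, not part of the original problem statement), the approximate I/O costs with M buffer blocks are:

        % Approximate I/O cost formulas from the Chapter 15 reading (a reminder,
        % not part of the assignment); M denotes the number of buffer blocks.
        \[
          \text{block nested loop join} \;\approx\; B(S) + \frac{B(R)\,B(S)}{M-1},
          \qquad
          \text{partitioned hash join} \;\approx\; 3\,\bigl(B(R)+B(S)\bigr),
          \qquad
          \text{index join} \;\approx\; B(R) + \frac{T(R)}{V(R,A)}\cdot c .
        \]

    Here c is the cost of one index lookup on S.C (four disk accesses in this problem), the selection σ_{A=v}(R) is assumed to be applied while scanning R, and the partitioned hash join requires roughly min(B(R), B(S)) ≤ M^2.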

    Compute the I/O cost for each join method in each of the four cases below. Indicate in each of the four cases the best join method.

                    |  V(R,A) = 1   |  V(R,A) = 10000
          ----------+---------------+-----------------
          M = 101   |               |
          ----------+---------------+-----------------
          M = 5002  |               |


  4. [15 points] After a system crash, the undo log, which uses non-quiescent checkpointing, contains the following data:

        [log contents shown in a figure, not reproduced here]


    1. [5 points] What are the correct values of the three <START CKPT ????> records? You have to provide three correct values for the three "????"s.

    2. [5 points] Assuming that the three <START CKPT ????> records are correctly stored in the log, according to your answer at (a), show which elements are recovered by the undo recovery manager, and compute their values after recovery.

    3. [5 points] Indicate what fragment of the log the recovery manager needs to read.

  5. [5 points] Please answer the following questions online, at http://iinetsrv.cs.washington.edu/CSE444Feedback/Start.aspx
    1. How long did it take you to complete this assignment?
    2. What did you like the best about this assignment?
    3. What did you like the least about this assignment?
    4. What helped you learn the best in this assignment?
    5. What distracted from your learning in this assignment?