- [40 points] Query execution.
Note: This problem is
designed to make you appreciate how a relational query engine
processes a large data instance. You will be asked to play with SQL
Server, but the answers you are asked to turn in are rather
trivial. The goal is to get you to try things out, not to
solve a quiz. You are strongly encouraged to try out more than what
you are asked to do. Finally, be aware that these SQL queries will
put a significant load on the SQL Server: don't wait until the last
day to run your queries, or you may find the server swamped
by your colleagues.
Consider the three files depts.txt,
depts_prods.txt, and
prods.txt.
Each text file contains a table with fixed-length fields, with the first
line containing the attribute names. The attribute names and their
lengths are:
depts(DID(20), NAME(40), REGION(30))
depts_prods(DID(20), PID(20))
prods(PID(20), NAME(40), PRICE(20))
All fields are CHAR,
except PRICE, which is a FLOAT (represented in the file using 20 characters).
- Load this data into SQL Server in your own database. Declare
all fields of type CHAR, except for PRICE, which should be
FLOAT. All fields should be NOT NULL. Do not declare UNIQUE or
KEY constraints on any attribute (normally they should be unique, but the
data is not clean and they are not unique). Do not create any
indexes yet. Instruct SQL Server to compute statistics for each
field: go to Tools->Manage Statistics and figure out what you
need to do. A sketch of the table definitions is given below.
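The following is only a minimal sketch (the statistics names are placeholders,
and loading the data itself can be done with BULK INSERT or the import wizard):
CREATE TABLE depts (
  DID    CHAR(20) NOT NULL,
  NAME   CHAR(40) NOT NULL,
  REGION CHAR(30) NOT NULL
)
CREATE TABLE depts_prods (
  DID CHAR(20) NOT NULL,
  PID CHAR(20) NOT NULL
)
CREATE TABLE prods (
  PID   CHAR(20) NOT NULL,
  NAME  CHAR(40) NOT NULL,
  PRICE FLOAT    NOT NULL
)
-- statistics can also be created from T-SQL instead of the GUI,
-- e.g. (repeat for the remaining columns):
CREATE STATISTICS s_depts_did ON depts(DID)
CREATE STATISTICS s_depts_region ON depts(REGION)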
- Write two SQL queries. The first computes, for each region,
how many departments are in that region. (More precisely:
for each region, how many records there are in depts with
that region.) The second computes, for each region, how many
products have been sold by departments in that region (for this
query you only need to join depts with depts_prods;
you don't need to join with prods). Both queries should be
very simple count(*) with GROUP-BY queries: the first without any
joins, the second with one join (a sketch is given at the end of this item).
Turn in the following
information for each of the two queries: the estimated running
time and the actual running time. In addition, indicate what join
method the query optimizer selected for the second query (i.e. you
will say 'nested loop' or 'hash join' or ...).
For this and the following question you need to inspect
the query plan. Click on the corresponding button at the top of
the Query Analyzer, or hit CTRL-L. You also need to inspect the
estimated cost: click on the query plan's root node and read the
'subtree cost' (in seconds). Finally, you need to measure the
actual running time: this is shown by the Query Analyzer at the
bottom right; times under 1 second are shown as 0.
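The two queries described above should look roughly like the following
sketch (this is just one way to write them):
-- (1) number of departments per region, no join
SELECT depts.region, count(*)
FROM depts
GROUP BY depts.region
-- (2) number of products sold per region, one join
SELECT depts.region, count(*)
FROM depts, depts_prods
WHERE depts.did = depts_prods.did
GROUP BY depts.region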
- Consider the following query, which computes, for
each region, the average price of
all products sold in that region:
SELECT depts.region, avg(prods.price)
FROM prods, depts_prods, depts
WHERE depts.did = depts_prods.did and
depts_prods.pid = prods.pid
GROUP BY depts.region
Turn in the following: the query's answer, the estimated and
actual running times, and the join methods selected by the query
optimizer for the two joins.
- Inspect the query plan at point (c) again: there
are two aggregate operators instead of one! We will refer
to them as the 'left' and the 'right' aggregate operator. Indicate
which of the two operators is redundant, i.e. could be removed
from the query plan and the query would still be correct (you have
to say 'the left aggregate operator is redundant' or 'the right
aggregate operator is redundant'). Next, explain why the query
optimizer introduced that redundant aggregate operator. Finally,
give the algebraic equation from the class notes that justifies
the introduction of this redundant aggregate operator.
- Consider the following two join
queries:
SELECT depts.region, depts_prods.did
FROM depts, depts_prods
WHERE depts.did = depts_prods.did
and
SELECT count(*)
FROM depts, depts_prods
WHERE depts.did = depts_prods.did
Report the estimated and actual running times for both
queries. The first query may take significantly longer than the
second: abort it if it runs over 3-4 minutes (it should terminate
before that, however). Explain briefly why it takes so much
longer, even though the query plans are almost identical. Report
the estimated number of tuples for the first query, and the actual
number of tuples it returns (you can find out the
latter even if you have to abort it).
At this point you may
want to play with additional queries, check their plans and their
running times.
- Now create four indexes, on the DID and PID attributes.
Check the Books Online pages for the syntax of the CREATE INDEX command.
Create clustered indexes on each of the tables depts and
prods, and unclustered indexes on depts_prods. Remember
that neither DID nor PID is a unique attribute in any of the tables. Answer
questions (c) and (e) again, on the database with indexes. Indicate where
you see significant changes. A sketch of the index definitions is given
after this item.
You may want to, but are not required to, experiment with
different types of indexes (e.g. clustered/unclustered) and check
their effect on query execution time.
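A minimal sketch of the four CREATE INDEX statements (the index names
are placeholders; check Books Online for the exact options):
-- clustered, non-unique indexes on depts and prods
CREATE CLUSTERED INDEX idx_depts_did ON depts(DID)
CREATE CLUSTERED INDEX idx_prods_pid ON prods(PID)
-- unclustered (nonclustered) indexes on depts_prods
CREATE NONCLUSTERED INDEX idx_dp_did ON depts_prods(DID)
CREATE NONCLUSTERED INDEX idx_dp_pid ON depts_prods(PID)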
- [30 points] Consider two tables, R(A,B) and
S(C,D), where C is a key in S, and the selection-join operation
σ_{A=v}(R) ⋈_{B=C} S. Assume that a
record is as large as a block and that
T(R) = B(R) = B(S) = 10000. There is a clustered index on S.C, and we
need four disk accesses to look up a
value in that index (including the disk access to get the record).
We consider the following methods for the join:
- block nested loop join
- partitioned hash-join
- index join
Compute the I/O cost of each join method in each of the
four cases below, and indicate the best join method for each case.
(For reference, the generic cost formulas are sketched after the table.)
         | V(R,A) = 1 | V(R,A) = 10000
M = 101  |            |
M = 5002 |            |
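For reference only, a sketch of the standard textbook cost formulas for
joining R' = σ_{A=v}(R) with S: here T(R') ≈ T(R)/V(R,A), B(R') = T(R')
(one record per block), the cost of computing R' by scanning R (B(R) I/Os,
since there is no index on R.A) must be added, and the exact conventions
(e.g. whether intermediate results are written back to disk) may differ
from the class notes:
\[
\begin{aligned}
\text{block nested loop join:}\quad & B(R') + \Bigl\lceil \frac{B(R')}{M-1} \Bigr\rceil \cdot B(S)\\
\text{partitioned hash join:}\quad & 3\,\bigl(B(R') + B(S)\bigr) \quad \text{(two-pass case; } B(R') + B(S) \text{ if one input fits in memory)}\\
\text{index join (index on } S.C\text{):}\quad & B(R') + 4 \cdot T(R')
\end{aligned}
\]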
- [25 points] After a system crash, the undo log (with non-quiescent
checkpointing) contains the following data:
![](hw4.ht1.gif)
a. What are the correct values of the three <START CKPT ????> records?
You have to provide three correct values, one for each "????".
b. Assuming that the three <START CKPT ????> records are correctly stored
in the log, according to your answer to (a), show which elements are
recovered by the undo recovery manager and compute their values after
recovery.
c. Indicate what fragment of the log the recovery manager needs to read.
- [5 points] Please answer the following questions.
- How long did it take you to complete this assignment?
- What did you like the best about this assignment?
- What did you like the least about this assignment?
- What helped you learn the best in this assignment?
- What distracted from your learning in this assignment?