- [40 points] Query execution.
Note: This problem is
designed to make you appreciate how a relational query engine
processes a large data instance. You will be asked to play with SQL
Server, but the answers you are asked to turn in are rather
trivial. The goal is to get you to try things out, not to
solve a quiz. You are strongly encouraged to try out more than what
you are asked to do. Finally, be aware that these SQL queries will
put a significant load on the SQL Server: don't wait until the last
day to run your queries, or you may find the server swamped
by your colleagues.
Consider the three files depts.txt,
depts_prods.txt, and
prods.txt.
Each text file contains a table with fixed-length fields, with the first
line containing the attribute names. The attribute names and their
lengths are:
depts(DID(20), NAME(40), REGION(30))
depts_prods(DID(20), PID(20))
prods(PID(20), NAME(40), PRICE(20))
All fields are CHAR,
except PRICE, which is a FLOAT (represented in the file using 20 characters).
- Load this data into SQL Server in your own database. Declare
all fields of type CHAR, except for PRICE, which should be
FLOAT. All fields should be NOT NULL. Do not declare UNIQUE or
KEY constraints on any attribute (normally they should be unique, but the
data is not clean and they are not unique). Do not create any
indexes yet. Instruct SQL Server to compute statistics for each
field: go to Tools->Manage Statistics and figure out what you
need to do. A sketch of the table definitions is given below.
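The following is only a minimal sketch (the statistics names are placeholders,
and loading the data itself can be done with BULK INSERT or the import wizard):
CREATE TABLE depts (
  DID    CHAR(20) NOT NULL,
  NAME   CHAR(40) NOT NULL,
  REGION CHAR(30) NOT NULL
)
CREATE TABLE depts_prods (
  DID CHAR(20) NOT NULL,
  PID CHAR(20) NOT NULL
)
CREATE TABLE prods (
  PID   CHAR(20) NOT NULL,
  NAME  CHAR(40) NOT NULL,
  PRICE FLOAT    NOT NULL
)
-- statistics can also be created from T-SQL instead of the GUI,
-- e.g. (repeat for the remaining columns):
CREATE STATISTICS s_depts_did ON depts(DID)
CREATE STATISTICS s_depts_region ON depts(REGION)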
- Write two SQL queries. The first computes, for each region,
how many departments are in that region. (More precisely:
for each region, how many records there are in depts with
that region.) The second computes, for each region, how many
products have been sold by departments in that region (for this
query you only need to join depts with depts_prods;
you don't need to join with prods). Both queries should be
very simple count(*) with GROUP-BY queries: the first without any
joins, the second with one join (a sketch is given at the end of this item).
Turn in the following
information for each of the two queries: the estimated running
time and the actual running time. In addition, indicate what join
method the query optimizer selected for the second query (i.e. you
will say 'nested loop' or 'hash join' or ...).
For this and the following question you need to inspect
the query plan. Click on the corresponding button at the top of
the Query Analyzer, or hit CTRL-L. You also need to inspect the
estimated cost: click on the query plan's root node and read the
'subtree cost' (in seconds). Finally, you need to measure the
actual running time: this is shown by the Query Analyzer at the
bottom right; times under 1 second are shown as 0.
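The two queries described above should look roughly like the following
sketch (this is just one way to write them):
-- (1) number of departments per region, no join
SELECT depts.region, count(*)
FROM depts
GROUP BY depts.region
-- (2) number of products sold per region, one join
SELECT depts.region, count(*)
FROM depts, depts_prods
WHERE depts.did = depts_prods.did
GROUP BY depts.region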
- Consider the following query, which computes, for
each region, the average price of
all products sold in that region:
SELECT depts.region, avg(prods.price)
FROM prods, depts_prods, depts
WHERE depts.did = depts_prods.did and
depts_prods.pid = prods.pid
GROUP BY depts.region
Turn in the following: the query's answer, the estimated and
actual running times, and the join methods selected by the query
optimizer for the two joins.
- Inspect the query plan at point (c) again: there
are two aggregate operators instead of one! We will refer
to them as the 'left' and the 'right' aggregate operator. Indicate
which of the two operators is redundant, i.e. could be removed
from the query plan and the query would still be correct (you have
to say 'the left aggregate operator is redundant' or 'the right
aggregate operator is redundant'). Next, explain why the query
optimizer introduced that redundant aggregate operator. Finally,
give the algebraic equation from the class notes that justifies
the introduction of this redundant aggregate operator.
- Consider the following two join
queries:
SELECT depts.region, depts_prods.did
FROM depts, depts_prods
WHERE depts.did = depts_prods.did
and
SELECT count(*)
FROM depts, depts_prods
WHERE depts.did = depts_prods.did
Report the estimated and actual running times for both
queries. The first query may take significantly longer than the
second: abort it if it runs over 3-4 minutes (it should terminate
before that, however). Explain briefly why it takes so much
longer, even though the query plans are almost identical. Report
the estimated number of tuples for the first query, and the actual
number of tuples it returns (you can find out the
latter even if you have to abort it).
At this point you may
want to play with additional queries, check their plans and their
running times.
- Now create four indexes, on the DID and PID attributes.
Check the Books Online pages for the syntax of the CREATE INDEX command.
Create clustered indexes on each of the tables depts and
prods, and unclustered indexes on depts_prods. Remember
that neither DID nor PID is a unique attribute in any of the tables. Answer
questions (c) and (e) again, on the database with indexes. Indicate where
you see significant changes. A sketch of the index definitions is given
after this item.
You may want to, but are not required to, experiment with
different types of indexes (e.g. clustered/unclustered) and check
their effect on query execution time.
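A minimal sketch of the four CREATE INDEX statements (the index names
are placeholders; check Books Online for the exact options):
-- clustered, non-unique indexes on depts and prods
CREATE CLUSTERED INDEX idx_depts_did ON depts(DID)
CREATE CLUSTERED INDEX idx_prods_pid ON prods(PID)
-- unclustered (nonclustered) indexes on depts_prods
CREATE NONCLUSTERED INDEX idx_dp_did ON depts_prods(DID)
CREATE NONCLUSTERED INDEX idx_dp_pid ON depts_prods(PID)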
- [30 points] Consider two tables, R(A,B) and
S(C,D), where C is a key in S, and the selection-join operation
σ_{A=v}(R) ⋈_{B=C} S. Assume that a
record is as large as a block and that
T(R) = B(R) = B(S) = 10000. There is a clustered index on S.C, and we
need four disk accesses to look up a
value in that index (including the disk access to get the record).
We consider the following methods for the join:
- block nested loop join
- partitioned hash-join
- index join
Compute the I/O cost of each join method in each of the
four cases below, and indicate the best join method for each case.
(For reference, the generic cost formulas are sketched after the table.)
         | V(R,A) = 1 | V(R,A) = 10000
M = 101  |            |
M = 5002 |            |
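For reference only, a sketch of the standard textbook cost formulas for
joining R' = σ_{A=v}(R) with S: here T(R') ≈ T(R)/V(R,A), B(R') = T(R')
(one record per block), the cost of computing R' by scanning R (B(R) I/Os,
since there is no index on R.A) must be added, and the exact conventions
(e.g. whether intermediate results are written back to disk) may differ
from the class notes:
\[
\begin{aligned}
\text{block nested loop join:}\quad & B(R') + \Bigl\lceil \frac{B(R')}{M-1} \Bigr\rceil \cdot B(S)\\
\text{partitioned hash join:}\quad & 3\,\bigl(B(R') + B(S)\bigr) \quad \text{(two-pass case; } B(R') + B(S) \text{ if one input fits in memory)}\\
\text{index join (index on } S.C\text{):}\quad & B(R') + 4 \cdot T(R')
\end{aligned}
\]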
- [25 points] After a system crash, the undo log (with non-quiescent
checkpointing) contains the following data:
![](hw4.ht1.gif)
a. What are the correct values of the three <START CKPT ????> records?
You have to provide three correct values, one for each "????".
b. Assuming that the three <START CKPT ????> records are correctly stored
in the log, according to your answer to (a), show which elements are
recovered by the undo recovery manager and compute their values after
recovery.
c. Indicate what fragment of the log the recovery manager needs to read.
- [5 points] Please answer the following questions.
- How long did it take you to complete this assignment?
- What did you like the best about this assignment?
- What did you like the least about this assignment?
- What helped you learn the best in this assignment?
- What distracted from your learning in this assignment?