Solution to Homework 4

 

Michael Gubanov, mgubanov@cs

 

 

1.     Query Execution

 

a.                       

                                                           i.      Query 1

select region, count(d.did) as DCount

from depts as d

group by region

 

Estimated Running Time: 1.77 seconds

Actual Running Time: less than a second[1]

 

                                                        ii.      Query 2

select region, count(d.did) as DCount

from depts as d

join depts_prods as dp on d.did = dp.did

group by region

 

Estimated Running Time: 16.7 seconds

Actual Running Time: 4 seconds

Join Method: Hash Join

 

b.                       

                                                           i.      Query 3

select d.region, avg(p.price) as AvgPrice

   from prods as p 
 join depts_prods as dp on dp.pid = p.pid
  join depts as d on d.did = dp.did

group by d.region

 

Output:

region AvgPrice

------------------------------ --------------------------

Central 599.92060525118245

Mountain 498.32875574601167

South 400.95740872043115

NorthWest 200.12154503004854

East 700.28072857480288

Canada 799.53329222268451

SouthWest 298.5248630158319

 

(7 row(s) affected)

 
  Estimated Running Time: 41.4 seconds
 Actual Running Time: 7 seconds
  Join Method: Both Hash Join

c.                       

The right aggregate operator is redundant. Query optimizer introduces it to significantly reduce the number of tuples to join with dept table. Should that optimization be omitted the output wont have changed. However, performance would have decreased dramatically.

 

gA, agg(D)(R(A,B) B=C S(C,D)) =

gA, agg(D)(R(A,B) B=C (gA, agg(D)S(C,D)))

 

 

d.                       

                                                           i.      Query 1

  select d.region, dp.did
   from depts as d
   join depts_prods as dp on d.did = dp.did

 

Estimated Running Time: 21.7 seconds

Actual Running Time: 20 seconds

Estimated Tuples: 2,643,293

Actual Tuples: 3,050,663

 

                                                        ii.      Query 2

   select count(*)
   from depts as d
   join depts_prods as dp on d.did = dp.did
 

 

Estimated Running Time: 16.1 seconds

Actual Running Time: 2 seconds

 

The second query performs better because of the aggregation on depts_prods being used by optimizer before join. This aggregation is useful to reduce the number of tuples for a join which significantly affects performance. A temporary table contains the number of tuples in depts_prods for each unique did which is later used to join with depts and to perform a final count.

 

This optimization could not be applied to the first query since it outputs d.region, dp.did for each tuple and therefore needs each tuple of both tables to be joined first before the output could be computed. That results in a large join since both tables are large.

 

However, should the first query contain distinct or any other operation which would allow optimizer to reduce a number of rows for a join it would result in a similar optimization technique to be applied.

 

e.                       

                                                           i.      Query 1

select d.region, avg(p.price) as AvgPrice

   from prods as p 
 join depts_prods as dp on dp.pid = p.pid
  join depts as d on d.did = dp.did

group by d.region

 

Output:

region AvgPrice

------------------------------ --------------------------

Central 599.92060525118245

Mountain 498.32875574601167

South 400.95740872043115

NorthWest 200.12154503004854

East 700.28072857480288

Canada 799.53329222268451

SouthWest 298.5248630158319

 

(7 row(s) affected)

 
  Estimated Running Time: 40.9 seconds
 Actual Running Time: 7-30 seconds
  Join Method: Both Hash Join
 

                                                        ii.      Query 2

  select d.region, dp.did
   from depts as d
   join depts_prods as dp on d.did = dp.did

 

Estimated Running Time: 19.7 seconds

Actual Running Time: 17 seconds

Estimated Tuples: 2,567,569

Actual Tuples: 3,050,663

 

                                                    iii.      Query 3

   select count(*)
   from depts as d
  join depts_prods as dp on d.did = dp.did

 

Estimated Running Time: 8.36 seconds

Actual Running Time: less than a second

 

Estimated cost for Query 3 is almost a half less as well as actual execution time.

 

2.     Query Optimization

a.     Block Nested Loop join

                                                           i.      V(R,A) = 10000

Cost = B(R)+(B(s(R))*B(S)) = 10,000 + 10,000 = 20,000

 

                                                        ii.      V(R,A) = 1

a.     M = 5002

Cost = B(R)+(B(s(R))*B(S)/(M-2)) =

= 10,000+((10,000*10,000)/(50022)) = 30,000

b.     M = 101

Cost = B(R)+(B(s(R))*B(S)/(M-2)) =

= 10,000+((10,000*10,000)/(1012))= 1,020,101

 

b.     Partitioned Hash join

                                                           i.      V(R,A) = 1

Cost = B(R)+(2*B(s(R))+3*B(S)) = 10,000+(2*10,000+3*10,000)=60,000

                                                        ii.      V(R,A) = 10,000

Cost = B(R)+(2*B(s(R))+3*B(S)) = 10,000+(2*1+3*10,000)=40,002

 

c.     Index join

                                                           i.      V(R,A) = 1

Cost = B(R)+4*T(s(R))*B(S)/V(S,C)=10,000+(10,000*4*10,000/10,000)=50,000

                                                        ii.      V(R,A) = 10,000

Cost = B(R)+4*T(s(R))*B(S)/V(S,C)=10,000+(1*4*10,000/10,000)=10,004

 

d.     Recommendations

 

V(R,A) = 1

 

V(R,A) = 10,000

M=101

Index join

Index join

M=5002

Block Nested Loop join

Index join

3.     Recovery

a.      

                                                           i.      <START CKPT T1>

                                                        ii.      <START CKPT (T2, T3)>

                                                    iii.      <START CKPT (T4, T5)>

 

b.     X4 = 6, X5 = 7

 

c.     Recovery Engine needs to read up to <START CKPT (T2, T3)>

 

 



[1] Estimated and actual values may differ depending on SQL Server version, machine being used, SQL Server current load and statistics