Solution to Homework 4

 

Michael Gubanov, mgubanov@cs

 

 

1.     Query Execution

 

a.                       

                                                           i.      Query 1

                           select region, count(d.did) as DCount

                                  from depts as d

group by region

 

Estimated Running Time:    1.77 seconds

Actual Running Time:              less than a second[1]

 

                                                        ii.      Query 2

                           select region, count(d.did) as DCount

                                  from depts as d

        join depts_prods as dp on d.did = dp.did

                           group by region

 

Estimated Running Time:    16.7 seconds

Actual Running Time:              4 seconds

Join Method:               Hash Join

 

b.                       

                                                           i.      Query 3

select d.region, avg(p.price) as AvgPrice

                         from prods as p 
                                  join depts_prods as dp on dp.pid =  p.pid
                                  join depts as d on d.did = dp.did

        group by d.region

 

  Output:

               region                          AvgPrice                                                    

               ------------------------------ --------------------------

               Central                        599.92060525118245

               Mountain                       498.32875574601167

               South                          400.95740872043115

               NorthWest                      200.12154503004854

               East                           700.28072857480288

               Canada                         799.53329222268451

               SouthWest                      298.5248630158319

 

               (7 row(s) affected)

 
              Estimated Running Time:       41.4 seconds
     Actual Running Time: 7 seconds
              Join Method:         Both Hash Join

c.                       

The right aggregate operator is redundant.  Query optimizer introduces it to significantly reduce the number of tuples to join with dept table. Should that optimization be omitted the output wont have changed. However, performance would have decreased dramatically.

 

gA, agg(D)(R(A,B) B=C S(C,D)) =

gA, agg(D)(R(A,B) B=C (gA, agg(D)S(C,D)))

 

 

d.                       

                                                           i.      Query 1

                          select d.region, dp.did
                                   from depts as d
                                   join depts_prods as dp on d.did =  dp.did

 

                 Estimated Running Time:    21.7 seconds

                 Actual Running Time:              20 seconds

                 Estimated Tuples:          2,643,293

                 Actual Tuples:                    3,050,663

 

                                                        ii.      Query 2

                select count(*)
                                   from depts as d
                                   join depts_prods as dp on d.did =  dp.did
 

 

                    Estimated Running Time:    16.1 seconds

                    Actual Running Time:              2 seconds

 

The second query performs better because of the aggregation on depts_prods being used by optimizer before join. This aggregation is useful to reduce the number of tuples for a join which significantly affects performance. A temporary table contains the number of tuples in depts_prods for each unique did which is later used to join with depts and to perform a final count.

 

This optimization could not be applied to the first query since it outputs d.region, dp.did for each tuple and therefore needs each tuple of both tables to be joined first before the output could be computed. That results in a large join since both tables are large.

 

However, should the first query contain distinct or any other operation which would allow optimizer to reduce a number of rows for a join it would result in a similar optimization technique to be applied.

 

e.                       

                                                           i.      Query 1

select d.region, avg(p.price) as AvgPrice

                         from prods as p 
                                  join depts_prods as dp on dp.pid = p.pid
                                  join depts as d on d.did = dp.did

        group by d.region

 

  Output:

               region                          AvgPrice                                                    

               ------------------------------ --------------------------

               Central                        599.92060525118245

               Mountain                       498.32875574601167

               South                          400.95740872043115

               NorthWest                      200.12154503004854

               East                           700.28072857480288

               Canada                         799.53329222268451

               SouthWest                      298.5248630158319

 

               (7 row(s) affected)

 
              Estimated Running Time:       40.9 seconds
     Actual Running Time: 7-30 seconds
              Join Method:         Both Hash Join
 

                                                        ii.      Query 2

                          select d.region, dp.did
                                   from depts as d
                                   join depts_prods as dp on d.did =  dp.did

 

                 Estimated Running Time:    19.7 seconds

                 Actual Running Time:              17 seconds

                 Estimated Tuples:          2,567,569

                 Actual Tuples:                    3,050,663

 

                                                    iii.      Query 3

                select count(*)
                                   from depts as d
                                  join depts_prods as dp on d.did =  dp.did

 

                    Estimated Running Time:    8.36 seconds

                    Actual Running Time:              less than a second

 

Estimated cost for Query 3 is almost a half less as well as actual execution time.

 

2.     Query Optimization

a.     Block Nested Loop join

                                                           i.      V(R,A) = 10000

Cost = B(R)+(B(s(R))*B(S)) = 10,000 + 10,000 = 20,000

 

                                                        ii.      V(R,A) = 1

a.     M = 5002

Cost = B(R)+(B(s(R))*B(S)/(M-2)) =

  = 10,000+((10,000*10,000)/(5002–2)) = 30,000

b.     M = 101

Cost = B(R)+(B(s(R))*B(S)/(M-2)) =

  = 10,000+((10,000*10,000)/(101–2))= 1,020,101

 

b.     Partitioned Hash join

                                                           i.      V(R,A) = 1

Cost = B(R)+(2*B(s(R))+3*B(S)) = 10,000+(2*10,000+3*10,000)=60,000

                                                        ii.      V(R,A) = 10,000

Cost = B(R)+(2*B(s(R))+3*B(S)) = 10,000+(2*1+3*10,000)=40,002

 

c.     Index join

                                                           i.      V(R,A) = 1

Cost = B(R)+4*T(s(R))*B(S)/V(S,C)=10,000+(10,000*4*10,000/10,000)=50,000

                                                        ii.      V(R,A) = 10,000

Cost = B(R)+4*T(s(R))*B(S)/V(S,C)=10,000+(1*4*10,000/10,000)=10,004

 

d.     Recommendations

 

V(R,A) = 1

 

V(R,A) = 10,000

M=101

Index join

Index join

M=5002

Block Nested Loop join

Index join

3.     Recovery

a.      

                                                           i.      <START CKPT T1>

                                                        ii.      <START CKPT (T2, T3)>

                                                    iii.      <START CKPT (T4, T5)>

 

b.     X4 = 6, X5 = 7

 

c.     Recovery Engine needs to read up to <START CKPT (T2, T3)>

 

 



[1] Estimated and actual values may differ depending on SQL Server version, machine being used,  SQL Server current load and statistics