Relation Query Optimization

UNIVERSITY OF WASHINGTON

CSE 594: DATABASE MANAGEMENT SYSTEMS

Homework 4: Relational Query Optimization and DB creation
Due: August 5, 1998, at beginning of class. No late submissions accepted.

This homework has two parts. In part one, you will use SQL Server to create a relational database corresponding to the E/R diagram you drew for Homework 1. In part two, you will experiment with a tool that allows you to visualize the process of relational query optimization.

1. SQL SERVER Database: Create relations to model the data described in your E/R diagram from Homework 1. Use SQL Server. There are two items you will need now that you did not necessarily have in the E/R diagram: Domain information and functional dependencies. As much information as possible should be transferred from the E/R diagram to your SQL Server database. Where something cannot be transferred, you'll need to make note of that that and explain why.

NOTES: The goal here is NOT to have you become SQL Server experts, but rather to give you practice applying the logical and physical design techniques we've studied. Do not feel obligated to use every bell and whistle you can find in the DBMS and do not feel obligated to go out and buy a book on the DBMS. If there is something you're trying to do that's giving you fits, please ask me and we'll do something about it. You can also check out the online documentation (SQL Server Books Online, which is an option at the same level as the Enterprise Manager). You will be populating (entering tuples into) and querying the database as part of a later assignment, but if you want to populate it now, that's fine too.

Some guidelines you must follow:

Every relation should have a primary key.
You should know the FDs and normal form for each relation.
Unique constraints and foreign key constraints must be specified where needed.
You must use at least one "check" constraint somewhere in the database.
You must use at least one "trigger" somewhere in your database to enforce a multi-table integrity constraint.
You must create at least two indexes in the database in addition to the primary key indexes that SQL Server creates for you automatically.

What to hand in:

For each relation, the create table statement (or its equivalent). Be sure to include primary and foreign key and check constraints. Also write down any functional dependencies that hold on the relation and the highest normal form (considering just 1NF, 2NF, 3NF, and BCNF) satisfied by the relation.
For each index, the create index statement (or its equivalent). Include information about whether it is clustered or non-clustered and whether any other index parameters were used (fill factor, etc.). Also for each index, including the primary key indexes, explain why you created the index in the first place and why you chose the characteristics you did (fill factor, clustering, etc.).
For each trigger, the create trigger statement (or its equivalent) and any explanation of the trigger you feel is necessary.
For anything that you could not model in SQL Server, describe it and explain why it couldn't be modeled in the DBMS.

2. Query Processing and Optimization:

Introduction

In this question, you will carry out a number of exercises involving the optimization of relational queries using an optimizer visualization tool. You will not have to write any code, and your answers should be turned in as a text file (or handwritten).

Available Documentation

The chapter on optimization and the lecture transparencies should provide all the background information that you need on query optimization. The tool itself has an on-line help feature, and additional information is included here as part of exercise descriptions.

Setting Up Minibaseview

To use the Minibase query optimization viewer, you must have the same three lines in your .cshrc that you added for the previous homework assignment. If you did that, you're ready to go. The program to run is called minibaseview.

Getting Started

These exercises are based on two catalogs, which are located in files catalogA and catalogB. Each exercise says which catalog(s) to use. The first step of each exercise should be to open the appropriate catalog, which you do by choosing Open from the Catalogs menu of the visualization tool. Then you enter your query by choosing Enter SQL from the Query menu of the visualization tool.

If you want copies of the catalog files (not necessary; we can all share the same one), you can copy them. The files to copy, if you so desire, are:
catalog-format.txt: A description of the format of the catalogs.
catalogA, catalogB: Catalog files.
All three files are in the directory /projects/instr/cse594/mini_hwk/assign/Optimize. Note that even if you copy these, minibaseview will use the shared ones by default.

Catalogs A and B are based on the same schema. There are three relations, Emp, Dpt, and Job, which represent employees, departments, and job titles of some organization. Each of these relations has an integer field which is used as the primary key for the relation---empid for the Emp relation, dno for the Dpt relation, and jno for the Job relation. The Emp relation has foreign keys referring to Dpt and Job, representing each employee's department and job title, and the Dpt relation has a foreign key referring to Emp, representing the department's manager.

The Exercises

This question gets you to explore ``Level 0'' of the optimizer visualization display. This level shows all the access methods for each relation involved in the query. If you double-click the LEFT mouse button on one of these access method boxes, you can also see a list of the relation's attributes with the range of values each attribute takes on.
Open Catalog A and run the tool on the following query.
select * from dpt
where mgrid = 20223

Answer the following questions, looking at level 0 access methods.
1. What are the attributes of the Dpt relation?
2. How many tuples are in the Dpt relation?
3. How many different values of the mgrid attribute are there in the Dpt relation?
4. What is the largest manager rid (i.e., mgrid) held by a manager?
Level 1 of the optimizer visualization tool shows plans that consist of walking a single relation using one of the access methods defined for that relation. If you double-click the LEFT mouse button on a level 1 plan (and higher levels too), its box expands to show you more detailed information about the plan and the estimates made about it.
Use the same catalog, Catalog A, and the following query:
select * from emp
where sal = 58000

Answer the following questions, looking at level 1 plans.
1. What is the estimated total cost of running the (estimated) best plan?
2. What is the sorting cost associated with this plan? What would the sort cost be for any plan for this query?
3. What is the estimated result cardinality for this plan?
4. Given your knowledge about the number of different sal values, is the estimate the optimizer makes a reasonable one? Explain very briefly.
If you double-click the RIGHT mouse button on a Level 1 node (similarly, for higher level nodes), the tool opens a separate window and displays the entire plan denoted by that node.

Use the same catalog, Catalog A, and the following query:
select * from emp
where ename = "Frank"

Once again, open Catalog A and run this query. Answer the following questions, looking at both Level 1 plans and Level 0 access methods.
1. Find the (estimated) best plan. Which access method does this plan use?
2. In what order would the tuples be returned by this plan? Why?
3. Disable the access method used by the best plan by double-clicking the RIGHT mouse button on its Level 0 box. (The box will then be displayed in black, on a color monitor.) Which access method does the optimizer consider to be the best now?
4. Compare the new best plan with the other plans still shown. What is it about the way the best plan accesses tuples that makes it substantially cheaper than the others?
Open Catalog A, and run the tool on the following query.
select * from emp
where empid < 25000

Answer the following questions.
1. How many Emp tuples does the optimizer think there are?
2. How many employees whose id number is less than 25000 does the optimizer think there are?
3. How does the optimizer arrive at its estimate of the number of employees whose id number is less than 25000? That is, what calculations does it perform, and where does the supporting data come from?
Open Catalog A, and run the tool on this query.
select ename, dname from emp, dpt
where emp.dno = dpt.dno and emp.jno < 75

Answer the following questions, looking at the best plan.
1. What kind of join is the best plan?
2. How many tuples does the optimizer estimate it will retrieve from the Employee relation? At what level is this information shown?
3. What calculation does the optimizer do to arrive at the reduction factor estimate for this node? Where does the underlying information come from?
4. How much memory does the optimizer estimate it will take to retrieve the tuples needed from the Employee relation?
5. How much memory does the optimizer estimate it will it take to run the whole query?
Levels 2 and higher of the optimizer tool show the result of joining a lower-level plan with an access method. If you double-click the RIGHT mouse button on a Level 2 plan, the tool opens a window showing the lower-level plans and access methods that this plan is made up of.
Open Catalog A, and run the following query.
select * from emp, job
where emp.jno = job.jno

Answer the following questions looking at Level 2 plans. (NOTE: be sure not to confuse LEVEL 2 with the line NUMBERED 2. The NUMBER of the line of plans you are to look at is 3, the top line of the display.)
1. What kind of join is the best plan?
2. What is the estimated total cost of running this plan?
3. How many tuples does the optimizer estimate will result from this query?
4. How big are the tuples that result from this query? (i.e., how many bytes does each tuple occupy?)
5. How many pages will be filled by the result tuples?
6. What access methods does the best plan use?
Open Catalog B, then answer the following questions referring to this query:
select ename, dname from emp, dpt, job
where emp.dno = dpt.dno and emp.jno = job.jno
and dpt.dname = job.jname
1. Run this query and describe the best plan: list the joins and access methods it uses, and the order in which the relations are joined.
2. Modify the query by replacing the last `=' (comparing names) with `<', and optimize it again. Describe the best plan.
3. Explain why there is such a big difference in the plans produced for the query and for its (slightly) modified version.
From the Switches / Toggle Joins menu of the optimizer visualization tool, you can disable and re-enable the different kinds of joins that the optimizer will consider.
Also, if you double-click the RIGHT mouse button on the box that labels any row of the tool's display, you will turn off (and back on) the pruning of plans that the optimizer estimates are not the `least expensive'. (Remember that for each subset of relations, for each ordering of generated tuples, only the least cost plan is retained. The toggling referred to here controls this pruning; it should be used with care since the number of retained plans can increase rapidly without pruning!)
Run the following variant of the query from the previous exercise, using Catalog B as before.
select ename, dname from emp, dpt
where emp.dno = dpt.dno and emp.jno < 100

Now suppose you knew that the run-time environment for this query was limited to 10 pages of memory. Find the best (left-deep) plan that fits this requirement and describe its join method, access methods, and total cost.