CSEP 544 Homework 3

Objectives:

To gain experience with XQuery, constraints, triggers, query execution and query optimization.

Number of points:

150

Due date:

Monday, May 17, 2004 5:45 pm

Turn-in:

Turn in a plain text file called solution.txt and query optimization diagrams in a PNG called optimization.png. A PDF combining both is also fine. The turn-in web form also allows you to turn in a zip file, which can contain other files if necessary. Alternatively, you may hand in a printout at the beginning of lecture.

Homework Problems:

1. Writing XML Queries [50 points]

For this part of the assignment, you will be writing XQueries and feed them into Galax. The two data instances are Mondial and Reed. Download mondial-3.0.xml and reed.xml (the .dtd files give you an idea of the structure, but is not necessary for the assignment). Write XQueries to answer the following English queries:

For each instructor, list all the courses he/she teaches in the format of [subject] [course number]: [title]. (e.g. <course> ANTH 211 Introduction to Anthropology </course>)
Find all the "over crowded provinces" in China, where "over crowded provinces" are defined to be the ones whose population density is greater than that of China itself.

For this question you need to run galax: a brief tutorial on how to do this is Here. Alternatively, you can easily install it yourself on any machine you want, from here (get version 0.3.0 for Win32).

2. Constraints, Assertions, Triggers [32 points]

Consider the following relational schema. An employee can work in more than one department; the pct time field of the Works relation shows the percentage of time that a given employee works in a given department.

Emp(eid: integer, ename: string, age: integer, salary: real) Works(eid: integer, did: integer, pct time: integer) Dept(did: integer, budget: real, managerid: integer)
Write integrity constraints (domain, key, foreign key, or CHECK constraints; or assertions) or triggers to ensure each of the following requirements, considered independently.

Employees must make a minimum salary of $1000.
Every manager must be also be an employee.
A manager must always have a higher salary than any employee that he or she manages.
Whenever an employee is given a raise, the manager’s salary must be increased to be at least as much.
Whenever an employee is given a raise, the manager’s salary must be increased to be at least as much. Further, whenever an employee is given a raise, the department’s budget must be increased to be greater than the sum of salaries of all employees in the department.

3. Query Execution/Optimization [48 points]

Consider the following scenario:

Dept(did: integer, projid: integer, budget: real, status: char(10)) Proj(projid: integer, code: integer, report: varchar)
Assume that each Dept record is 40 bytes long and each Proj record is 2000 bytes long on average. There are 5000 tuples in Dept (note that did is not a key by itself) and 1000 tuples in Proj. Each department, identified by did, has 10 projects on average. The file system supports 4000 byte pages, and 12 buffer pages are available. All following questions are based on this information. You can assume uniform distribution of values. State any additional assumptions. The cost metric to use is the number of page I/Os.

Consider the following query:

SELECT D.did, COUNT(*) FROM Dept D, Proj P WHERE D.projid=P.projid AND D.budget>99000 GROUP BY D.did
Assume that department budgets are uniformly distributed in the range 0 to 100,000.

Draw the most naive query execution tree.
Assuming no prior indices, what is the best join algorithm to use? Be sure to show that there is enough buffer space to run the algorithm you chose.
Using the algorithm you chose above, compute the cost of executing this join.
Now draw the best query execution plan you can think of for this query.
How does this change the cost of the join algorithm you chose above?
Now suppose that there is a clustered B+ tree index on {D.did, D.budget} and a hash index on {P.projid}. Describe the plan with the lowest estimated cost.

4. Query Optimization [20 points]

Consider the following relational schema, that includes two relations:

Movie(title, director, year, genre, length) Revenues(year, average)

Consider the following query, which asks for revenue on comedies per year:

SELECT year, Count(title) * Revenues.average AS YearTotal FROM Revenues, Movie WHERE Movie.year=Revenues.year AND genre=comedy GROUP BY year
Can you propose a transformations on the above query that is likely to reduce the cost of evaluating the query?
Suppose the query was modified to also return the number of comedies made per year, i.e.,

SELECT year, Count(title) * Revenues.average AS YearTotal, Count(*) FROM Revenues, Movie WHERE Movie.year=Revenues.year AND genre=comedy GROUP BY year
Would the above optimizations you proposed above still be valid? Why or why not?