From: L. Yan (lanti@u.washington.edu)
Date: Tue May 18 2004 - 23:03:01 PDT
Query Evaluation Techniques for Large Databases
- Li

This is a really long survey, given the abundance of publications on
query evaluation techniques. It is also a 1993 paper, written in the
context of ten years ago, so it would be interesting to compare it with
the latest developments. Efficient query execution lies at the heart of
database management systems: a good data model will not gain wide
acceptance if query execution speed is intolerable. Perhaps that is one
reason OODBs have not been as successful as many advocates foresaw. The
DBMS exports a logical operator interface, but performance has to be
measured with physical operators in mind; the two can be thought of as
specification and implementation, the what and the how.

Since most queries ask for some sort of matching, the two general
approaches, sorting and hashing, are the natural choices. Sorting is in
general more expensive, but in some cases the result has the desirable
property of an interesting ordering, which might help the execution of a
later subplan, or the ordering might be imposed by the user with an
ORDER BY clause in SQL. Hashing normally executes faster, since each
match can be found in O(1) expected time, but it can be hard to find a
good hash function and to keep the records close to uniformly
distributed across the hash table, which may introduce a dependency on
certain statistics of a table. Another concern is performance
degradation under constant modification of the data. Hash collisions are
a further problem: two different records may be placed into the same
bucket, so keys must still be compared within a bucket to rule out false
matches. It appears that sorting works by introducing an ordering of the
records, while hashing does the job by partitioning them. Sorting
reaches a finer granularity, since only records with equal values of the
comparison attribute are grouped together, and an order is also
maintained in the output. Thus sorting is inherently more expensive than
hashing, and hashing should be preferred unless the output possesses an
interesting ordering that compensates for the price paid.
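
To make the contrast concrete, here is a minimal sketch of my own (not
from the paper) of both strategies applied to the same grouping task;
the records and names are made up:

    from itertools import groupby
    from operator import itemgetter

    records = [("b", 2), ("a", 1), ("b", 3), ("a", 4), ("c", 5)]

    # Hash-based: one pass, O(1) expected work per record, unordered output.
    def group_by_hash(recs):
        groups = {}
        for key, val in recs:
            groups.setdefault(key, []).append(val)  # one bucket per distinct key
        return groups

    # Sort-based: O(n log n), but the output carries an interesting ordering.
    def group_by_sort(recs):
        ordered = sorted(recs, key=itemgetter(0))   # the ordering is introduced here
        return {k: [v for _, v in g] for k, g in groupby(ordered, itemgetter(0))}

    print(group_by_hash(records))  # {'b': [2, 3], 'a': [1, 4], 'c': [5]}
    print(group_by_sort(records))  # {'a': [1, 4], 'b': [2, 3], 'c': [5]}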

The cost measure in database algorithm analysis is usually taken to be
the number of disk I/O operations required, in contrast to the RAM model
of standard algorithms textbooks, which assumes all operations can be
done in memory. Since disk I/O is expensive, several techniques have
been introduced to reduce this number.
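
As a back-of-the-envelope illustration (my own, with made-up numbers),
counting the I/Os of an external merge sort is just a matter of counting
passes over the data:

    import math

    def external_sort_ios(n_pages, buf_pages):
        """Pages of I/O to sort n_pages with buf_pages of buffer memory."""
        runs = math.ceil(n_pages / buf_pages)  # pass 0 creates the initial runs
        passes = 1
        while runs > 1:                        # each merge pass fans in B-1 runs
            runs = math.ceil(runs / (buf_pages - 1))
            passes += 1
        return 2 * n_pages * passes            # every pass reads and writes each page

    print(external_sort_ios(900_000, 1_000))   # 3,600,000 I/Os: two passes suffice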

Exploiting index structures can significantly reduce the number of data
pages that need to be fetched while still carrying enough information to
perform the attribute comparisons. However, the cost of creating and
maintaining indices must be taken into account: an algorithm that
requires repeatedly building indices on dynamic data will probably not
perform as well as it appears to when the cost of index construction is
ignored.
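
A toy sketch of the idea (mine; the data is invented): a sorted index on
the key lets a lookup touch only the pages that can contain matches,
instead of scanning every page:

    import bisect

    # (key, page_id) entries kept sorted on key -- a stand-in for a B-tree leaf level
    index = [(5, 0), (12, 1), (12, 4), (23, 2), (40, 3)]
    keys = [k for k, _ in index]

    def pages_for_key(key):
        lo = bisect.bisect_left(keys, key)    # first entry with this key
        hi = bisect.bisect_right(keys, key)   # one past the last such entry
        return [page for _, page in index[lo:hi]]

    print(pages_for_key(12))  # [1, 4]: fetch two pages instead of all five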

Read-ahead is a heuristic method; it works particularly well when data
are stored contiguously and a large amount of data needs to be fetched,
since it significantly reduces the cost of disk arm positioning and
rotational delay. It might not be worth the effort, though, if only a
few tuples are needed from each disk read.
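
A rough model of why this pays off (my own assumed timings): the
positioning delay is amortized over many pages when reads are large and
contiguous:

    SEEK_MS, XFER_MS = 10.0, 0.1   # assumed per-positioning and per-page transfer times

    def scan_time_ms(n_pages, readahead):
        """Time to read n_pages of contiguous data, readahead pages per request."""
        requests = -(-n_pages // readahead)        # ceiling division
        return requests * SEEK_MS + n_pages * XFER_MS

    print(scan_time_ms(10_000, 1))    # 101000.0 ms: one positioning delay per page
    print(scan_time_ms(10_000, 64))   # 2570.0 ms: the delay is paid only 157 times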

In aggregation and binary joins, sorting and hashing are preferred to
the naive nested loop because they produce a smaller volume of
intermediate results. But if one operand relation is so small that the
build relation fits in main memory, a nested loop can do just as well,
with the additional merit of being the most versatile join method.
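
A small sketch of my own of that special case, with invented relations:
when the inner fits in memory, one pass over the outer input suffices,
and the predicate can be arbitrary:

    employees = [(1, "alice"), (2, "bob"), (3, "carol")]   # outer (assume large, on disk)
    depts     = [(1, "eng"), (3, "sales")]                 # inner (small, in memory)

    def nested_loop_join(outer, inner):
        out = []
        for o_key, o_name in outer:       # single pass over the outer input
            for i_key, i_dept in inner:   # in-memory inner: no repeated I/O
                if o_key == i_key:        # any predicate works, not just equality
                    out.append((o_key, o_name, i_dept))
        return out

    print(nested_loop_join(employees, depts))  # [(1, 'alice', 'eng'), (3, 'carol', 'sales')]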

Section 8 contains an exciting idea about maintaining a decision tree of
possible query plans. Although such a structure is expensive to
maintain, I hope this will be the right direction in the near future,
for two reasons: some information is simply not available until run
time, when certain intermediate results have materialized, and the cost
estimator could be a lot less complex, since a rough estimate suffices
to eliminate the obviously unpromising plans.
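
As I understand the idea, a run-time choice might look roughly like this
sketch (mine, not the paper's; the threshold and plan names are
invented):

    MEMORY_PAGES = 100   # assumed buffer budget

    def choose_join(inner_pages):
        """Pick the physical join only once the inner's actual size is known."""
        if inner_pages <= MEMORY_PAGES:
            return "in-memory nested loop"   # tiny inner: no partitioning needed
        return "hybrid hash join"            # otherwise partition to disk

    # The operand's size is observed at run time instead of being estimated.
    print(choose_join(37))     # in-memory nested loop
    print(choose_join(5000))   # hybrid hash join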

Section 11.1 talks about bit vectors, the same idea an OS uses for
managing free disk pages. A bit vector looks similar to an index in the
sense that it serves as a locator for bulk data: after processing the
bit vector or the index, only a small amount of data needs to be
fetched.
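
A minimal bit-vector filtering sketch of my own (hash function and sizes
invented): bits set from one join input discard most non-matching tuples
of the other before anything is fetched:

    BITS = 64

    def build_filter(keys):
        bv = 0
        for k in keys:
            bv |= 1 << (hash(k) % BITS)      # set one bit per build key
        return bv

    def maybe_matches(bv, key):
        # A zero bit means "definitely no match"; a one bit means "maybe".
        return (bv >> (hash(key) % BITS)) & 1

    bv = build_filter(["eng", "sales"])
    probe = ["eng", "hr", "legal", "sales"]
    print([k for k in probe if maybe_matches(bv, k)])  # keeps 'eng' and 'sales';
                                                       # false positives are possible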

The nested relations in section 12.4 are a result of OO databases. It is
currently expensive to maintain such a complicated structure, and
queries over it are not executed as efficiently as their relational
counterparts; but they will have to be made as efficient if OODBs are to
take off someday.