review 7

From: L. Yan (lanti@u.washington.edu)
Date: Tue May 18 2004 - 23:03:01 PDT

    Query Evaluation Techniques for Large Databases

    - Li

    This is a really long, long survey, given the abundance of
    publications on query evaluation techniques. It is also a 1993 paper,
    written in the context of ten years ago, so it would be interesting
    to have a comparison with the latest developments. Efficient query
    execution lies at the heart of database management systems: a good
    data model won't gain wide acceptance if query execution speed is
    intolerable. Maybe that's one reason OODBs have not been as
    successful as many advocates had foreseen. The DBMS exports an
    interface of logical operators, but performance has to be measured
    with the physical operators in mind. The two might be thought of as
    specification and implementation, the what and the how.

    As most queries ask for some sort of matching, the two general
    approaches, sorting and hashing, are the natural choices. Sorting is
    in general more expensive, but in some cases the result has the
    desirable property of an interesting ordering, which might help the
    execution of a later subplan, or the ordering might be imposed by the
    user with an ORDER BY in SQL. Hashing normally executes faster, since
    each match can be found in O(1) time, but it is probably hard to find
    a good hash function and to keep the hash table's records close to
    uniformly distributed; this might introduce a dependency on certain
    statistics of a table. Another concern is performance degradation
    under constant modification of the data. Hash collisions are a
    further problem: two different records placed in the same bucket
    could lead to a false match unless the keys themselves are
    re-compared.
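
    To make the bucket-and-recheck point concrete, here is a minimal
    in-memory hash join sketch (the relations and key extractors are
    hypothetical, not from the paper); the explicit equality test on the
    key is what keeps colliding records from producing false matches.

        from collections import defaultdict

        def hash_join(build, probe, build_key, probe_key):
            # Build a hash table on the (smaller) build input, then
            # probe it with the other input. Keys are re-compared on
            # every probe, so records that merely land in the same
            # bucket never yield a false match.
            table = defaultdict(list)
            for row in build:
                table[hash(build_key(row)) % 1024].append(row)
            for row in probe:
                for cand in table.get(hash(probe_key(row)) % 1024, []):
                    if build_key(cand) == probe_key(row):  # re-check key
                        yield cand, row

        # Hypothetical relations: (emp_id, name) with (emp_id, dept)
        emps  = [(1, "ann"), (2, "bob")]
        depts = [(1, "sales"), (3, "hr")]
        print(list(hash_join(emps, depts, lambda r: r[0], lambda r: r[0])))
        # -> [((1, 'ann'), (1, 'sales'))]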

    It appears that sorting works by introducing an ordering on the
    records, while hashing does the job by partitioning them. Sorting
    reaches a finer granularity, since only records with equal values in
    the compared attribute are grouped together, and an order is also
    maintained in the output. Sorting is thus inherently more expensive
    than hashing, and hashing should be preferred unless the output
    possesses an interesting ordering that compensates for the price
    paid.
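
    As a companion sketch (again with invented inputs), a sort-merge
    join shows what the extra cost of sorting buys: the output comes out
    ordered on the join key, so a later ORDER BY or merge-based operator
    can reuse that ordering for free.

        def sort_merge_join(r, s, key):
            # Sort both inputs, then advance two cursors in step,
            # emitting the cross product of each pair of equal-key
            # groups. The output is ordered on the join key.
            r, s = sorted(r, key=key), sorted(s, key=key)
            i = j = 0
            while i < len(r) and j < len(s):
                if key(r[i]) < key(s[j]):
                    i += 1
                elif key(r[i]) > key(s[j]):
                    j += 1
                else:
                    k, j0 = key(r[i]), j
                    while i < len(r) and key(r[i]) == k:
                        j = j0
                        while j < len(s) and key(s[j]) == k:
                            yield r[i], s[j]
                            j += 1
                        i += 1

        r = [(1, "a"), (1, "b"), (2, "c")]
        s = [(1, "x"), (3, "y")]
        print(list(sort_merge_join(r, s, key=lambda t: t[0])))
        # -> [((1, 'a'), (1, 'x')), ((1, 'b'), (1, 'x'))]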

    The cost measure in database algorithm analysis is usually taken to
    be the number of disk I/O operations required, in contrast to the
    RAM model of the standard algorithms textbook, which assumes all
    operations can be done in memory. Since disk I/O is expensive,
    several techniques have been introduced to reduce this number.
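
    For a sense of scale, here is the standard textbook back-of-the-
    envelope estimate for external merge sort (not a formula taken from
    this paper): every pass reads and writes all N pages, and each merge
    pass fans in B-1 runs.

        from math import ceil, log

        def external_sort_io(n_pages, buf_pages):
            # Pass 0 produces ceil(N/B) sorted runs; each later pass
            # merges B-1 runs at a time until one run remains. Every
            # pass costs one read and one write of all N pages.
            runs = ceil(n_pages / buf_pages)
            merge_passes = 0 if runs <= 1 else ceil(log(runs, buf_pages - 1))
            return 2 * n_pages * (1 + merge_passes)

        # 10,000-page file, 101 buffer pages: 2 passes, 40,000 page I/Os
        print(external_sort_io(10_000, 101))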

    Exploiting index structures significantly reduces the number of data
    pages that need to be fetched, while the index entries still carry
    enough information to perform the attribute comparisons. However,
    the cost of creating and maintaining the indices must be taken into
    account: an algorithm that requires repeatedly building indices over
    dynamic data will probably not perform as well as an analysis that
    ignores the cost of index construction would suggest.
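
    A toy illustration of the page-count argument (the 1,000-page table
    and the dense index are invented): the comparisons run against index
    entries, and only the pages that actually contain matches get
    fetched.

        import bisect

        # Hypothetical table: 1,000 pages, 100 rows per page, keys
        # 0..99,999. The index stores (key, page) pairs in key order.
        def page_of(key):
            return key // 100

        index = [(k, page_of(k)) for k in range(100_000)]

        def pages_for_range(lo, hi):
            # Pages that must be fetched for lo <= key < hi; all key
            # comparisons happen on the index, not on data pages.
            i = bisect.bisect_left(index, (lo, 0))
            j = bisect.bisect_left(index, (hi, 0))
            return sorted({page for _, page in index[i:j]})

        print(pages_for_range(500, 700))   # [5, 6], not all 1,000 pages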

    Read-ahead is a heuristic method, and it works particularly well
    when data are stored contiguously and a large amount of data needs
    to be fetched; it significantly reduces the cost of disk arm
    positioning and rotational delay. But it might not be worth the
    effort if only a few tuples are needed from each disk read.
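
    A minimal sketch of the mechanism (the block-producing thread stands
    in for the disk controller; read_block and the depth parameter are
    invented for illustration): fetching overlaps with processing.

        import queue
        import threading

        def read_ahead(read_block, n_blocks, depth=2):
            # A helper thread keeps up to `depth` blocks fetched in
            # advance, so I/O latency overlaps with processing.
            buf = queue.Queue(maxsize=depth)

            def producer():
                for i in range(n_blocks):
                    buf.put(read_block(i))  # blocks once `depth` queued
                buf.put(None)               # end-of-scan marker

            threading.Thread(target=producer, daemon=True).start()
            while (block := buf.get()) is not None:
                yield block

        for b in read_ahead(lambda i: f"block-{i}", 4):
            print(b)                        # block-0 .. block-3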

    In aggregation and binary joins, sorting and hashing are preferred
    to the naive nested loop because they result in a smaller volume of
    intermediate results. But if one of the operand relations is so
    small that the build relation fits in main memory, a nested loop can
    do just as well, with the additional merit of being the most
    versatile join method.
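
    For instance (a hypothetical band join, not an example from the
    paper), a nested loop over an in-memory inner relation handles
    non-equality predicates that hash-based joins cannot:

        def nested_loop_join(outer, inner, pred):
            # Naive nested loop with the inner relation held in memory;
            # pred may be any predicate, not just key equality.
            inner = list(inner)    # assumed small enough to fit in RAM
            for o in outer:
                for i in inner:
                    if pred(o, i):
                        yield o, i

        # Join employees to salary bands: lo <= salary < hi
        pairs = nested_loop_join(
            [("ann", 90)], [(50, 100, "senior")],
            lambda o, i: i[0] <= o[1] < i[1])
        print(list(pairs))   # [(('ann', 90), (50, 100, 'senior'))]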

    In section 8 there is an exciting idea about maintaining a decision
    tree of possible query plans. Although it is expensive to maintain
    such a structure, I am hoping this is the right direction to go in
    the near future, since some information is not available until run
    time, when certain intermediate results have materialized; also, the
    cost estimator could be a lot less complex, since a rough estimate
    suffices to eliminate the obviously unpromising plans.
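
    A sketch of the flavor of such a run-time decision (the subplans and
    the threshold are hypothetical placeholders, not the paper's own
    operator):

        def choose_plan(intermediate, small_plan, large_plan,
                        threshold=10_000):
            # Both subplans were optimized ahead of time; the choice
            # between them waits until the intermediate result has
            # materialized and its cardinality is known exactly.
            rows = list(intermediate)
            plan = small_plan if len(rows) < threshold else large_plan
            return plan(rows)

        result = choose_plan(
            range(50),
            small_plan=lambda r: f"nested loop over {len(r)} rows",
            large_plan=lambda r: f"hash join over {len(r)} rows")
        print(result)   # nested loop over 50 rows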

    Section 11.1 talks about bit vectors, the same idea an OS uses for
    managing free disk pages. A bit vector looks similar to an index in
    the sense that it serves as a locator for bulk data: after
    processing the bit vector or the index, only a small amount of data
    needs to be fetched.
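
    A minimal bit-vector filter sketch (effectively a one-hash Bloom
    filter; the sizes and keys are invented): bits set from one join
    input let the other input discard most non-matching rows before they
    are fetched or shipped, at the cost of occasional false positives.

        N_BITS = 1 << 16

        def make_bit_vector(keys):
            # One pass over the build side sets one bit per key.
            bits = bytearray(N_BITS // 8)
            for k in keys:
                h = hash(k) % N_BITS
                bits[h >> 3] |= 1 << (h & 7)
            return bits

        def maybe_present(bits, k):
            # False negatives are impossible; false positives are not.
            h = hash(k) % N_BITS
            return bool(bits[h >> 3] & (1 << (h & 7)))

        bv = make_bit_vector([10, 42, 99])
        print([k for k in [7, 42, 400] if maybe_present(bv, k)])  # [42]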

    The nested relations of section 12.4 are a product of OO databases.
    It is currently expensive to maintain such a complicated structure,
    and queries over it are not executed as efficiently as their
    relational counterparts. But they will have to be made as efficient
    if OODBs are to take off someday.

