Query Evaluation Techniques

From: Charles Giefer (cgiefer@cs.washington.edu)
Date: Wed May 19 2004 - 01:22:08 PDT


    Query Evaluation Techniques for Large Databases
    Goetz Graefe

    This paper describes what happens inside the actual implementation of a
    database system. The user's query is described by a logical algebra, but
    the database needs to rewrite it into a physical algebra: the system's
    concrete plan for how to satisfy the user's request. Selecting the
    physical plan is a very difficult decision, because there are many ways
    of implementing each operation (such as joins), there are many orderings
    of the operations that can be selected, and there are considerations such
    as memory size and disk locality that need to be taken into account.
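
    As a concrete (hypothetical) illustration of the logical/physical split,
    the sketch below maps one logical join onto several candidate physical
    operators with a toy cost rule; the class names and the heuristic are my
    own, not the paper's.

        from dataclasses import dataclass

        @dataclass
        class LogicalJoin:
            """What the query asks for, independent of how it is executed."""
            left: str
            right: str
            predicate: str

        def pick_physical(op: LogicalJoin, memory_pages: int,
                          left_pages: int, right_pages: int) -> str:
            """Toy heuristic: if one input fits in memory, hashing (or, for
            tiny inputs, nested loops) is attractive; otherwise sort."""
            if min(left_pages, right_pages) <= memory_pages:
                if left_pages + right_pages <= 2 * memory_pages:
                    return "nested-loops join"
                return "hash join"
            return "sort-merge join"

        plan = LogicalJoin("Drinkers", "Likes", "Drinkers.name = Likes.drinker")
        print(pick_physical(plan, memory_pages=100, left_pages=50, right_pages=5000))
        # prints "hash join" under this toy rule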

    Because databases can be very large, efficient mechanisms are needed that
    read and write files to and from disk as few times as possible. It is
    also desirable to operate on large local segments of data (roughly the
    size of memory) so that as much work as possible is done on a segment
    before it is replaced.

    There are two major themes in efficient evaluation: sorting and hashing.
    Both can achieve efficient results in query processing. Sorting (external
    merge sort is preferred) puts the data on disk ordered by the relevant
    attributes. If tables are joined on the attributes by which they are
    sorted, a single linear pass through each file is sufficient. This
    reduces a potentially inefficient problem to a streaming computation. Not
    only is this efficient on one processor, but since the relevant data is
    localized (due to sorting), the join can be done by several parallel
    processors, each taking a chunk of the data. Hashing provides similar
    locality, but for different reasons. Hashing can be thought of as a lazy
    sort: it increases the locality of the data, just like sorting, but can
    often be done faster. Joins on hashed (partitioned) tables need only
    compare rows within matching hash buckets.
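
    To make the sort-merge idea concrete, here is a minimal merge join sketch
    over two inputs already sorted on the join key; the function and the
    drinker/beer data are my own illustration, not code from the paper.

        def merge_join(left, right, key=lambda row: row[0]):
            """Single linear pass over two inputs sorted on the join key.
            Handles duplicate keys by buffering the current right-side group."""
            left_iter, right_iter = iter(left), iter(right)
            l = next(left_iter, None)
            r = next(right_iter, None)
            while l is not None and r is not None:
                if key(l) < key(r):
                    l = next(left_iter, None)
                elif key(l) > key(r):
                    r = next(right_iter, None)
                else:
                    # Collect all right rows with this key so duplicate left
                    # keys can reuse them.
                    group, k = [], key(r)
                    while r is not None and key(r) == k:
                        group.append(r)
                        r = next(right_iter, None)
                    while l is not None and key(l) == k:
                        for g in group:
                            yield l + g
                        l = next(left_iter, None)

        drinkers = [("alice", 25), ("bob", 30)]                  # sorted by name
        likes = [("alice", "ipa"), ("alice", "stout"), ("bob", "lager")]
        print(list(merge_join(drinkers, likes)))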

    It seems like clever sorting and hashing can solve most database
    efficiency issues. For example, aggregation seems like a difficult task,
    but once the data is localized by the "group by" attributes, aggregation
    becomes quick. The only problem I saw is when the data being aggregated
    is very large. In that case, we might want to distribute the aggregation
    computation among several processors, and clever means of distributing
    (partitioning) the data to be computed are needed for this to be
    efficient.
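
    A rough sketch of hash-based group-by aggregation, under my own naming
    (not the paper's): each row is hashed to its group and folded into that
    group's running state. Distributing it would amount to partitioning rows
    by a hash of the group key across processors and aggregating each
    partition locally.

        from collections import defaultdict

        def hash_aggregate(rows, group_key, agg_init, agg_step):
            """One-pass group-by: hash each row to its group and fold it into
            that group's running aggregate state."""
            groups = defaultdict(agg_init)
            for row in rows:
                k = group_key(row)
                groups[k] = agg_step(groups[k], row)
            return dict(groups)

        # Example: average price per beer from (beer, price) rows.
        sales = [("ipa", 6.0), ("ipa", 7.0), ("lager", 4.0)]
        sums = hash_aggregate(
            sales,
            group_key=lambda r: r[0],
            agg_init=lambda: (0.0, 0),                      # (sum, count)
            agg_step=lambda acc, r: (acc[0] + r[1], acc[1] + 1),
        )
        averages = {beer: s / c for beer, (s, c) in sums.items()}
        print(averages)   # {'ipa': 6.5, 'lager': 4.0}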

    I found it interesting that intermediate steps can be avoided by clever
    processing. For instance, the last merge pass of the external merge sort
    can be combined with a merge join so that both are accomplished
    simultaneously. This is similar to a processor in computer architecture
    doing out-of-order execution to increase parallelism. I was surprised
    when the paper pointed out that scheduling bushy trees in a
    multiprocessor system is a very hard problem. This seems counterintuitive,
    since each branch of the tree looks like it could be executed
    independently. This, at least in my understanding, is different from a
    left-deep tree, where the operations must happen sequentially on the
    data. Clearly, though, if "Decomposition" can achieve 3-9 joins in each
    cycle of the database, this would favor long, heavy-sided trees.
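
    The step-combining point can be sketched roughly as follows: because the
    final merge pass can be produced lazily, it can feed a merge join
    directly instead of first writing a fully sorted file. This is my own
    illustration (it reuses the merge_join generator from the sketch above,
    and the run contents are invented), not the paper's code.

        import heapq

        # Sorted runs left over from earlier merge-sort passes (in memory here).
        runs_left  = [[("alice", 25)], [("bob", 30), ("carol", 41)]]
        runs_right = [[("alice", "ipa"), ("carol", "stout")], [("bob", "lager")]]

        # heapq.merge is lazy, so the final merge pass becomes a sorted stream
        # that the join consumes directly, with no intermediate file written.
        left_stream  = heapq.merge(*runs_left,  key=lambda r: r[0])
        right_stream = heapq.merge(*runs_right, key=lambda r: r[0])

        print(list(merge_join(left_stream, right_stream)))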

    It is surprising that the paper treats bit vector filtering as inherently
    different from hashing. It seems to me that bit vector filtering is
    simply a different way of applying hashing to reap the benefits of the
    locality gained.
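
    My reading of bit vector filtering, sketched with made-up helpers: hash
    the join keys of one input into a bit vector, then use it to discard rows
    of the other input that cannot possibly match before the real join runs.
    Survivors can still be false positives, so the join still checks the
    actual keys.

        def build_bit_vector(rows, key, num_bits=1024):
            """Set one bit per join key seen on the build side."""
            bits = 0
            for row in rows:
                bits |= 1 << (hash(key(row)) % num_bits)
            return bits

        def probe_survivors(rows, key, bits, num_bits=1024):
            """Drop rows whose key's bit is not set; survivors may still be
            false positives, so the real join must still compare keys."""
            for row in rows:
                if bits & (1 << (hash(key(row)) % num_bits)):
                    yield row

        drinkers = [("alice",), ("bob",)]
        likes = [("alice", "ipa"), ("dave", "porter"), ("bob", "lager")]
        bv = build_bit_vector(drinkers, key=lambda r: r[0])
        print(list(probe_survivors(likes, key=lambda r: r[0], bits=bv)))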

    The last thing that I found surprising in this paper was the deviation
    from first normal form. The paper says that converting back to first
    normal form throws away work already done to cluster the data. This makes
    sense, since the hashing and sorting functions try to order the data so
    that it has "locality in meaning," and first normal form does not allow
    this type of locality. So if the database uses non-first-normal-form
    representations behind the scenes, why is that not allowed in the actual
    user interface to the database? My thought is that the SQL language is
    too restrictive to express data that is not in first normal form. Also,
    while it may be easier to process data that is not in first normal form,
    it is much easier to store the data on disk in first normal form.
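
    A tiny illustration of the distinction I have in mind (my own example,
    not the paper's): the same facts stored as flat first-normal-form rows
    versus a nested form where everything about one drinker is clustered
    together.

        # First normal form: every value atomic, one row per (drinker, beer) fact.
        likes_1nf = [
            ("alice", "ipa"),
            ("alice", "stout"),
            ("bob", "lager"),
        ]

        # Non-first-normal-form: the beers a drinker likes are nested inside
        # the drinker's entry, so everything about "alice" is stored together.
        likes_nested = {
            "alice": ["ipa", "stout"],
            "bob": ["lager"],
        }

        # Flattening the nested form back to 1NF discards that clustering.
        flattened = [(d, b) for d, beers in likes_nested.items() for b in beers]
        assert sorted(flattened) == sorted(likes_1nf)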

    Overall, this paper showed some interesting aspects of query efficiency
    and gave a good discussion of the different facets of each technique. It
    was, however, a little dry and could have benefited from some interesting
    examples (like the beers/drinkers/bars example).

