Review: Query Evaluation Techniques for Large Databases.

From: Lucas Kreger-Stickles (lucasks@cs.washington.edu)
Date: Wed May 19 2004 - 12:01:31 PDT

    Review by Lucas Kreger-Stickles

        In this somewhat large tome, the author presents a broad
    compilation of information and techniques regarding large-scale
    database management systems.

    -Sec. 1
        Here the author makes a distinction between the logical algebra of
    the system and the physical algebra of the system. The logical algebra
    is closely related to the system's data model and determines what
    sorts of queries can be expressed; the physical algebra, on the other
    hand, is system specific and deals with the details of how legal
    queries are executed. The author indicates that performance analysis
    makes no sense in the context of a logical algebra but does make sense
    in the context of a physical algebra.
        The other major point we were to read about concerned the idea of
    an iterator model. Here the issue is dealing with queries that involve
    multiple joins. The idea is that instead of executing each operator
    independently (as separate files or as separate operating system
    processes), operators can instead be invoked as needed. In order to do
    this, the logical tree-style structure of the query must be converted
    into an iterative structure, and such a transformation is non-trivial.
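
        To make this concrete, below is a minimal Python sketch of the
    open/next/close iterator interface as I understand it; the class and
    variable names are my own illustration, not the paper's.

        class Scan:
            """Leaf operator: iterates over an in-memory table."""
            def __init__(self, table):
                self.table = table
            def open(self):
                self.pos = 0
            def next(self):  # returns one row, or None when exhausted
                if self.pos == len(self.table):
                    return None
                row = self.table[self.pos]
                self.pos += 1
                return row
            def close(self):
                pass

        class Select:
            """Filter operator: pulls rows from its input on demand."""
            def __init__(self, input_op, predicate):
                self.input_op = input_op
                self.predicate = predicate
            def open(self):
                self.input_op.open()
            def next(self):
                while True:
                    row = self.input_op.next()
                    if row is None or self.predicate(row):
                        return row
            def close(self):
                self.input_op.close()

        # The whole plan is driven by calling next() at the root; no
        # operator ever materializes its full output.
        plan = Select(Scan([1, 5, 7, 2]), lambda r: r > 2)
        plan.open()
        while (row := plan.next()) is not None:
            print(row)
        plan.close()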
      
    -Sec. 2.1
        In this section we learn that sorting is an important and frequent
    function for DBMSs, and that the appropriate sorting algorithms are
    divide-and-conquer style approaches. However, there is a contrast
    between merge sort and quicksort (the two potentially appropriate
    sorting algorithms) in that one divides on physical keys and sorts on
    logical keys (merge sort) and the other does the opposite. This
    difference in behavior creates different instances in which one may
    prove more efficient than the other.
        In addition, the concept of level-0, or initial, runs is
    introduced, for which there are two potential techniques:
           -1.) First, an in-memory sort is used, in which every run uses
    all of available memory and the number of runs is roughly (size of
    data to sort / available RAM).
           -2.) Second, one could use a technique known as replacement
    selection, which is attributed to Knuth. In this technique one employs
    a priority heap to keep memory constantly full, repeatedly writing out
    the smallest item in memory so long as that item is no smaller than
    the last item written to the current run. When an item is written out,
    the item brought in to replace it is either larger (in which case it
    can still go into the same run) or smaller (in which case it is tagged
    to go into the next run); see the sketch after this list.
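
        Here is a minimal Python sketch of replacement selection, under
    the assumption that records fit in a fixed number of in-memory slots;
    the yielded in-memory lists stand in for run files written to disk.

        import heapq

        def replacement_selection(records, memory_size):
            """Generate sorted runs (they average twice memory_size)."""
            it = iter(records)
            heap = []          # records eligible for the current run
            held_back = []     # records tagged for the next run
            for rec in it:     # fill memory
                heap.append(rec)
                if len(heap) == memory_size:
                    break
            heapq.heapify(heap)
            run = []
            for rec in it:
                smallest = heapq.heappop(heap)
                run.append(smallest)           # "write" to current run
                if rec >= smallest:
                    heapq.heappush(heap, rec)  # still fits this run
                else:
                    held_back.append(rec)      # goes into the next run
                if not heap:                   # current run is finished
                    yield run
                    run, heap, held_back = [], held_back, []
                    heapq.heapify(heap)
            run.extend(sorted(heap))           # drain remaining memory
            if run:
                yield run
            if held_back:
                yield sorted(held_back)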

    -Sec 2.2 Hashing
        Here we learn about hybrid hashing, which starts with the
    optimistic assumption that no overflow will occur; if overflow does
    occur, the input is partitioned into multiple files, one of which is
    kept in memory while the rest are written to disk.
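
        A minimal Python sketch of hybrid hashing applied to duplicate
    removal follows, assuming a fixed in-memory capacity; the spill lists
    stand in for partition files on disk, and the fan-out of four is an
    arbitrary illustrative choice.

        def hybrid_hash_dedup(records, capacity, fan_out=4):
            table = set()   # optimistic in-memory hash table
            spill = None    # overflow partitions, created only if needed
            for rec in records:
                if spill is not None and hash(rec) % fan_out != 0:
                    # after overflow, non-resident partitions go to "disk"
                    spill[hash(rec) % fan_out].append(rec)
                    continue
                if rec in table:
                    continue
                if spill is None and len(table) >= capacity:
                    # overflow: partition 0 stays in memory; entries
                    # hashing elsewhere are evicted to the spill files
                    spill = [[] for _ in range(fan_out)]
                    for old in list(table):
                        p = hash(old) % fan_out
                        if p != 0:
                            table.discard(old)
                            spill[p].append(old)
                    if hash(rec) % fan_out != 0:
                        spill[hash(rec) % fan_out].append(rec)
                        continue
                table.add(rec)
            result = list(table)
            if spill:
                for part in spill[1:]:
                    # each spilled partition is deduplicated recursively
                    result.extend(hybrid_hash_dedup(part, capacity, fan_out))
            return result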

    -Sec 3.1 File Scan
        The author indicates that read-ahead can improve this process, but
    that Unix operating systems do not support it, forcing many Unix
    databases to provide this OS-style functionality themselves.
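
        As an illustration, here is a minimal Python sketch of
    application-level read-ahead via double buffering; the block size and
    queue depth are arbitrary illustrative values.

        import queue
        import threading

        def scan_with_readahead(path, block_size=64 * 1024, depth=2):
            blocks = queue.Queue(maxsize=depth)
            def prefetch():
                with open(path, "rb") as f:
                    while True:
                        block = f.read(block_size)
                        blocks.put(block)   # empty bytes signals EOF
                        if not block:
                            return
            threading.Thread(target=prefetch, daemon=True).start()
            while True:
                block = blocks.get()  # consumption overlaps the next read
                if not block:
                    return
                yield block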

    -Sec 3.2
        Here the author discusses the advantages of associative access
    using indices, specifically discussing the advantages of the B-tree
    and its derivatives.

    -Sec 4.0 (1-4)
        Section 4 concerns aggregation and duplicate removal based on
    three techniques: nested loops, sorting, and hashing. The performance
    tradeoffs for each of these approaches are then compared. Nested loops
    is the most intuitive approach but does not work for large data sets;
    sorting is not as bad, and can be improved by removing duplicates as
    early as possible; hashing can be quite fast but slows down immensely
    if partitioning has to occur. A sketch of the three approaches
    follows.
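
        The contrast can be summarized in a few lines of Python; these
    in-memory versions are my own simplifications of the external
    algorithms the paper describes.

        def dedup_nested_loops(records):
            # O(n^2): compare each record against everything kept so far
            kept = []
            for rec in records:
                if not any(rec == k for k in kept):
                    kept.append(rec)
            return kept

        def dedup_sorting(records):
            # sort, then drop adjacent duplicates; an external sort can
            # already drop duplicates while writing each run
            out = []
            for rec in sorted(records):
                if not out or rec != out[-1]:
                    out.append(rec)
            return out

        def dedup_hashing(records):
            # one pass through a hash table; degrades if partitioning to
            # disk becomes necessary (see the hybrid hashing sketch above)
            seen = set()
            return [r for r in records if not (r in seen or seen.add(r))]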

    -Sec 5 (1-3) Binary matching operations
        Section 5 concerns all of the different classes of, and techniques
    related to, joins. The major techniques discussed are the same as for
    aggregation and duplicate removal: namely, nested loops, merging, and
    hashing.
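
        As one representative of the hash-based family, here is a minimal
    Python sketch of an in-memory hash join on a single key column; the
    column names are illustrative, and the build input is assumed to fit
    in memory.

        def hash_join(build, probe, key):
            table = {}
            for row in build:              # build phase
                table.setdefault(row[key], []).append(row)
            for row in probe:              # probe phase
                for match in table.get(row[key], []):
                    yield {**match, **row}

        # Usage:
        emp = [{"id": 1, "name": "a"}, {"id": 2, "name": "b"}]
        sal = [{"id": 1, "pay": 10}, {"id": 1, "pay": 20}]
        print(list(hash_join(emp, sal, "id")))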

    -Sec 8
        Section eight concerns the execution of complex query plans and
    the techniques used to do so. One of the major focuses of this section
    is the performance enhancement that is possible if one attempts to
    execute multiple subqueries in a concurrent and overlapping fashion.
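
        A minimal Python sketch of the idea, with threads standing in for
    the paper's far more careful scheduling of operators; the subplans
    here are just callables, so only their I/O waits would actually
    overlap.

        from concurrent.futures import ThreadPoolExecutor

        def run_overlapped(*subplans):
            """Run independent subplans concurrently, collect results."""
            with ThreadPoolExecutor() as pool:
                futures = [pool.submit(plan) for plan in subplans]
                return [f.result() for f in futures]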

    Thoughts:
        Overall I thought that this paper was surprisingly easy to read
    (while at times a bit dense). I imagine that it serves as a very
    useful reference to have around, not only when one is trying to
    construct effective databases, but also when one wants a nice concise
    glimpse at what has been tried and what has succeeded before in the
    realm of databases. In particular, I felt that the text gave a good
    feel for the fact that, when dealing with such large sets of data, the
    smallest of changes in how one does something can have a dramatic
    effect on efficiency. Finally, I was really interested to see how much
    the algorithms depended on system-specific details (such as available
    memory) to run as efficiently as possible.

    -Lucas Kreger-Stickles

