big paper review

From: Aaron Chang (anc327@yahoo.com)
Date: Wed May 19 2004 - 04:43:35 PDT


    rev 7
    5-19-04

    The Big Paper ("Query Evaluation Techniques for Large Databases")

    sec 1 intro

    1. physical algebra: all query processing algorithms; the complete
    collection of operators and mechanisms to execute complex expressions
    in a query engine; system-specific; comes with cost functions
       logical algebra: closely related to the data model, such as the
    relational algebra (underlying SQL); typically the same across all
    systems; not directly executable; abstracted from the system

    2. iterator model: implementation of all operators so that they can
    be scheduled within one process; iteration over intermediate
    granules; ex: print, scan, hash join, merge-join, sort
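
    a rough sketch of the iterator idea in Python (the open/next/close
    protocol follows the model described here; the Scan and Select
    operators, and the toy rows, are my own illustration):

        # minimal pull-based iterators: each operator pulls rows from its
        # input one at a time through open/next/close
        class Scan:
            def __init__(self, rows):
                self.rows = rows
            def open(self):
                self.it = iter(self.rows)
            def next(self):                  # a row, or None when exhausted
                return next(self.it, None)
            def close(self):
                pass

        class Select:
            def __init__(self, child, pred):
                self.child, self.pred = child, pred
            def open(self):
                self.child.open()
            def next(self):
                row = self.child.next()
                while row is not None and not self.pred(row):
                    row = self.child.next()
                return row
            def close(self):
                self.child.close()

        plan = Select(Scan([1, 2, 3, 4]), lambda r: r % 2 == 0)
        plan.open()
        row = plan.next()
        while row is not None:
            print(row)                       # prints 2, then 4
            row = plan.next()
        plan.close()

    the nice property: the whole plan runs in one process, and no
    intermediate result is ever materialized.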

    sec 2.1 sorting

    3. all sorting algorithms use merge-sort (interesting) - must be for
    joins; quicksort cannot be used for datasets larger than available
    memory; sorting algorithms for large datasets on disk use "dividing
    and combining"

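    a toy version of the "dividing and combining" idea (Python lists
    stand in for disk files, and the 3-item memory limit is made up):

        import heapq

        # external merge sort: "divide" the input into sorted runs no
        # bigger than memory, then "combine" them with one k-way merge
        def external_sort(rows, memory=3):
            runs = [sorted(rows[i:i + memory])     # in-memory sort per run
                    for i in range(0, len(rows), memory)]
            return list(heapq.merge(*runs))        # k-way merge of the runs

        print(external_sort([9, 4, 7, 1, 8, 2, 6, 3, 5]))
        # [1, 2, 3, 4, 5, 6, 7, 8, 9]
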
    4. level 0 runs: built by in-memory quicksort or replacement
    selection (fill memory with items in a priority heap; used when the
    dataset is larger than available memory)
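
    replacement selection, sketched under the same toy assumptions (a
    3-item "memory", lists as run files); it writes out the smallest
    heap item and only admits new input to the current run if it still
    sorts after what was already written:

        import heapq

        def replacement_selection(items, memory=3):
            it = iter(items)
            heap = [x for _, x in zip(range(memory), it)]  # fill memory
            heapq.heapify(heap)
            runs, run, frozen = [], [], []
            for x in it:
                if not heap:                   # current run is exhausted
                    runs.append(run)
                    run, heap, frozen = [], frozen, []
                    heapq.heapify(heap)
                smallest = heapq.heappop(heap)
                run.append(smallest)
                if x >= smallest:
                    heapq.heappush(heap, x)    # still fits this run
                else:
                    frozen.append(x)           # must wait for the next run
            while heap:                        # drain the final run
                run.append(heapq.heappop(heap))
            runs.append(run)
            if frozen:                         # frozen items seed new runs
                runs.extend(replacement_selection(frozen, memory))
            return runs

        print(replacement_selection([5, 1, 4, 8, 2, 9, 3, 7, 6]))
        # [[1, 4, 5, 8, 9], [2, 3, 6, 7]]

    the payoff is that on random input the runs come out roughly twice
    the memory size, so fewer merge passes are needed later.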

    sec 2.2 hashing

    5. hashing can be used as an alternative to sorting; use an
    in-memory hash table (fast)
            bin overflow is typically handled by "avoidance or
    resolution"
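
    when everything fits in memory the idea is just this (a Python set
    playing the role of the hash table; duplicate removal as the
    example task):

        # one-pass, in-memory hash-based duplicate removal - no sorting
        def hash_distinct(rows):
            seen = set()
            for row in rows:
                if row not in seen:        # hash probe
                    seen.add(row)          # hash insert
                    yield row

        print(list(hash_distinct([3, 1, 3, 2, 1, 4])))   # [3, 1, 2, 4]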

    6. partitioning: the dataset is partitioned before actually being
    stored in the hash table (avoidance approach)
       hybrid hashing: part of the hash table stays in memory while the
    rest is referenced on disk through the directory (resolution)

            -> seems like the latter is more realistic and practical
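
    a sketch of the avoidance approach (the fan-out of 4 and the use of
    Python's built-in hash are arbitrary choices; lists stand in for
    the partition files on disk):

        # overflow avoidance by partitioning: split the input on a hash
        # of the key so each partition's table fits in memory, then
        # process each partition independently
        def partitioned_distinct(rows, fanout=4):
            partitions = [[] for _ in range(fanout)]
            for row in rows:
                partitions[hash(row) % fanout].append(row)  # partition phase
            out = []
            for part in partitions:        # one in-memory table per partition
                out.extend(set(part))
            return out

        print(sorted(partitioned_distinct([3, 1, 3, 2, 1, 4])))
        # [1, 2, 3, 4]

    this works because equal keys always hash to the same partition, so
    per-partition duplicate removal is globally correct.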

    sec 3.1 disk access

    7. the first topic in disk access: "read-ahead"; prefetching makes
    scans fast, but can cause buffer fragmentation and wasted bandwidth

    sec 3.2 indices

    8. indices are built with associative search techniques (mapping key
    values to locator info); B-trees (and their variants) are the most
    used; other tree or hash structures are less common; indices can be
    clustered or non-clustered

    sec 4.1 aggregation using nested loops

    9. simplest aggregation/dupe-removal scheme: a temp file stores the
    data during each loop; inefficient for large input sets; it can
    handle unusual joins, but is not generally used because of its
    performance weakness
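
    the scheme in miniature (a Python list as the temporary file,
    duplicate removal as the aggregate):

        # naive nested-loops duplicate removal: for each input row, scan
        # everything kept so far (the "temp file") for a match - O(n^2)
        def nested_loops_distinct(rows):
            kept = []                      # stands in for the temp file
            for row in rows:
                if not any(row == k for k in kept):   # inner loop
                    kept.append(row)
            return kept

        print(nested_loops_distinct([3, 1, 3, 2, 1, 4]))   # [3, 1, 2, 4]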

    sec 4.2 agg. based on sorting

    10. remove dupes as early as possible

    sec 4.3 agg based on hashing

    11. the amount of I/O for hash aggregation depends on the
    partitioning done before output

    sec. 4.4 performance comparison

    12. both sort- and hash-based agg methods are log-order in cost

    sec 5 binary matching operators

    13. combining data: relational joins - most current db
    systems use nested
    loops and merge-joins only; hash-joins are considered
    more efficient

    sec 5.1 nested loops join

    14. simplest method; poor performance
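
    in miniature (tuples joined on their first field; the tables are
    invented):

        # naive nested-loops join: compare every row of one input with
        # every row of the other - O(|R| * |S|) comparisons
        def nested_loops_join(R, S):
            return [(r, s) for r in R for s in S if r[0] == s[0]]

        R = [(1, 'a'), (2, 'b')]
        S = [(2, 'x'), (3, 'y'), (2, 'z')]
        print(nested_loops_join(R, S))
        # [((2, 'b'), (2, 'x')), ((2, 'b'), (2, 'z'))]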

    sec 5.2 merge join

    15. two inputs are sorted on join attribute
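
    a sketch with the same invented tables, both pre-sorted on the key;
    two cursors advance in step, exactly like the merge phase of merge
    sort:

        def merge_join(R, S):          # R, S sorted on the first field
            out, i, j = [], 0, 0
            while i < len(R) and j < len(S):
                if R[i][0] < S[j][0]:
                    i += 1
                elif R[i][0] > S[j][0]:
                    j += 1
                else:                  # equal keys: emit all pairs in group
                    k = j
                    while k < len(S) and S[k][0] == R[i][0]:
                        out.append((R[i], S[k]))
                        k += 1
                    i += 1
            return out

        print(merge_join([(1, 'a'), (2, 'b')],
                         [(2, 'x'), (2, 'z'), (3, 'y')]))
        # [((2, 'b'), (2, 'x')), ((2, 'b'), (2, 'z'))]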

    sec 5.3 hash join

    16. build a hash table keyed on one input, then probe it with the
    items of the other input; I/O cost of binary hash operations is
    logarithmic
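
    the in-memory case, sketched with the same invented tables (build
    on the smaller input, probe with the larger):

        from collections import defaultdict

        def hash_join(R, S):                  # join on the first field
            table = defaultdict(list)
            for r in R:                       # build phase
                table[r[0]].append(r)
            return [(r, s) for s in S         # probe phase
                           for r in table.get(s[0], [])]

        print(hash_join([(1, 'a'), (2, 'b')],
                        [(2, 'x'), (3, 'y'), (2, 'z')]))
        # [((2, 'b'), (2, 'x')), ((2, 'b'), (2, 'z'))]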

    overall, a pretty exhaustive coverage of the algorithms and theory
    behind database internals. though it was long, the author is
    surprisingly clear on many topics. an interesting read, one that
    serves better as a reference than as a casual scientific read.


