big paper review

From: Aaron Chang (anc327@yahoo.com)
Date: Wed May 19 2004 - 04:43:35 PDT


    rev 7
    5-19-04

    The Big Paper ("Query Evaluation Techniques for Large Databases")

    sec 1 intro

    1. physical algebra: all query processing algorithms; the complete
    collection of operators and mechanisms to execute complex expressions
    in a query engine; system-specific; comes with cost functions
       logical algebra: closely related to the data model, such as the
    relational algebra (underlying SQL); typically the same across all
    systems; not directly executable; abstracted from the system

    2. iterator model: implementation of all operators so that they can
    be scheduled within one process; iteration over intermediate
    granules; ex: print, scan, hash join, merge-join, sort
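
    a rough sketch of the iterator idea in Python (the open/next/close
    protocol follows the model described here; the Scan and Select
    operators, and the toy rows, are my own illustration):

        # minimal pull-based iterators: each operator pulls rows from its
        # input one at a time through open/next/close
        class Scan:
            def __init__(self, rows):
                self.rows = rows
            def open(self):
                self.it = iter(self.rows)
            def next(self):                  # a row, or None when exhausted
                return next(self.it, None)
            def close(self):
                pass

        class Select:
            def __init__(self, child, pred):
                self.child, self.pred = child, pred
            def open(self):
                self.child.open()
            def next(self):
                row = self.child.next()
                while row is not None and not self.pred(row):
                    row = self.child.next()
                return row
            def close(self):
                self.child.close()

        plan = Select(Scan([1, 2, 3, 4]), lambda r: r % 2 == 0)
        plan.open()
        row = plan.next()
        while row is not None:
            print(row)                       # prints 2, then 4
            row = plan.next()
        plan.close()

    the nice property: the whole plan runs in one process, and no
    intermediate result is ever materialized.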

    sec 2.1 sorting

    3. all sorting algorithms use merge-sort (interesting) - must be for
    joins; quicksort cannot be used for datasets larger than available
    memory; sorting algorithms for large datasets on disk use "dividing
    and combining"

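    a toy version of the "dividing and combining" idea (Python lists
    stand in for disk files, and the 3-item memory limit is made up):

        import heapq

        # external merge sort: "divide" the input into sorted runs no
        # bigger than memory, then "combine" them with one k-way merge
        def external_sort(rows, memory=3):
            runs = [sorted(rows[i:i + memory])     # in-memory sort per run
                    for i in range(0, len(rows), memory)]
            return list(heapq.merge(*runs))        # k-way merge of the runs

        print(external_sort([9, 4, 7, 1, 8, 2, 6, 3, 5]))
        # [1, 2, 3, 4, 5, 6, 7, 8, 9]
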
    4. level 0 runs: built by in-memory quicksort or replacement
    selection (fill memory with items in a priority heap; used when the
    dataset is larger than available memory)
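
    replacement selection, sketched under the same toy assumptions (a
    3-item "memory", lists as run files); it writes out the smallest
    heap item and only admits new input to the current run if it still
    sorts after what was already written:

        import heapq

        def replacement_selection(items, memory=3):
            it = iter(items)
            heap = [x for _, x in zip(range(memory), it)]  # fill memory
            heapq.heapify(heap)
            runs, run, frozen = [], [], []
            for x in it:
                if not heap:                   # current run is exhausted
                    runs.append(run)
                    run, heap, frozen = [], frozen, []
                    heapq.heapify(heap)
                smallest = heapq.heappop(heap)
                run.append(smallest)
                if x >= smallest:
                    heapq.heappush(heap, x)    # still fits this run
                else:
                    frozen.append(x)           # must wait for the next run
            while heap:                        # drain the final run
                run.append(heapq.heappop(heap))
            runs.append(run)
            if frozen:                         # frozen items seed new runs
                runs.extend(replacement_selection(frozen, memory))
            return runs

        print(replacement_selection([5, 1, 4, 8, 2, 9, 3, 7, 6]))
        # [[1, 4, 5, 8, 9], [2, 3, 6, 7]]

    the payoff is that on random input the runs come out roughly twice
    the memory size, so fewer merge passes are needed later.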

    sec 2.2 hashing

    5. hashing can be used as an alternative to sorting; use an
    in-memory hash table (fast)
            bin overflow is typically handled by "avoidance or
    resolution"
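
    when everything fits in memory the idea is just this (a Python set
    playing the role of the hash table; duplicate removal as the
    example task):

        # one-pass, in-memory hash-based duplicate removal - no sorting
        def hash_distinct(rows):
            seen = set()
            for row in rows:
                if row not in seen:        # hash probe
                    seen.add(row)          # hash insert
                    yield row

        print(list(hash_distinct([3, 1, 3, 2, 1, 4])))   # [3, 1, 2, 4]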

    6. partitioning: the dataset is partitioned before actually being
    stored in the hash table (avoidance approach)
       hybrid hashing: part of the hash table stays in memory while the
    rest is referenced on disk through the directory (resolution)

            -> seems like the latter is more realistic and practical
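
    a sketch of the avoidance approach (the fan-out of 4 and the use of
    Python's built-in hash are arbitrary choices; lists stand in for
    the partition files on disk):

        # overflow avoidance by partitioning: split the input on a hash
        # of the key so each partition's table fits in memory, then
        # process each partition independently
        def partitioned_distinct(rows, fanout=4):
            partitions = [[] for _ in range(fanout)]
            for row in rows:
                partitions[hash(row) % fanout].append(row)  # partition phase
            out = []
            for part in partitions:        # one in-memory table per partition
                out.extend(set(part))
            return out

        print(sorted(partitioned_distinct([3, 1, 3, 2, 1, 4])))
        # [1, 2, 3, 4]

    this works because equal keys always hash to the same partition, so
    per-partition duplicate removal is globally correct.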

    sec 3.1 disk access

    7. the first topic in disk access: "read-ahead"; prefetching makes
    scans fast, but can cause buffer fragmentation and wasted bandwidth

    sec 3.2 indices

    8. indices are built with associative search techniques (mapping key
    values to locator info); B-trees (and their variants) are the most
    used; other tree or hash structures are less common; indices can be
    clustered or non-clustered

    sec 4.1 aggregation using nested loops

    9. simplest aggregation/dupe-removal scheme: a temp file stores the
    data during each loop; inefficient for large input sets; it can
    handle unusual joins, but is not generally used because of its
    performance weakness
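
    the scheme in miniature (a Python list as the temporary file,
    duplicate removal as the aggregate):

        # naive nested-loops duplicate removal: for each input row, scan
        # everything kept so far (the "temp file") for a match - O(n^2)
        def nested_loops_distinct(rows):
            kept = []                      # stands in for the temp file
            for row in rows:
                if not any(row == k for k in kept):   # inner loop
                    kept.append(row)
            return kept

        print(nested_loops_distinct([3, 1, 3, 2, 1, 4]))   # [3, 1, 2, 4]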

    sec 4.2 agg. based on sorting

    10. remove dupes as early as possible

    sec 4.3 agg based on hashing

    11. the amount of I/O for hash aggregation depends on the
    partitioning done before output

    sec. 4.4 performance comparison

    12. both sort- and hash-based agg methods are log-order in cost

    sec 5 binary matching operators

    13. combining data: relational joins - most current db
    systems use nested
    loops and merge-joins only; hash-joins are considered
    more efficient

    sec 5.1 nested loops join

    14. simplest method; poor performance
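
    in miniature (tuples joined on their first field; the tables are
    invented):

        # naive nested-loops join: compare every row of one input with
        # every row of the other - O(|R| * |S|) comparisons
        def nested_loops_join(R, S):
            return [(r, s) for r in R for s in S if r[0] == s[0]]

        R = [(1, 'a'), (2, 'b')]
        S = [(2, 'x'), (3, 'y'), (2, 'z')]
        print(nested_loops_join(R, S))
        # [((2, 'b'), (2, 'x')), ((2, 'b'), (2, 'z'))]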

    sec 5.2 merge join

    15. two inputs are sorted on join attribute
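
    a sketch with the same invented tables, both pre-sorted on the key;
    two cursors advance in step, exactly like the merge phase of merge
    sort:

        def merge_join(R, S):          # R, S sorted on the first field
            out, i, j = [], 0, 0
            while i < len(R) and j < len(S):
                if R[i][0] < S[j][0]:
                    i += 1
                elif R[i][0] > S[j][0]:
                    j += 1
                else:                  # equal keys: emit all pairs in group
                    k = j
                    while k < len(S) and S[k][0] == R[i][0]:
                        out.append((R[i], S[k]))
                        k += 1
                    i += 1
            return out

        print(merge_join([(1, 'a'), (2, 'b')],
                         [(2, 'x'), (2, 'z'), (3, 'y')]))
        # [((2, 'b'), (2, 'x')), ((2, 'b'), (2, 'z'))]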

    sec 5.3 hash join

    16. build a hash table keyed on one input, then probe it with the
    items of the other input; I/O cost of binary hash operations is
    logarithmic
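
    the in-memory case, sketched with the same invented tables (build
    on the smaller input, probe with the larger):

        from collections import defaultdict

        def hash_join(R, S):                  # join on the first field
            table = defaultdict(list)
            for r in R:                       # build phase
                table[r[0]].append(r)
            return [(r, s) for s in S         # probe phase
                           for r in table.get(s[0], [])]

        print(hash_join([(1, 'a'), (2, 'b')],
                        [(2, 'x'), (3, 'y'), (2, 'z')]))
        # [((2, 'b'), (2, 'x')), ((2, 'b'), (2, 'z'))]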

    overall, a pretty exhaustive coverage of the algorithms and theory
    behind database internals. though it was long, the author is
    surprisingly clear on many topics. an interesting read, one that
    serves better as a reference than as a casual scientific read.


