From: Lucas Kreger-Stickles (lucasks@cs.washington.edu)
Date: Wed May 19 2004 - 12:01:31 PDT
Review by Lucas Kreger-Stickles
In this somewhat large tome, the author presents a broad compilation of information and techniques regarding large-scale database management systems.
-Sec. 1
Here the author makes a distinction between the logical algebra of the system and the physical algebra of the system. The logical algebra is closely related to the data model of the system and indicates what sorts of queries can be executed. The physical algebra, on the other hand, is system specific and deals with the details of how legal queries are executed. The author indicates that performance analysis makes no sense in the context of a logical algebra but does make sense in the context of a physical algebra.
The other major point we were to read about concerned the idea of an iterator model. Here the issue is dealing with queries that involve multiple joins. The idea is that instead of executing each one of these operations independently (as separate files or as separate operating system processes), they can instead be invoked as needed, each operator pulling results from its inputs on demand. In order to do this, the logical tree-style structure of the query must be converted into an iterative structure, and that transformation is non-trivial.
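To make the iterator idea concrete, here is a minimal sketch of my own (not code from the paper), using Python generators to stand in for the open/next/close interface: each operator pulls rows from its input only when asked, so a whole plan of scans, selections, and joins runs inside a single process.

    def scan(rows):
        # leaf operator: produce stored rows one at a time
        for row in rows:
            yield row

    def select(pred, child):
        # filter operator: pull rows from the child, pass on the matches
        for row in child:
            if pred(row):
                yield row

    def nested_loops_join(outer, inner_rows, match):
        # simple join operator: for each outer row, rescan the inner input
        for o in outer:
            for i in inner_rows:
                if match(o, i):
                    yield (o, i)

    # demand-driven execution: nothing runs until the consumer asks for rows
    emp  = [("alice", 10), ("bob", 20)]
    dept = [(10, "sales"), (20, "eng")]
    plan = nested_loops_join(select(lambda e: e[1] > 5, scan(emp)),
                             dept,
                             lambda e, d: e[1] == d[0])
    for result in plan:
        print(result)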
-Sec. 2.1
In this section we learn that sorting is an important and frequent function for DBMSs. Furthermore, the appropriate sorting algorithms are of a divide-and-conquer style. However, there is a contrast between merge sort and quicksort (the two potentially appropriate sorting algorithms) in that one divides on physical keys and sorts on logical keys (merge sort) while the other does the opposite. This difference in behavior creates different instances in which one may prove more efficient than the other.
In addition, the concept of level-0, or initial, runs is introduced, for which there are two potential techniques:
-1.) First, an in-memory sort can be used, in which every run uses all of available memory and the number of runs is roughly (size of data to sort / available RAM).
-2.) Second, one could use a technique known as replacement selection, which is attributed to Knuth (a sketch follows below). In this technique one employs a priority heap to keep memory constantly full, and fills each run file with the smallest item in memory, so long as that item is at least as large as the last item written to that file. When an item is written to a file, the next item brought in is either larger (in which case it can still be written to the same file) or smaller (in which case it is tagged to go into the next file written).
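Here is a rough sketch of my own of replacement selection over a list of keys (the function name and memory_size parameter are just illustrative, not from the paper), using Python's heapq to play the role of the priority heap:

    import heapq

    def replacement_selection(input_iter, memory_size):
        runs, current_run = [], []
        heap = []                      # items eligible for the current run
        next_run_items = []            # items tagged for the next run
        it = iter(input_iter)

        # fill memory
        for _ in range(memory_size):
            try:
                heapq.heappush(heap, next(it))
            except StopIteration:
                break

        while heap or next_run_items:
            if not heap:               # current run done; start the next one
                runs.append(current_run)
                current_run = []
                heap, next_run_items = next_run_items, []
                heapq.heapify(heap)
            smallest = heapq.heappop(heap)
            current_run.append(smallest)
            try:
                item = next(it)
                if item >= smallest:   # can still go into the current run
                    heapq.heappush(heap, item)
                else:                  # must wait for the next run
                    next_run_items.append(item)
            except StopIteration:
                pass
        if current_run:
            runs.append(current_run)
        return runs

    # on random input this tends to produce runs about twice the memory size
    print(replacement_selection([5, 1, 9, 3, 7, 2, 8, 6, 4, 0], memory_size=3))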
-Sec 2.2 Hashing
Here we learn about hybrid hashing, which "starts with the optimistic assumption that no overflow will occur, and if it does, then partitions into multiple files, only one of which is written to disk."
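Since that quote is terse, here is a simplified sketch of my own (not the paper's exact algorithm) of the hybrid-hash idea applied to duplicate removal: build an in-memory hash table optimistically, and only when it overflows evict a single hash partition to a temporary file (a plain list stands in for the file here), recursing on the spilled data afterwards.

    def hybrid_hash_distinct(items, memory_limit, num_partitions=4, depth=0):
        in_memory = set()
        spilled = {}          # partition -> list standing in for a disk file

        for x in items:
            p = hash((x, depth)) % num_partitions
            if p in spilled:
                spilled[p].append(x)     # this partition already lives on "disk"
                continue
            if x in in_memory:
                continue                 # duplicate found in memory, drop it
            if len(in_memory) >= memory_limit:
                # overflow: evict one resident partition (here, x's own) to "disk"
                spilled[p] = [y for y in in_memory
                              if hash((y, depth)) % num_partitions == p]
                in_memory -= set(spilled[p])
                spilled[p].append(x)
                continue
            in_memory.add(x)

        result = list(in_memory)
        for p, on_disk in spilled.items():
            # each spilled partition is processed later, with a fresh hash split
            result.extend(hybrid_hash_distinct(on_disk, memory_limit,
                                               num_partitions, depth + 1))
        return result

    print(sorted(hybrid_hash_distinct([3, 1, 3, 7, 1, 9, 7, 2, 8, 2],
                                      memory_limit=4)))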
-Sec 3.1 File Scan
The author indicates that read-ahead can improve this process, but that Unix operating systems do not support it, forcing many Unix databases to provide this OS-style functionality themselves.
-Sec 3.2
Here the author discusses the advantages of associative access using indices, focusing on the B-tree and its derivatives.
-Sec 4.0 (1-4)
Section 4 concerns aggregation and duplicate removal based on three techniques: nested loops, sorting, and hashing. The performance tradeoffs of these approaches are then compared. Nested loops is the most intuitive approach but does not work for large data sets; sorting is not as bad and can be improved by removing duplicates as early as possible (sketched below); hashing can be quite fast but slows down immensely if partitioning has to occur.
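To illustrate the point about removing duplicates early in the sort-based approach, here is a small sketch of my own (the names and memory_size parameter are illustrative): duplicates are dropped both while forming the in-memory runs and again while merging them, so the intermediate runs stay small.

    import heapq

    def distinct_sort(items, memory_size):
        # form level-0 runs that are already duplicate-free
        runs, batch = [], []
        for x in items:
            batch.append(x)
            if len(batch) == memory_size:
                runs.append(sorted(set(batch)))
                batch = []
        if batch:
            runs.append(sorted(set(batch)))

        # merge the runs, dropping duplicates as they meet
        result, last = [], object()
        for x in heapq.merge(*runs):
            if x != last:
                result.append(x)
                last = x
        return result

    print(distinct_sort([4, 2, 4, 1, 7, 2, 7, 7, 3, 1], memory_size=4))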
-Sec 5 (1-3) Binary matching operations
Section 5 concerns all of the different classes and techniques related to joins. The major techniques discussed are the same as for aggregation and duplicate removal: namely, nested loops, merging, and hashing.
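As one concrete instance of the merge-based technique, here is a brief sketch of my own (assuming both inputs are already sorted on the join key, and that join keys are unique on the right-hand input; the names are illustrative):

    def merge_join(left, right, key_l, key_r):
        # both inputs sorted on their join keys; advance whichever side is behind
        results = []
        i = j = 0
        while i < len(left) and j < len(right):
            kl, kr = key_l(left[i]), key_r(right[j])
            if kl < kr:
                i += 1
            elif kl > kr:
                j += 1
            else:
                results.append((left[i], right[j]))
                i += 1    # right-hand keys are unique, so advance the left side
        return results

    emp  = [("alice", 10), ("bob", 20), ("carol", 20)]
    dept = [(10, "sales"), (20, "eng")]
    print(merge_join(emp, dept, key_l=lambda e: e[1], key_r=lambda d: d[0]))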
-Sec 8
Section eight concerns the execution of complex query plans and the techniques used to do so. One of the major focuses of this section is the performance enhancement that is possible if one attempts to execute multiple subqueries in a concurrent and overlapping fashion.
Thoughts:
Overall I thought that this paper was surprisingly easy to read (while at times a bit dense). I imagine that it serves as a very useful reference to have around, not only when one is trying to construct effective databases but also to get a nice concise glimpse at what has been tried and what has succeeded before in the realm of databases. In particular I felt that the text gave a good feel for the fact that when dealing with such large sets of data, the smallest of changes in how one does something can have a dramatic effect on efficiency. Finally, I was really interested to see how much the algorithms depend on system-specific details (such as available memory, etc.) to run as efficiently as possible.
-Lucas Kreger-Stickles