CSE 326, Spring 1997: Homework 7
Due 5/22/98
- (20 points)
The purpose of this problem is to analyze and compare B-trees and
extendible hashing as two alternatives for maintaining a large data base.
Imagine that you have been asked to to implement a new IRS database.
There are 50,000,000 records, each of which take an average of 2,000 bytes.
The records are keyed with a Social Security Number which is 4 bytes.
The computer that you will be using has 2,048 byte pages and pages are
addressed with 4 byte integers.
Assume that the computer has 16 MB of memory that is usable for storing
all or part of a search structure.
-
For extendible hashing, keys are mapped to 4 byte hash values
by a one-to-one mapping. So each Social Security Number has a unique
hash value.
Each page in extendible hashing contains a number of entries, each of which
has a complete hash value together with a disk address of an actual
record. In addition, each page contains a one byte integer indicating
what portion of the complete hash value is relevant to the page.
The hash table that fits in main memory has 2^k entries for some k.
An index into the hash table is interpreted as a prefix of a complete
hash value. An entry in the hash table contains a disk address.
-
Calculate the largest extendible hash table that can fit in 16 MB
(2^{24} bytes) of
memory.
-
Calculate the maximum number of entries that can fit
on a page on disk.
-
For the largest extendible hash table that
can fit in memory explain how to
calculate the probability that it would overflow
with the 50,000,000 records.
-
For the B-tree, disk addresses are used for pointers, and a 2 byte integer
is used to keep track of the length of the active area in the B-tree node.
The leaves of the B-tree contain pointers to the actual records. We
assume that the root is about half full and that the nodes of the
B-tree are about 3/4-th full on average.
-
Calculate the maximum order of the B-tree so that the each node fits into
a page.
-
Calculate the height of the B-tree with that order so that
all the 50,000,000 records can be accessed.
-
Calculate how many levels
of the B-tree can fit into the memory of the computer. Several full levels
and the portion of one level will fit into memory.
-
Calculate how many
disk accesses in the worst case
are necessary to find a specific record using this search
structure. How many on average?
-
For this application what data structure would you recommend for accessing
data in the new IRS database. Justify your answer. (There is no absolute
best answer for this question, so you might consider listing the pros and cons
for each implementation.)