Spring 2000

CSE 326: Program #2
due Friday, May 5th
B+ Tree Implementation

Implement a B+Tree ADT with three operations: find, insert, and range-query.

Insert will insert a data record into the B+Tree while maintaining proper B+Tree structure.
Find will return a data record given a key.
Range-query will take two arguments low and high, in the domain of the keys, and return a list of all data records with keys between low and high.

You will test your B+Tree implementation using a dataset provided with the project. The dataset will be on disk, but your B+Tree and the records to which its leaves point will be in memory. The tree will be initially empty. After inserting a large number of records, you will use the find and range-query operations to perform some simple selections of data records. You do not need to implement an elaborate user-interface to your B+Tree. Instead you will write a simple function (probably main) that will insert all the records, run the required selection operations, and output the results, all of which will be very simple once the B+Tree is in place.

The sample data is provided as a file in comma-separated-value (csv) format, and looks like this:

TITLE,FIRST NAME,LAST NAME,ID,EMPLOYER,OCCUPATION,DATE,
AGGREGATE_YTD,AMOUNT,CITY,STATE
Mrs.,John,Broussard,1086894,Info. Requested,Info. Requested,8/17/99,100,100,Abbeville,LA
Mr.,Dennis,Batteen,1037504,Self,Attorney,8/11/99,25,25,Aberdeen,SD
Hon.,William,Arnot,1015866,Info Requested,Info Requested,8/4/99,400,100,Abilene,TX
Mr.,Jeoffrey,Bodenhorst,1065745,Lebanon Apparel,Management,8/25/99,1000,1000,Abingdon,VA
Mr.,Norman,Audette,1019609,,Retired,8/24/99,20,20,Abington,MA

Details

Many of the implementation details will be left up to you. You are permitted (in fact, encouraged) to use STL classes like vector, but no tree-related classes. Your primary resource should be the notes handed out in class which contain pseudo-code for the B+Tree operations you are asked to implement.

The B+Tree will be indexed using the 'ID' field. In the sample data, this key is unique. (See the extensions below.)

The order of the B+tree should be five: each internal node will contain 4 keys and 5 pointers. This parameter of your tree should be a #define constant.

The leaf nodes of your B+Tree should contain 4 keys and 4 pointers to data records. The number of pointers should be a #define constant.

You will design a struct or class to contain a single record of data, consistent with the provided sample data. Also, you will probably want to write a simple function to display a single record.

Your B+Tree class does not have to be templatized: it can be designed specifically for the record structure you devise above.

You will be provided with a simple class, FileParser, to assist in the file I/O. This class will enable you to extract the fields of the data file one-by-one without commas. The fields will be returned sequentially until the end-of-file without regard for line breaks. It will therefore be your responsibility to collect a bunch of fields and form a data record, but the number of fields is fixed so this is easy.

We have included a more elaborate Makefile than previously. It is more complicated, but should be easier to use. It is well-commented so it should be clear how to add sources. It also has some handy additional targets like "make clean" which removes all object files and executables.

There are too many implementation details and design choices to list them all here. Feel free to raise general issues on the class list. If critical additional guidelines emerge, they will be added to this project description.

Final Output

Once you have your B+Tree working, you should run the following operations in main or some other function:

Find and display the data record with key value 1026156.
Find and display the data record with key value 1070741.
Find and display all data records with key values between 1062444 and 1062488, inclusive.
Find and display all data records with key values between 1055027 and 1055500, inclusive.
Find and display all data records with key values between 1016000 and 1017000, inclusive.

Turn-in procedures

You will be required to turn-in a paper copy of your code, as well as submit your source code electronically. Your code should include the following block at the top of each source file:

TITLE:
AUTHOR:
DATE:
PURPOSE:
ARGUMENTS:

and for each function or class:

TITLE:
PURPOSE:
ARGUMENTS:

In addition, you must include a detailed explanation of the structures they have designed in the comments at the top of your source files. The paper submission will take place in class on Friday May 5th, and the electronic submission should be completed BEFORE this time. Please remember that your code must compile on the Linux instructional machines, and you should submit a makefile along with your source code. Turn-in Instructions

Files

All the files for this project are included in this tar file:
project2.tar
Use "tar xvf project2.tar" unpack it.

Here are the individual files:
Source Files:
FileParser.cpp
FileParser.h
Makefile

Sample Data:
sample.csv

Extensions (optional)

Modify your B+Tree to accomodate indexing by redundant keys (such as AMOUNT or STATE). Test it.

Devise a convenient way to display the structure of your B+Tree. Implement it, at least for datasets of reasonable size.

Additional Notes

Handout about the dynamic_cast operator
Sample program using vector Output of the sample program