CSE 594

Autumn 1999

Project Part 1

The Minirel File Manager

Due Wednesday November 10 at beginning of class

Introduction

In this part of the project, you will implement a File Manager for Heap Files that also provides a scan mechanism that will allow you to search a heap file for records that satisfy a search predicate called a filter. How are heap files different from the files that the DB layer of Minirel provides ? The main difference is that the files at the DB layer are physical (a bunch of pages) and not logical. The HeapFile layer imposes a logical ordering on the pages (via a linked list). In addition, while both the DB and Buffer Manager layers consider pages to be simply a chunk of bytes, the Heap File layer imposes a slotted-page structure on top of each page. This is necessary for accommodating variable length records.

Classes

The Heap File layer contains three main classes: a Page class for describing the structure and the operations that can be performed on a slotted page, a HeapFile class for inserting and deleting records from a heap file, and a HeapFileScan class (derived from the HeapFile class) for scanning a heap file (with an optional predicate). You will implement the HeapFile and HeapFileScan classes. The detailed description of each of these classes follows.

The Page Class

The following is the definition of the Page class (given in page.h). If you like, you may read Section 7.6 of the text for a little background, but it's not necessary since you will only be using, not implementing, the slotted page structures.

const unsigned PAGESIZE = 1024;
const unsigned DPFIXED= sizeof(slot_t)+4*sizeof(short)+3*sizeof(int);

private:
    char data[PAGESIZE - DPFIXED];
    slot_t slot[1]; // first element of slot array - grows backwards!
    short slotCnt; // number of slots in use;
    short freePtr; // offset of first free byte in data[]
    short freeSpace; // number of bytes free in data[]
    short dummy; // for alignment purposes
    int prevPage; // backwards pointer
    int nextPage; // forwards pointer
    int curPage; // page number of current pointer

public:
void init(const int pageNo); // initialize a new page
void dumpPage() const; // dump contents of a page

const int getNextPage() const; // returns value of nextPage
const int getPrevPage() const; // returns value of prevPage

// sets value of nextPage to pageNo
void setNextPage(const int pageNo);

// sets value of prevPage to pageNo
void setPrevPage(const int pageNo);

// inserts a new record (rec) into the page, returns RID of record
const Status insertRecord(const Record & rec, RID& rid);

// delete the record with the specified rid
const Status deleteRecord(const RID & rid);

// returns RID of first record on page
const Status firstRecord(RID& firstRid) const;

// returns RID of next record on the page
const Status nextRecord (const RID & curRid, RID& nextRid) const;

// returns reference to record with RID rid
const Status getRecord(const RID & rid, Record & rec);
};

Description of Private Data Members of Page Class

The Page class uses a slotted page approach for organizing variable length records. When a record is in memory, it is convenient for the higher layers to deal with a record in the following form.

struct Record
{
void* data;
int length;
};

However on a page, records are stored differently. Only the data parts of records are stored one after another starting from byte 0 of a page while the slot array (which contains pointers to the start of records) grows from the bottom of the page towards the top. There are several important things to notice. First the data[] area in a page is used to hold both records and all but the 0th element of the slot array! Secondly, since the slot array grows towards the "top", it is indexed with the values 0, -1, -2, -3, ... and not 0, 1, 2, 3 ... This is very important ! Each element of the slot array has the structure:

struct slot_t {
short offset;
short length; // equals -1 if slot is not in use
};

The offset records the start of the record from the beginning of data[] and the length equals its length. The slotCnt attribute is used to indicate the current end of the slot array. It is initialized to 0 (by the init() member function) and everytime a record is added to the page, its value is decremented by 1. freePtr is the first free byte in data[]. freeSpace keeps track of how much free space is available in data[]. dummy is there for alignment purposes only. The nextPage and prevPage members are used to for a doubly linked list of the pages in the file. Finally curPage is the page number of the current page. As records are deleted, the resulting hole should be compacted. However, since records are addressed via a (pageNo, slotNo) pair called RID, we cannot compact the slot array (except in one case). RIDs have the following structure:

struct RID{
int pageNo;
int slotNo;
};
The pageNo identifies a physical page number (something that the buffer manager and the DB layers understand) in the file. The slotNo specifies a slot in the slot array on the page. The only tricky thing is that the slotNo field of an RID is always positive while the slot array is itself indexed using a negative number, so the sign has to be flipped when converting from one form to another. As an aside, in a real DBMS, RIDS usually have a third component that indicates the storage volume or file.

Description of Member Functions of the Page Class

The Page class has no constructor or destructor. Why is this ? Well recall that in the BufMgr class, the bufPool is defined to be an array of Pages. Whe the bufPool is first created, the Page class constructor would get called on each Page instance in bufPool. This is not when we want the "constructor" to be called. Instead we want it to be called when first allocate an empty page in the file (usually in the HeapFile::insertRecord method). To get around this problem, the Page class has no constructor. Rather an init() method is provided.

void init(const int pageNo)

Initializes the structure of a page. The pageNo parameter is used to set the curPage attribute.

void dumpPage()

Prints the contents of the data structures of the page (but not the contents of the records on the page). This might be useful for debugging.

const int getNextPage() const

Returns the value of the nextPage attribute. Note that the pageNo of the next page need not be consecutive to the curPage number. This is because the order of pages in the heap file is determined by the linked list of pages and not by the page numbers. The File class in the I/O layer might come up with any pageNo when allocating a new page. The page numbers are used merely as pointers.

const int getPrevPage() const

Returns the value of the prevPage attribute.

void setNextPage(const int pageNo)

Sets value of nextPage to pageNo.

void setPrevPage(const int pageNo)

Sets value of prevPage to pageNo.

const Status insertRecord(const Record & rec, RID & rid)

Inserts the record rec in the page. The data and length portions of the record are copied into appropriate postions in the page. Returns the RID of the record via the rid parameter. Returns OK if no errors occurred, NOSPACE if there is not sufficient room on the page to hold the record.

const Status deleteRecord(const RID & rid)

Deletes record with RID rid, compacting the hole created. Compacting the hole , in turn, requires that all the offsets (in the slot array) of all records after the hole be adjusted by the size of the hole. The slot array can be compacted only if the record corresponding to the last slot is being deleted.
Returns OK if no errors occurred, INVALIDSLOTNO if the rid is invalid (i.e. there is no such record in the page), NORECORDS if no more records are left in the page (this is not really an error condition).

const Status firstRecord(RID& firstRid) const

Returns the RID of the first record of the page (via the firstRid parameter). Remember that slot[0] (and possibly slot[-1], ...) might be empty due to deletions. Scans the slot array backward starting at 0, looking for a non-empty slot (an empty slot is indicated by a length field of -1).
Returns OK if no errors occurred, NORECORDS if the page was empty.

const Status nextRecord (const RID & curRid, RID& nextRid) const

Returns (via the nextRid parameter) the RID of the next record on the page after curRid. Returns OK if no errors occurred, ENDOFPAGE if curRid was the last record on the page.

const Status getRecord(const RID & rid, Record & rec)

Returns the length and a pointer to the record with RID rid. You can get recLen from the proper entry in the slot array. However recPtr needs to be an address and not just the offset from the beginning of data[]. This means that the upper layers can directly modify part of the page frame that contains the record. Returns OK if no errors occurred, INVALIDSLOTNO if the rid is invalid (i.e. if there is no such record on the page).

The HeapFile Class

The HeapFile class is used to implement heap files on top of the physical files provided by the DB class. Each heap file consists of a special header page plus one or more data pages. This header page is different from the DB header page for the file (you should not even be aware of the DB header page) and its contents are decided and manipulated by the HeapFile class. The HeapFile level header page has the following structure :

struct HeaderPage
{
char fileName[MAXNAMESIZE];    // name of file
int firstPage;                // pageNo of first data page in file
int lastPage;                 // pageNo of last data page in file
int pageCnt;                  // number of pages
int recCnt;                   // record count
};

The meaning of most of the fields in this structure should be obvious from their names. lastPage is used to keep track of the last page in the file. When someone tries to add a record to the file, this is the place likely to have space. When a HeapFile is instantiated, you should read the header page for the file into the buffer pool and keep it pinned until the file is closed. Rather than keeping track of whether it got updated while it was pinned, always unpin it with dirty = true. It is possible that the same file may be opened more than once at the same time. The HeapFile class need not be aware of this since this will simply result in the header page being pinned in the buffer pool multiple times and will not be flushed out until the last instance of the HeapFile has been closed. All this logic is handled by the BufMgr class. The firstPage attribute of the header page points to the first data page. The data pages are instances of the Page class and they are linked together using a doubly linked list (via the prevPage and nextPage members).

// class definition of HeapFile
class HeapFile {
protected:

File*       file;            // pointer to underlying DB File object
HeaderPage* headerPage;      // pointer to header page in buffer pool
int         headerPageNo;    // page number of header page

public:

// initialize
HeapFile(const string & name, Status& returnStatus);

// destructor
~HeapFile();

// return number of records in file
const int getRecCnt() const;

// insert record into file
const Status insertRecord(const Record & rec, RID& outRid);

// delete record from file
const Status deleteRecord(const RID & rid);
};

The private data members of the HeapFile class include file: a pointer to the File object that results from a call to DB::openFile(), headerPage: a pointer to the pinned header page of the file in the buffer pool, and headerPageNo: the page number of the header page of the HeapFile. The following are the descriptions of the public member functions.

HeapFile(const string & name, Status& returnStatus)

The actions performed by the constructor depend on whether or not the file already exists. If the file does not exist, the constructor first creates a file instance by invoking DB.createfile(name). Next the file is opened and the HeaderPage for the file created and initialized. To get space to hold the header page, do the following:

Page* tmpPage;
status = bufMgr->allocPage(file, headerPageNo, tmpPage)
// allocate a page to hold the header page
// check status for errors
headerPage = (HeaderPage *) tmpPage; // cast Page* to HeaderPage*

If the file already exists, the constructor simply reads the header page for the file into the buffer pool by executing the following:

// File already exists, get header page into buffer pool
Page* tmpPage;
status = file->getFirstPage(headerPageNo); // check status for errors
status = buf->readPage(file, headerPageNo, tmpPage);
// check status for errors
headerPage = (HeaderPage *)tmpPage;

You can distinguish between the two cases by simply trying to open the file. If the open fails, then the file does not exist and must be created ! Note that the constructor returns status information through the returnStatus parameter.

~HeapFile()

The destructor unpins the header page and calls DB::closeFile().

const int getRecCnt() const

Returns the number of records currently in the file.

const Status insertRecord(const Record & rec, RID& outRid)

Appends Record rec to the file. Uses the HeaderPage to locate the last page in the file which is then read into the buffer pool and pinned. Then the Page::insertRecord() method is invoked on the last page. In case the last page is full, allocates a new Page (and inserts it at the end of the doubly linked list of pages) before calling Page::insertRecord. You also need to handle the case when the file is empty. Be careful not to leave anything pinned when this function returns (except the header page). The RID of the record inserted is returned via the outRid parameter. Returns OK if no errors occurred or the error code of the first error that occurred.

const Status deleteRecord(const RID & rid)

Deletes the record with RID rid from the file by calling Page::deleteRecord on the appropriate page. The page must of course be first read in and pinned in the buffer pool. In addition if the deletion of the records results in an empty page, the page must be deallocated and removed from the doubly linked list of pages. Returns OK if no errors occurred or the error code of the first error that occurred.

The HeapFileScan Class

The HeapFile class defined above is used for basic operations like insertion and deletion of records. For retrieving records from the file we must use the HeapFileScan class (which also represents a HeapFile since it is derived from the HeapFile class). The HeapFileScan class can be used in three ways :

To retrieve all records from a HeapFile.
To retrieve only those records that match a specified predicate.
To retrieve a record with a given RID.

Several HeapFileScans may be instantiated on the same file, and this is ok. Each HeapFileScan will cause a separate pin on the header page in the buffer pool, but the Buffer Manager layer should be able to handle this transparently as long as all HeapFileScan objects are deleted properly.

class HeapFileScan : public HeapFile
{
public:

// initiate a filtered scan
HeapFileScan(const string & name,
               const int offset,
               const int length,
               const Datatype type,
               const char* filter,
               const Operator op,
               Status & status);

// end filtered scan
~HeapFileScan();

// read record from file, returning pointer and length
const Status getRecord(const RID & rid, Record & rec);

// marks current page of scan dirty
const Status markDirty(const RID & rid);

// return next record
const Status scanNext(RID& outRid);

// read record from file, returning pointer and length
const Status getRandomRecord(const RID &rid, Record & rec);

private:

RID   curRec;            // rid of last record returned
Page* curPage;           // pointer to pinned page in buffer pool
int   curPageNo;         // page number of pinned page
bool dirtyFlag;         // true if page has been updated
int   offset;            // byte offset of filter attribute
int   length;            // length of filter attribute
Datatype type;           // datatype of filter attribute
const char* filter;      // comparison value of filter
Operator op;             // comparison operator of filter

const bool matchRec(const Record & rec) const;
};

Private Data Members of HeapFileScan

curRec is the RID of the last record returned by the scan. curPage is a pointer to the buffer pool page containing record curRec and curPageNo is its page number. To make scans fast, we keep the current page pinned until we have processed all the records on the page. Then the page is unpinned and the next page is read into the buffer pool and pinned. dirtyFlag represents the update status of the page which must be supplied to the Buffer Manager when the page is unpinned.

The other fields have to do with conditional or filtered scans. length is the size of the field on which predicate is applied and offset is its position within the record (we consider fixed length attributes only). The type of the attribute can be STRING, INTEGER or FLOAT and is defined in heapFile.h Similarly all ordinary comparison operators (as defined in heapFile.h) have to be supported. The value to be compared against is stored in binary form in filter. For example, if we want to evaluate salary < 20,000 then 20,000 will be stored as a 4 byte array in filter.

HeapFileScan(const string & name, const int offset, const int length, const Datatype type, const char* filter, const Operator op, Status & status)

This method initiates both unconditional and filtered scans. If filter == NULL, an unconditional scan is performed. Otherwise, the data members of the object are initialized with the parameters to the constructor. In either case, the HeapFile with the appropriate name is opened by giving a call to the HeapFile constructor. The status of all these operations is indicated via the status parameter.

~HeapFileScan()

Shuts down the scan and unpins curPage.

const Status scanNext(RID& outRid)

Returns (via the outRid parameter) the RID of the next record that satisfies the scan predicate. The basic idea is to scan the file one page at a time. For each page, use the firstRecord() and nextRecord() methods of the Page class to get the rids of all the records on the page. Convert the rid to a pointer to the record data and invoke matchRec() to determine if record satisfies the filter associated with the scan. If so, return rid. To make things fast, keep the current page pinned until all the records on the page have been processed. Then continue with the next page in the file. Returns OK if no errors occurred. Otherwise, return the error code of the first error that occurred.

const Status getRecord(const RID & rid, Record & rec)

Returns (via the rec parameter) the record identified by rid. The page containing the record should already be pinned in the buffer pool (by a preceding scanNext() call). If not, return BADPAGENO. A pointer to the pinned page can be found in curPage. Just invoke Page::getRecord() on this page and return its status value.

const bool matchRec(const Record & rec) const

This private method determines whether the record rec satisfies the predicate associated with the scan. It takes care of attributes not being aligned properly. Returns true if the record satisfies the predicate, false otherwise. We will provide this method for you.

const Status markDirty(const RID & rid)

Marks the current page of the scan dirty by setting the dirtyFlag. The dirtyFlag should be reset every time a page is pinned. Make sure that when you unpin a page in the buffer pool, that you pass dirtyFlag as a parameter to the unpin call. Returns OK if no errors. The page containing the record should already be pinned in the buffer pool.

const Status getRandomRecord(const RID &rid, Record & rec)

Retrieves an arbitrary record from a file. If the specified record is not on the currently pinned page, the current page is unpinned and the required page is read into the buffer pool and pinned. The difference between this method and the getRecord() method above is that getRandomRecord() does not assume that the requested record is on the page currently pinned by the scan. Returns OK if no errors. Otherwise, the first error occurred.

Getting Started

Start by making a copy of the directory ~test594/project/part1 in your home directory. In it you will find the following files :

Makefile - A make file.
buf.h - Class definitions for the buffer manager. You should not change this.
buf.C - Implementation of the buffer manager. You should not change this.
db.h - Class definitions for the I/O layer. You should not change this.
db.C - Implementation of the I/O layer. You should not change this.
error.h - Error codes.
error.C - Error class implementation.
page.h - Definitions for the Page class.
page.C - Implementation of the Page class.
heapfile.h - Definitions for the HeapFile and HeapFileScan classes.
heapfile.C - Skeleton implementations for the above classes. You should complete these.
testpage.C - Test driver for the Page class.
testfile.C - Test driver for the HeapFile and HeapFileScan classes. You can augment this.

Error Checking

We have defined a class called Error in the files error.h and error.C. You can create an instance of this class (e.g. Error err;) and then print error messages from any method of any class by invoking err.print(status). As you develop new classes you should add new error codes and messages in the Error class. Be sure to check the return codes of each function that you call and make sure that all functions return some status. ALL error messages should be printed using the Error class.

Each function that you implement must do some error checking on its input parameters to make sure that they are valid. If they are not, then the function should return an appropriate error code. Each function should also check the return status of every function it calls. If an error occurred, you should just return this code to the higher level calling function. Remember to do any necessary cleanup while doing a premature exit from a function. Usually, you should not print out error messages in any of the functions you implement. These should only be printed by the top level calling function in testfile.C for example. The intermediate functions in the HeapFile layer should just relay the error codes to the higher layers.

Coding and Testing

We have defined this project in such a way that you can reap the full benefits of object-oriented programming using C++. Your coding style should continue this by having well-defined classes and clean interfaces. Reverting to the C (procedural) style of programming is not recommended. The code should be well-documented. Each file should start with the names of the team members and should explain the purpose of the file. Each function should be preceded by a few lines of comments describing the function and explaining the input and output parameters and return values.

Testing for correctness involves more than just seeing if a few test cases produce the correct output. There are certain types of errors (memory errors and memory leaks) that usually surface after the system has been running for a longer period of time, so you should run a few fairly robust tests. If you have programmed in Java you should keep in mind that C++ does not have automatic garbage collection, so each new must ultimately be matched by a corresponding delete. Otherwise all the memory in the system might be used up.

Handing In

As announced in class, please email it to Stefan (tzoompy@cs), with your group number in the subject line or in the text. (Don't know your group number? Check here.) If possible, please include the files as attachments to your message. Submit heapfile.C and, optionally, error.C and error.h. If you didn't make any changes to error.[Ch] then we'll just use the ones we provided. We will compile your heapfile manager, link it with our test driver, and test it. Since we are supposed to be able to test your code with any valid driver (not only testfile.C) it is very important to be faithful to the exact definitions of the interfaces as specified here.

GOOD LUCK AND HAVE FUN ! ! !