CSE 451, Introduction to Operating Systems, Spring 2012

Project 3 - Undelete

Assigned: Friday May 18
Due: Saturday June 2 at 11:59 p.m. (electronic submission)

Project introduction

This assignment aims to recover from the following common situation: you've just finished writing an essay/report/paper due in the next hour, but before you print it you accidentally manage to delete it. It would be very convenient to have a program for such situations capable of undeleting files, and that, in fact, is the goal of this project.

Implementations of undelete are very file system-type specific. We're going to write undelete for the ext2 file system, which was released close to 20 years ago now. The reason is that the file system layout in ext2 is very similar to that of its successors, ext3 and ext4, so it's pertinent to today's systems even if no one is using ext2 any longer. Undelete is not possible for ext3 and ext4 without looking through the filesystem journal, which is a more complicated task than we would like to assign to you.

This project leaves most everything up to you. The coding portion is relatively small; figuring out what to do is about half the assignment and getting it implemented is the other half.

Project plan

After reading over the following description of the project and browsing through the linked ext2 resources, we invite you and your fellow group members to come up with a plan for implementing your project. As "simple" as ext2 is, it's still a complex filesystem, so there is a lot of room for poorly-though-out code to go very far astray. If you submit a plan for your project to the TAs via email by the start of class on Wednesday 5/23, they will review it and give you feedback, which will hopefully save you from some major headaches close to the deadline. Furthermore in writing a plan you will be forced to learn the specifics of ext2, which is a great way to familiarize yourself with it prior to doing much or any coding.

Project files

To start the project, you'll need to download a set of files that I have prepared. This includes:

ext2fs.h: Include this file in your .c file for common ext2 definitions (it includes the ext2_types.h header). You should rely on the macros provided here where possible instead of doing arithmetic related to sizes and offsets yourself in most cases.
fileIOExample.c: This is a simple example of doing file IO in C, in case you are unfamiliar with how it works.
filesysFile-simple and filesysFile-easy: These are two sample ext2 filesystems on which you can start testing your code. You will definitely need to test more complex filesystems than these.
mkFilesysFile.sh: This is a script for creating and mounting ext2 filesystems. It takes two parameters: a file name and the number of 1kiB blocks to use. You'll need to specify at least 60 blocks for it to succeed. Furthermore, you will need root access to mount a filesystem, so you'll need to use either a personal machine or the course virtual machine image.

Your task

You will be creating a program to recover deleted files in ext2 filesystems. In ext2, when a file is deleted, the file's inode and data blocks are marked as free, its directory entry is invalidated, the deleted time of the file's inode is set, and the number of hardlinks in the inode is set to 0. The key thing here, though, is that the file's inode and data blocks are still preserved, at least until the filesystem ends up reclaiming them. You will write a utility in C called undelete that is capable of recovering these files as follows:

Your undelete executable should take the name of a filesystem file as an argument, e.g. ./undelete [ext2fs-filename].
For every deleted file in the specified filesystem, write the file to recovered_files/[ext2fs-filename]/ relative to the directory from which undelete was invoked (i.e. as a file in the real filesystem and not the one you're processing).
Name each file file-nnnnn, where nnnnn is the inode number of the deleted file.
Restore each file's access time, modified time, and contents. For example, suppose the file sample.txt has inode number 12 and has been deleted from sample-filesystem. Running the undelete program on sample-filesystem should create a file recovered_files/sample-filesystem/file-00012 with the access and modification timestamps of sample.txt and the same contents. See man 3 futimes for help with changing access and modification times.
For deleted inodes that do not correspond to files, take no action.
Restore only files that have been deleted. You will need to make use of the inode bitmap for this.
There is a possibility that part of a deleted file has been reclaimed for use by something else. The goal of this project is not to build an extremely robust file recovery solution, so you don't need to worry about handling this gracefully.
You will need to handle multiple level of data block indirection as well as multiple block groups. The provided filesystems do not exercise these features of ext2 at all, so it is up to you to create sample filesystems that do.
Do not seek unnecessarily through the filesystem file. In particular, you should not need to visit every single inode if there are many non-deleted files and only one deleted file.
You may not keep an unbounded number of inodes and data blocks in memory at once (or any structure for that matter). In particular, you should be able to restore a 2 GiB filesystem with about 2 GiB worth of deleted files while using the same (small) amount of memory as if you were restoring a 10 MiB filesystem. You have some flexibility here; just make sure that the number of inodes and data blocks kept in memory is small relative to the size of the filesystems.
You may not steal code from existing file recovery tools. Most build on existing frameworks, which isn't going to help you anyway, so just don't do it.

Using the filesystem creation tool

The filesystem creation tool provided with the starter code above is fairly simply, and you are free to customize it as you like. Here is what it contains:

#!/bin/bash

if [ $# -ne 2 ]; then
    echo "Usage: $0 [filename] [size-in-1kB-blocks]"
    exit 1
fi

dd if=/dev/zero bs=1024 count=$2 of=$1 && mkfs.ext2 -F -b 1024 -c $1 && \
  tune2fs -c0 -i0 $1 && mkdir -p mnt && sudo mount -o loop $1 mnt

echo

if [ $? -eq 0 ]; then
    echo "Mount completed successfully! Use 'sudo umount mnt' to unmount"
else
    echo "Mount failed...please try again."
fi

The dd command copies data from a device (in this case /dev/zero, which outputs 0s) to an output file using a given number of blocks of a certain size.
The mkfs.ext2 command creates an ext2 filesystem in the given file with blocks of the given size (1024 bytes).
The tune2fs command prevents the OS from running fsck on the filesystem automatically, since that could modify the filesystem in ways you don't expect.
The mkdir command creates a mount point for the filesystem.
The mount command mounts the filesystem using mnt as a mount point. Use this command by itself to mount an already-created filesystem.

mount requires root privilege, so you will only be able to run this on your own machine or on a VM image.

When you create a filesystem and mount it, you will not necessarily see the changes to the filesystem file until you unmount the filesystem. For example, let's say that you create and mount a filesystem file named test. You then create a file on it through echo "Some file contents" > mnt/sample.txt and delete the file with rm mnt/sample.txt. Before you run your undelete tool to try to recover the file, first run sudo umount mnt to flush the changes to disk. To mount the filesystem again, use sudo mount -o loop test mnt. If you don't unmount the filesystem, you won't necessarily see the changes reflected in it.

Debugging

An important part of this project is debugging. You're going to run into off-by-one errors--it's pretty much guaranteed, given that block and inode numbering is kind of funny. If it doesn't seem like you're reading the right data from the filesystem file for the block or inode number you are using, there are a couple different tools to help you. First of all, make liberal use of debug printing. To make a switch for enabling or disabling printing of debug statements, add some code like this near the top of your source file:

#define CSE451_DEBUG
#ifdef CSE451_DEBUG
#define LOG_PRINTF(...) printf(__VA_ARGS__)
#else
#define LOG_PRINTF(...)
#endif

Any macro name is fine, really, as long as you don't override any existing or commonly-used macros. Use LOG_PRINTF just as you would printf. To disable debug output, simply remove the #define CSE451_DEBUG line.

After setting up debug printing through something like the above, when you go to read something from disk, print out its absolute offset within the file as a hexadecimal value (e.g. 0x4000). If what you read at that point doesn't match your expectations, open the filesystem file using a hexeditor such as hexedit. Look at the data around that location. Was the data close by? Try to identify whether you were off by one index or whether you were looking in the wrong place entirely.

There is a tool called dumpe2fs that will be very useful to you as well (see man dumpe2fs. It prints a variety of information about offsets, which inodes are free, etc. and you can (and should) use this for figuring out what the expected values at various points in your undelete utility are.

When you want to test recovery of large files, try using image files. If you are able to recover a deleted image file, it should appear as before, and you should quickly be able to tell whether any of the data was corrupted.

ext2 references

There is a wealth of information about the layout of ext2 filesystems online. In order of helpfulness, I would recommend reading the following pages:

http://www.nongnu.org/ext2-doc/ext2.html: If you were going to use only one reference for this project, it would be this one. It covers all attributes of ext2 that you need to know.
http://www.tldp.org/LDP/tlk/fs/filesystem.html: A description of ext2 from a book about the Linux kernel from quite some time ago.
http://homepage.smc.edu/morgan_david/cs40/analyze-ext2.htm: Somewhat less technical than the above two sources and walks through some helpful examples.

Turnin procedure

Submit the following files using the turnin tool under a directory named [group-name]-project3, as you have with the other projects:

Makefile. Running make should built an executable named undelete in the same directory.
.c and .h source files, including the provided header files.
A README file in which you list:
1. Your group name (e.g. cse451x).
2. The names and CSE Net IDs of you and your group members.
3. One or two paragraphs describing the design of your undelete tool and the testing that you performed.