Project 2 - Concurrency, IPC, and Synchronization

Version 1.2 (1/25 - length 0 files do NOT match other length 0 files)
Version 1.1 (1/25 - md5lib.a now available)
Out: Thursday January 24th, 2002
Due: Wednesday February 6th, 2002
(Small extra credit for turn-in by class time Monday, February 4th)

Assignment Goals

Background

There has been a great deal of interest in the research community recently in the amount of duplication of data: for example, the same file may be stored on multiple machines, the same data may be sent across networks to multiple recipients or multiple times to the same recipient, or a single machine may store multiple independent copies of the same data.  Functionally, this assignment is about the last of these: how common is it for distinct files on a single machine to contain identical data?  (Educationally, this assignment is about the points listed under Goals.)

The high-level task is to write a C or C++ program that  walks over an entire file sub-tree, starting with the current working directory, and determines which files (if any) are duplicates of each other.  To do this, you'll need to do a pair-wise comparison of all the files you encounter.  Sounds awful?  There's a trick.

If you do a simple pair-wise comparison of N files of length L, the time required is O(N2L).  In practice, the running time can be greatly speeded up by using "cryptographic hashes" of the file contents.  Here is a  quote from rfc1321 about MD5, the cryptographic hash function we will be using:

"The [MD5] algorithm takes as input a message of arbitrary length and produces as output a 128-bit "fingerprint" or "message digest" of the input. It is conjectured that it is computationally infeasible to produce two messages having the same message digest, or to produce any message having a given prespecified target message digest. The MD5 algorithm is intended for digital signature applications, where a large file must be "compressed" in a secure manner before being encrypted with a private (secret) key under a public-key cryptosystem such as RSA."

What this means to us is that the odds that two files with different contents will hash to the same value is vanishingly small, and so comparison of file hash values is (probabilistically) enough to determine whether or not their contents are the same.  We use that to speed up the file comparison process.

Meta-Code

Here's what your program needs to do:

  1. Create two data structures, one holding records of "files to be visited" and one of "files already visited."
  2. Initialize them: to be visited is initialized to contain a record indicating the current working directory.  already visited is initialized to empty.
  3. while to be visited is not empty, loop over steps 4-8
  4. get the next to be visited record
  5. determine if the file is a directory, a symbolic link, or a data file (Hint: man lstat)
  6. if it's a symbolic link, ignore it
  7. if it's a directory, enumerate the files it contains (Hint: man opendir) .  For all files other than "." and "..", place records about them in the to be visited data structure
  8. if it's a non-empty data file, compute the MD5 hash of the file contents and check for equality against the hashes of all files in the already visited data structure.  Add the just visited file to already visited.
  9. print out statistics on what you found

Computing MD5 Hashes

You could write code to compute MD5 hashes, or you could make use of an already written, already debugged implementation written by someone else.  For many reasons, including that it's part of this assignment, you've chosen the latter. 

A (Linux/x86) executable that provides a command line interface is available on spinlock/coredump at /cse451/zahorjan/public/md5/md5 .  When invoked with with a single argument that is a file name, it prints out the MD5 hash of the file contents:

spinlock> md5 md5c.c
277c511adf66ff8c4f3b8a0447e945fd

That long hex string is the hash.

To use this version (the binary), you should connect a pipe between a fork'ed invocation of the md5 code and your program.  (Note that this is not what happens in the code shown in class and on the course web, which connects two fork'ed processes together.  You can modify that code to do what you need here, though.)  Once the pipe is created and connected, and the md5 process fork'ed, you can just read the hash out of the pipe.

Implementation Step 1 - Single-Threaded using Exec/Pipes

You will not hand in this implementation or the results of running it, so in that sense it's optional.  However, there are enough new  things going on in this assignment that my advice is to get a single-threaded version working first, and to move on from there.  This version is just a C/C++ application.  (There are no kernel modifications required by any part of this assignment.)  Think carefully about the data structures you will use; bad choices can lead to very long running times.

Requirement: To be considered working, your program must (a) work, and (b) always work.  By the latter, I mean that you need to handle unexpected cases. For instance, a file that you put on the to be visited list might be deleted before you get around to trying to compute its md5 hash.  Don't die in that case, do the most sensible thing you can.  (Ditto for other exigencies.)  This requirement applies to all implementation steps.

Implementation Step 2 - Multi-Threaded using Exec/Pipes

The Step 1 program will be... let's just say "slow."  The assignment has been designed to try to make IO a significant component of the execution, so that may be an exploitable problem.  Because your program is single-threaded, and IO is synchronous, no progress is made when your program needs to perform physical disk IO.   To address this, we'll multi-thread the implementation.

Gee, that's the entire description of this step: implement a multi-threaded version of this program (using pthreads).  Your program should accept a single command line argument indicating the number of  threads that should work on the problem at once.  (The outputs it should produce are defined by the information requested in the "What to Hand In" section.)

Your main objective in this part is to design/implement a sensible multi-threaded version of this task.  As a secondary issue, you should use locking and other synchronization "judiciously" - imagine that you were running on an actual multiprocessor, and wanted to keep lock contention low.  (Because you won't be running on a multiprocessor, it will be hard to see much effect of your decisions about things like locking granularity, but you should think about the issues in any case.)

Requirement: Your implementation needs to establish a fixed, maximum number of entries that can be in the to be visited data structure at any time (and  that limit must be significant less than "as many as can fit in RAM").  WARNING WARNING WARNING - this requirement potentially introduces odd failure situations.  That's intentional. 

Implementation Step 3 - Multi-Threaded using Library Calls

One might reasonably suspect that fork'ing a new process to compute the hash of each file consumes a lot of resources.  Your program might run faster if you instead incorporated the md5 program into your own, and invoked it by procedure call rather than fork.

A (threadsafe) library version of the same md5 code is at /cse451/zahorjan/public/md5.a.  There is only a single exported routine:

int MDFile(char* filename, char hashValue[33]);

The return value is 1 if the hash has been produced, and 0 if an error was encountered. The first argument is the name of the file you want to hash.  The second argument is a buffer in which the hash will be placed (as a null terminated string).

There is no .h file for this library; just insert the signature above where needed.

What to Turn In

This material is due in class on February 4.

A writeup consisting of five things:

  1. A discussion of issues and design decisions you made in creating the multi-threaded program for Implementation Step 2.  What data structures did you use, and why?  What significant, structural changes from the Meta Code were required? What synchronization was required, and why? What did you do to reduce synchronization costs (e.g., the number of synchronization calls needed and/or the likelihood of blocking on a call)?  What did you do about the peculiar requirement in Step 2, and WARNING WARNING WARNING?  Etc...
  2. A (hardcopy) listing of the Step 2 version of your program.
  3. Some quantitative results that test functionality.  Run either multi-threaded version on a VMware machine, starting in the root directory ("/"), using 4 threads.  Report (a) the total number of data files examined, (b) the total number of directories examined, (c) the number of matches found (a single match is all files that are identical; 5 identical files counts as one match, except that files with no contents (i.e., of length zero) are not counted), (d) the fraction of all data file bytes that are duplicated in other files (total bytes in all files with a duplicate / total data file bytes), and (e) the total number of data file bytes examined (the denominator in the previous expression).
  4. Some quantitative results that test the effectiveness of the changes made in Steps 2 and 3.  Run both implementations on an otherwise idle machine (e.g., a VMware machine) starting in any directory you want.  Vary the number of worker threads used from 1 to 10.  Provide graphs that show (a) the elapsed time for each run (Hint: man time) and (b) the CPU utilization for each run.  (Make sure the directory you start in presents enough work that the timing routine you use can get meaningful results.)
  5. A short write-up that analyses these results.  For Step 2, are threads effective (e.g., do they reduce elapsed time and/or total CPU time consumed)?  If so, why?  If not, why not?  How do the Step 3 results compare with the Step 2 results?  What does this mean about where time is being spent (e.g., are you IO bound, CPU bound, synchronization bound)?

We reserve the right to request that you send us a copy of the code turned in as item 2.  Make sure you keep this code available, in case we ask.

Education

You've probably noticed that this assignment is "vaguer" than many you've seen.  That is intentional - it will take some work to figure things out, and that process is part of this assignment.

To avoid subverting this intention, some standard practices may need to be slightly modified.  For instance, please be very reluctant to use the mailing list to get help with "odd problems."  Odd problems will abound.  For example, when testing your program you might see an "out of resources" message as your program suddenly dies.  (I did.)  What's it talking about?  Where'd it come from?  What resource?

If you run into a problem that you have no clue about, don't let asking others be your first step in trying to solve it.  Instead, make a valiant attempt to figure out what the might be, using what you know and the non-human materials available to you. I believe that you know enough to form a pretty good hypothesis about any problem you might run into, but doing so definitely takes some effort.  If this doesn't let you solve the problem, next contact the course staff.  We'll try to lead you through the reasoning about your problem, filling in any gaps as best we can, and hopefully resolve it.  Finally, if that doesn't work, we'll use the mailing list (e.g., we won't have as much experience as you using the particular VMware setup in the labs).

That was just an example.  The thing to remember, though, is that going through the process of encountering problems, thinking about how to fix them, and then trying to fix them, is an explicit, intended benefit of this assignment.  Don't try to short-circuit this process by asking for help too soon, or by providing help in a way that just gives an answer but doesn't teach anything (in situations where there's something to be taught).

Rah Rah

This might strike you as difficult, either now (e.g., because half the terms here don't mean anything to you) or as you're working on it.  Don't despair!  The code itself is not terribly large.  (My admittedly partial implementation of Step 2 is about 100 lines of C.)  You have 11 days, with a midterm in the middle.  The work is intended to fit in this time, even allowing 48 hours during which you don't work on it due to the exam (and then only a normal amount of work on it per day for the remaining time).  But, START EARLY!  This assignment may be qualitatively much different than what you're used to.  You'll spend less time total by starting now than by trying to rush through it as the deadline draws closer.