Project 3 Part B - Tagged File Support Design

Out: Monday, May 15
Due: Part B: Wednesday, May 31, 11:30 am

Overview

Studies show that most bytes travelling over the internet carry multimedia content. It's a reasonable guess that most bytes on disk, worldwide, are multimedia bytes as well. Multimedia files have a number of characteristics that differ from those of "traditional files": for example, they're bigger, an important application (playback) involves reading them sequentially at rates that are unlikely to challenge the performance limits of the hardware, and they're typically tagged.

We're going to focus on the tags in this assignment (although other attributes are fair game as well if you want to consider them). To make things even more concrete, we'll think about the specific case of mp3 files and ID3V2 tags. It's likely, though, that your solutions would be agnostic about the kinds of tags, which would be a good thing: not only is there more than one flavor of mp3 tag, there are other completely different tags for audio files, as well as tag formats for still images and movies. More generally, one could think of tags as a very flexible notion of file meta-data, beyond the simple things (like last modification time, owner, etc.) provided by standard file systems. Indeed, some people are working on that.

Problem Overview

mp3 files contain two kinds of information: the encoded audio information, and tags. Tags provide information like the name of the artist, the name of the album, the name of the track, and the length of the track. One can easily imagine other information being kept as tags as well.

Both kinds of information are encoded in a single file. Here is a picture illustrating, at a very high level, this convention for ID3V2 tags (taken from www.id3.org):

Because (in a typical system) a file is just an array of bytes, the writer and the reader of an mp3 file must follow some convention about which bytes are the tag and which are the encoded audio. That is, there is no "read tags" system call, or anything of that sort. The convention provides the specific information required to interpret the tag data: how to tell where a field (a blue entry in the image) starts and ends, what type of field it is (e.g., artist or track), and how to interpret the data it carries (e.g., artist name as a zero-terminated ASCII string or track number as an unsigned byte).
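To make the idea of a byte-level convention concrete, here is a minimal sketch of how a reader might separate tag bytes from audio bytes. It assumes the ID3v2.3/2.4 layout (a 10-byte header beginning with "ID3" whose last four bytes are a "syncsafe" size, followed by frames that each carry a 4-character ID, a 4-byte size, and 2 flag bytes). It deliberately ignores extended headers, unsynchronisation, footers, and non-Latin-1 text encodings. The function names are made up for this illustration, and `int_to_syncsafe` and `text_frame` exist only to build sample data.

```python
def syncsafe_to_int(b: bytes) -> int:
    """Decode a syncsafe integer: only the low 7 bits of each byte count."""
    value = 0
    for byte in b:
        value = (value << 7) | (byte & 0x7F)
    return value


def int_to_syncsafe(n: int) -> bytes:
    """Encode n as 4 syncsafe bytes (helper for building sample data)."""
    return bytes((n >> shift) & 0x7F for shift in (21, 14, 7, 0))


def text_frame(frame_id: str, text: str) -> bytes:
    """Build an ID3v2.3-style text frame (helper for building sample data)."""
    body = b"\x00" + text.encode("latin-1")  # leading 0 = ISO-8859-1 encoding
    return frame_id.encode("ascii") + len(body).to_bytes(4, "big") + b"\x00\x00" + body


def parse_id3v2(data: bytes):
    """Return (text_frames, audio_start) for a tag at the start of data.

    Simplified: assumes no extended header and no unsynchronisation.
    Returns ({}, 0) if data does not start with an ID3v2 header.
    """
    if len(data) < 10 or data[:3] != b"ID3":
        return {}, 0
    major = data[3]                          # 3 for ID3v2.3, 4 for ID3v2.4
    tag_size = syncsafe_to_int(data[6:10])   # size excludes the 10-byte header
    end = min(10 + tag_size, len(data))
    frames = {}
    pos = 10
    while pos + 10 <= end:
        frame_id = data[pos:pos + 4]
        if frame_id == b"\x00\x00\x00\x00":  # reached padding
            break
        raw_size = data[pos + 4:pos + 8]
        # v2.4 frame sizes are syncsafe; v2.3 sizes are plain big-endian.
        size = syncsafe_to_int(raw_size) if major == 4 else int.from_bytes(raw_size, "big")
        if pos + 10 + size > end:            # malformed frame; stop parsing
            break
        body = data[pos + 10:pos + 10 + size]
        # Text frames start with 'T'; keep only those stored as ISO-8859-1.
        if frame_id.startswith(b"T") and body and body[0] == 0:
            text = body[1:].rstrip(b"\x00").decode("latin-1")
            frames[frame_id.decode("latin-1")] = text
        pos += 10 + size
    return frames, end
```

Given a buffer holding a tag with a `TPE1` (artist) frame and a `TIT2` (title) frame followed by audio data, `parse_id3v2` returns the two text values plus the offset at which the audio begins. Note that all of this knowledge lives in the application: the file system sees only one undifferentiated byte array.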

So far so good. But almost everything about this situation is at odds with the rest of the system, including the underlying file system, which knows nothing about the tag convention. Here are two specific examples:

Applications that deal with mp3's often deal with the file naming issue in the following way. First, the user points the application at some directory. The application then recursively traverses the subtree rooted at that directory, extracting tag information from all mp3 files it encounters. That information is put in a database. The interface to the user allows specifying files by tag information, and the application uses the database to look up the file name(s) that correspond to what the user has specified. That is, the database effectively replaces the traditional use of directory trees as the primary concept behind naming files.
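The scan-and-index approach just described can be sketched as follows. This is only an illustration under stated assumptions, not how any particular player works: the table layout, the function names `build_index` and `find_files`, and the `read_tags` callback (a hypothetical function that maps a file path to a dictionary of tag names and values) are all invented for this example.

```python
import os
import sqlite3


def build_index(root, read_tags, db_path=":memory:"):
    """Walk the subtree under root and record the tags of every .mp3 file.

    read_tags is a caller-supplied (hypothetical) function: path -> {key: value}.
    Returns an open sqlite3 connection holding the index.
    """
    conn = sqlite3.connect(db_path)
    conn.execute("CREATE TABLE IF NOT EXISTS tags (path TEXT, key TEXT, value TEXT)")
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            if not name.lower().endswith(".mp3"):
                continue  # only mp3 files are indexed
            path = os.path.join(dirpath, name)
            for key, value in read_tags(path).items():
                conn.execute("INSERT INTO tags VALUES (?, ?, ?)", (path, key, value))
    conn.commit()
    return conn


def find_files(conn, key, value):
    """Name files by tag: return the sorted paths whose tag key equals value."""
    rows = conn.execute(
        "SELECT path FROM tags WHERE key = ? AND value = ? ORDER BY path",
        (key, value),
    )
    return [row[0] for row in rows]
```

A call such as `find_files(conn, "artist", "Some Artist")` then plays the role that a path lookup plays in a conventional directory tree. The index is a snapshot taken at scan time and lives entirely in the application, outside the file system.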

There is no good solution at the moment for the second problem. Fixing that is a primary goal of the design exercise.

The Problem

Design support that would make manipulating file tag information as convenient and flexible as possible.

This support might come at one or more of a number of levels:

Factors to Consider

These are things to think about. Not all can be met at once.

What To Hand In

A report that summarizes your understanding of the issues and the strengths and weaknesses of various approaches you considered. While you should discuss approaches you rejected (and explain why you rejected them), in the end you must come to a specific recommendation. Make sure that recommendation is easily identifiable to someone reading the report.

Hand in the report as hardcopy, in class on the due date.