CSE 303, Autumn 2008, Assignment 4

Due: Sunday, November 9 at 11 pm

Assignment goal

In this assignment, you will develop a more complex program using dynamic data structures. In doing so you will:

Synopsis

In this assignment, you will build programs to implement T9 predictive text, a text input mode available on cellphones. On a cellphone, each number from 2-9 on the keypad represent three or four letters, the number 0 represents a space, and 1 represents a set of symbols such as { , . ! ? } etc. The numbers from 2-9 represent letters as follows:

2 ABC
3 DEF
4 GHI
5 JKL
6 MNO
7 PQRS
8 TUV
9 WXYZ

Since multiple letters map to a single number, many key sequences represent multiple words. For example, the input 2665 represents "book" and "cool", among other possibilities.

To translate from number sequences to words, we will use a data structure known as a trie. A trie is a multiway branching structure (tree) that stores the prefixes of sequences. As we travel down a path in the trie, we reach word sequences spelled out by the numbers along that path. Classic trie data structures have edges labeled with letters to store prefixes of strings. But for this application, we used a compressed trie that has only 10 possible branches at each node instead of 26, since the digits 0-9 represent the 26 letters, space and symbols. Because of this, an extra layer of complexity is needed to figure out the string represented by a path, which is the predictive component of T9.

For more information on the data structure trie, here is a link to the Wikipedia article. Here is a link to the powerpoint presentation given in class (pptx 2007 format; a ppt version and a pdf version are also available (but are much larger)).

Technical Requirements

Implement in C a program t9. The command

        t9 FILE

should read in a dictionary file (FILE) that contains a list of words. Translate each word in the dictionary into its numeric key sequence, then add the key sequence to your trie, with the word at the end of the path corresponding to the digits. If a word with the same numeric sequence already exists in the trie, add the new word to the trie as a link to a new node with an edge labeled '#' instead of one of the digits 2-9. (The words linked from a node by the '#' edges essentially form a linked list of words that have the same numeric code.)

For example, if the program reads the set of words "jello","rocks", and "socks" from the dictionary and adds it to an empty trie, the resulting trie should look like this:

Once your program has read the dictionary and built the trie containing the words in it, it should start an interactive session. The user should be able to type numbers and the program should print out the word corresponding to the sequence of numbers entered. Your program should use the numbers typed by the user to traverse the trie that has already been created, retrieve the word, and print it to the screen. If the user then enters '#', the program should print the next word in the trie that has the same numeric value, and so forth.

As an example, if we run the program using the above trie, an interactive session might look like this:

Enter "exit" to quit.
Enter Key Sequence (or "#" for next word):
> 76257
   'rocks'
Enter Key Sequence (or "#" for next word):
> #
   'socks'
Enter Key Sequence (or "#" for next word):
> 53556
   'jello'
Enter Key Sequence (or "#" for next word):
> #
    There are no more T9onyms

Enter Key Sequence (or "#" for next word):
> 76257#
   'socks'
Enter Key Sequence (or "#" for next word):
> 76257##
	There are no more T9onyms
>4423
	Not found in current dictionary.
>exit

Note: Make sure your program properly handles the case if the user typing in more "#"s than there T9onyms!

We provide you with two text files, smallDictionary.txt and dictionary.txt (right-click on the links to download the files). Each of these text files contains a list of words to be used in constructing a trie - the small one primarily for testing, and the large one for the final program. Translate each word in the file into its associated T9 key sequence, and add the word to the trie. In the case of multiple words having the same key sequence k, let the first word encountered in the text file be represented by the key sequence k, the next encountered represented by k#, the next k##, etc. For example, 2273 can represent acre, bard, bare, base, cape, card, care, or case. To disambiguate, acre would be represented by 2273, bard by 2273# bare by 2273##, and so forth. When a user inputs a key sequence, print the appropriate word or words.

Your trie data structure should contain nodes to represent the tree, and strings (char arrays) containing copies of the words read from the dictionary file, linked to appropriate nodes in the trie.

Besides the general specification given above, your program should meet the following requirements to receive full credit.

  1. You should create a Makefile and use make to compile your program. Your Makefile should only recompile the necessary part(s) of the program after changes are made.
  2. Use malloc to dynamically allocate the nodes, strings, and any other data that make up your trie.
  3. If you need to create a copy of a string or other variable-size data, you should dynamically allocate an appropriate amount of storage using malloc and return the storage with free when you are done with it. The amount allocated should be based on the actual size needed, not some arbitrary size that is assumed to be "large enough".
  4. Use standard C library functions where possible; do not reimplement operations available in the basic libraries.
  5. If an error occurs when opening or reading a file, the program should write an appropriate error message to stderr and terminate if there is no further work to be done.
  6. Before the program terminates, all dynamically allocated data must be properly freed (i.e., free everything acquired with malloc).
  7. Your code must compile and run without errors or warnings when compiled with gcc -Wall on attu. Your program should build without errors when make is used on attu to run your Makefile..

Code Quality Requirements

As with any program you write, your code should be readable and understandable to anyone who knows C. In particular, for full credit your code must observe the following requirements.

  1. Divide your program into suitable source files (at least two) and functions, each of which does a single well- defined aspect of the assignment. For example, there should almost certainly be a header and source file for the trie data structure and the operations needed on it (create a new empty trie, insert a word, search, free (delete) the trie, etc.). Your program most definitely may not consist of one huge main function that does everything.
  2. The header (.h) file for the trie (and any other header files) should only declare items that are shared between client programs that use the header and the file(s) that implement it. Don't include in the header file implementation details that should be hidden. Be sure to use the standard #ifndef preprocessor trick so your header files work properly if included more than once in a source file, either directly or indirectly.
  3. Be sure to include appropriate function prototypes near the beginning of each source file for functions whose declarations are not included in a header file.
  4. Comment sensibly, but not excessively. You should not use comments to repeat the obvious or explain how the C language works - assume that the reader knows C at least as well as you do. Your code should, however, include the following minimum comments:
  5. Use appropriate names for variables and functions: nouns or noun phrases suggesting the contents of variables or the results of value-returning functions; verbs or verb phrases for void functions that perform an action without returning a value. Variables of local significance like loop counters or indices should be given simple names like i or n and do not require further comments. Avoid names like count or flag that provide no useful information - use names that suggests the values being counted or the condition that is represented.
  6. No global variables. Use parameters (particularly pointers) appropriately. Exception: if you wish, you may have global variables that record the settings of any command-line options added for the extra credit part (if you create any of these).
  7. No unnecessary computation or excessive use of malloc or free - these are expensive. Don't make unnecessary copies of large data structures; use pointers. (Copies of ints, pointers, and similar things are cheap; copies of large arrays and structs are expensive.)

Implementation Hints

  1. There are a lot of things to get right here; the job may seem overwhelming if you try to do it all at once. But if you break it into small tasks, each one of which can be done individually by itself, it should be quite manageable. For instance, figure out how to add a single word to the trie before you implement the logic to process all the words in the dictionary. Figure out how to add a few words that have different numeric codes before you implement the code to handle words that have the same codes. Implement the code to traverse the trie to translate an input key sequence into the corresponding word once you've built the trie, not before.
  2. Before you start typing code, spend some time sketching out data structures and code (particularly tree node structs) on paper or on a whiteboard. Be sure you understand what you are trying to do before you start typing.
  3. Every time you add something new to your code (see hint #1), test it. Right Now! It is much easier to find and fix problems if you can isolate the potential bug to a small section of code you just added or changed. printf and gdb are your friends here to examine values while debugging.
  4. You will probably find it very useful to include code that can print the contents of the trie in some understandable format. This is not required, but how can you be sure your code is correct if you can't look at the trie that is built for a small set of input data?
  5. To build the trie, you need some way to translate characters (primarily letters) from words in the dictionary file to the correspoinding keypad digits. It is probably a great idea to include in your code a function that takes a character as an argument and returns the corresponding digit. This can be implemented with a series of if-elseif-else statements, but another way to do it is to have an array with one entry for each possible character. In the entry for each character, store the corresponding digit code. Then you can look up the code for a character without a complicated nest of if statements. (Recall that in C, as in similar languages, a character can be used as a small integer. That is particularly helpful for exactly this kind of application, where we might like to use a character as an index value into a table of data.)
  6. Be sure to check for errors like trying to open a nonexistent file to see if your error handling is working properly.
  7. Once you're done, read the instructions again to see if you overlooked anything.

Extra Credit

A small amount of extra credit will be awarded for adding the following extensions to an already complete, working assignment. No extra credit will be awarded if the basic program is not fully implemented and substantially bug-free.

If you include extensions in your program, you should also include a README file that describes what you added and how to demonstrate your addition(s) when your program is executed.

Test Sequences:

The sequences below can be used to validate your trie against the given dictionary.

  1. 22737: acres, bards, barer,bares,barfs,baser,bases,caper,capes,cards,carer,cares,cases
  2. 46637: goner,goods,goofs,homer,homes,honer,hones,hoods,hoofs,inner
  3. 2273: acre, bard,bare,barf,base,cape,card,care,case
  4. 729: paw,pax,pay,raw,rax,ray,saw,sax,say
  5. 76737: popes,pores,poser,poses,roper,roses,sords,sorer,sores

What to turn in

Create an uncompressed tar file containing your source code, Makefile, and (if you have it) your README file describing your extensions and submit that using the normal class dropbox.