Homework: T9 Implementation

Due: Tuesday, November 12, 2024, at 11:59pm

Jump To:

Goals
Synopsis
Set-Up
Background Information
Provided Files
Your Task
Assessment
Turn-in instructions

Assignment Goal

In this assignment, you will develop a more complex program using dynamic data structures. In doing so you will:

Synopsis

In this assignment, you will build programs to implement T9 predictive text, a text input mode developed originally for cell phones and still used for numeric keypads. Each number from 2-9 on the keypad represents three or four letters, the number 0 represents a space, and 1 represents a set of symbols such as { , . ! ? } etc. The numbers from 2-9 represent letters as follows:

2 => ABC
3 => DEF
4 => GHI
5 => JKL
6 => MNO
7 => PQRS
8 => TUV
9 => WXYZ

Since multiple letters map to a single number, many key sequences represent multiple words. For example, the input 2665 represents "book" and "cool", among other possibilities.

          2       6       6       5              2       6       6       5
        a B c   m n O   m n O   j K l    OR    a b C   m n O   m n O   j k L
	  |         |       |     |                |       |       |       |
	   -----------------------                  -----------------------
	             |                                         |
	            book                                      cool

To translate from number sequences to words, we will use a data structure known as a trie. A trie is a multiway branching structure (tree) that stores the prefixes of sequences. As we travel down a path in the trie, we reach word sequences spelled out by the numbers along that path. Classic trie data structures have edges labeled with letters to store prefixes of strings. But for this application, we use a compressed trie that has only 10 possible branches at each node instead of 26, since the digits 0-9 represent the 26 letters, space and symbols. Because of this, an extra layer of complexity is needed to figure out the string represented by a path.

(Actually, for our application, each node only needs 8 possible children numbered 2-9, since digits 0 and 1 don't encode letters. But writing the code might be easier if nodes have 10 children numbered 0-9, since then subtree number n corresponds to digit n. Feel free to use either representation for the trie depending on which seems simpler to implement.)

For more information on the trie data structure, here is a link to the Wikipedia article.

This project should be done independently. If you work with a classmate, make sure you are each editing and working on your own set of files. You should not copy and paste code.

Set-Up

Before you get started, ensure that your set-up is up-to-date and appropriate.

  1. You should do this assignment using cancun, including gcc
  2. Ensure that your repository is up-to-date and committed. (You can use git status to determine if there are outstanding changes, and git add and git commit if you are unsure.)
  3. Use git pull upstream main to pull the newest commit from the upstream repository. This will give you access to the hw6 folder containing the materials for this assignment. (upstream specifies that you want to pull from the upstream repository, and main specifies the branch. You may see a text editor open to allow you to edit the merge message. This will likely be Vim, so you can edit it and then save, or just accept the current text and exit using :q!.)

You will do this assignment in your new hw6 folder. You should get into the habit of committing and pushing code frequently.

Background Information

Test-Driven Development

One common style of software engineering is called “Test-Driven Development” (TDD). With this workflow, you write tests for the intended behavior before writing the code which implements that behavior. In the previous assignment you have written a complete set of tests for a T9 implemenation. In this assignment you will be doing the development portion of TDD.

Provided Files

Your hw6 folder contains the following files:

Your Task

Implement the data structure that supports the T9 demo, as described in the t9_lib.h file. This new library will be run with the provided j9_demo.c to allow for an interactive session.

Your code will implement a prefix tree, such that some nodes contain completed words.

For example, if the program reads the set of words "jello","rocks", and "socks" from the dictionary and adds it to an empty trie, the resulting trie should look like this:

We provide you with two text files, smallDictionary.txt and dictionary.txt. Each of these text files contains a list of words to be used in constructing a trie. You may additionally use your own dictionary files to test.

Translate each word in the file into its associated T9 key sequence, and add the word to the trie. In the case of multiple words having the same key sequence k, let the first word encountered in the text file be represented by the key sequence k, the next encountered represented by k#, the next k##, etc. For example, 2273 can represent acre, bard, bare, base, cape, card, care, or case. To disambiguate, acre would be represented by 2273, bard by 2273#, bare by 2273##, and so forth.

For the interactive session, use the user-entered pattern to traverse the trie and return the correct word.

Your trie data structure should contain nodes to represent the tree, and strings (char arrays) containing copies of the words read from the dictionary file, linked to appropriate nodes in the trie.

Besides the general specification given above, your program should meet the following requirements to receive full credit.

  1. You should create a Makefile and use make to compile your program. Your Makefile should only recompile the necessary part(s) of the program after changes are made. Your Makefile should include a target clean to remove previously generated files. You may also want to include a target test for your own use during development.
  2. Use malloc to dynamically allocate the nodes, strings, and any other data that make up your trie. Make sure you free all the memory at the end of the run. We will use valgrind to evaluate how well your code manages memory.
  3. If you need to create a copy of a string or other variable-size data, you should dynamically allocate an appropriate amount of storage using malloc and release the storage with free when you are done with it. The amount allocated should be based on the actual size needed, not some arbitrary size that is assumed to be "large enough".
  4. One exception to this is that you may assume that no word exceed 50 characters in length. While you should not create trie nodes with strings of the default length of 50, you may use this value while reading from the input file.
  5. Use standard C library functions where possible; do not reimplement operations available in the standard libraries.
  6. Use buffer safe varitions of functions, such as strncpy instead of strcpy
  7. You must check the return status (result code) of every library function you call to be sure that no errors occurred. In particular, malloc will return NULL if it is unable to allocate storage. Although this is extremely unlikely to happen, a robust program must check for the possibility and react appropriately if it does.
  8. If an error occurs when opening or reading a file, or due to a failed function call, the program should write an appropriate error message to stderr and terminate if necessary.
  9. Before the program terminates, all dynamically allocated data must be properly freed (i.e., free everything acquired with malloc). This should be done explicitly without relying on the operating system to clean up after the program finishes. Remember to close any files you have opened.
  10. Your code must compile and run without errors or warnings when compiled with gcc -Wall -std=c11 on Cancun. Your program should build without errors when make is used to run your Makefile. You are, of course, free to use other systems for development, and you should be fine as long as you have a relatively recent version of gcc. But the code must work on the standard linux distribution we use for the course.
  11. Your program should terminate cleanly with no memory leaks or other memory errors reported when it is run using valgrind. You must load a dictionary and have a short interactive session while testing with valgrind. (Warning: valgrind slows down execution considerably. It will take several minutes to load the full dictionary.txt file and then free the resulting tree under valgrind. We suggest you use smaller input files during development to test for memory problems with valgrind.) If memory leaks are detected, compiling with the -g option and using valgrind's --leak-check=full option will generate more extensive messages with information about the memory leaks.

Code Quality Requirements

As with any program you write, your code should be readable and understandable to anyone who knows C. In particular, for full credit your code must observe the following requirements.

  1. The header (.h) file for the trie should only declare items that are shared between client programs that use the header and the file(s) that implement it. Don't include in the header file implementation details that should be hidden. If you make use of helper functions within T9 you should declare them at the top of the t9_lib.c file. Be sure to include comments that describe their function.
  2. Be sure to include appropriate function prototypes near the beginning of each source file for functions defined in that file whose declarations are not included in a header file.
  3. Comment sensibly, but not excessively. You should not use comments to repeat the obvious or explain how the C language works - assume that the reader knows C at least as well as you do. You should use comments to describe the use of variables and decision points within your code.
  4. Every source file should begin with a comment identifying the file, author, and purpose (i.e., the assignment or project).
  5. Use appropriate names for variables and functions
  6. No global variables. Use parameters (particularly pointers) appropriately.
  7. No unnecessary computation or excessive use of malloc or free - these are expensive. Don't make unnecessary copies of large data structures; use pointers. (Copies of ints, pointers, and similar things are cheap; copies of large arrays and structs are expensive.)

As with the previous assignment, you should use the cpplint.py style checker to review your code. While this checker may flag a few things that you wish to leave as-is, most of the things it catches, including whitespace errors in the code, should be fixed. We will run this style checker on your code to check for any issues that should have been fixed. Please use the class discussion board if you have questions about any of clint's warnings and whether they can be ignored.

Implementation Hints

  1. There are a lot of things to get right here; the job may seem overwhelming if you try to do it all at once. But if you break it into small tasks, each one of which can be done individually by itself, it should be quite manageable. Consider adding one function at a time, stubbing in additional functions if necessary.
  2. Before you start typing code into the computer, spend some time sketching out data structures and code (particularly trie node structs) on paper or on a whiteboard. Be sure you understand what you are trying to do before you start typing.
  3. Every time you add something new to your code (see hint #1), test it. Right Now! It is much easier to find and fix problems if you can isolate the potential bug to a small section of code you just added or changed. gdb and valgrind are your friends here to examine values while debugging.
  4. You will probably find it very useful to include code that can print the contents of the trie in some understandable format. This is not required, but how can you be sure your code is correct if you can't look at the trie that is built for a small set of input data?
  5. Start with a small data file and figure out in advance what the resulting trie should look like. Then verify that the program does, in fact, create that trie.
  6. gdb is your friend.
  7. To build the trie, you need some way to translate characters (primarily letters) from words in the dictionary file to the corresponding keypad digits. It is probably a great idea to include in your code a function that takes a character as an argument and returns the corresponding digit. This can be implemented with a series of if-elseif-else statements, but another way to do it is to have an array with one entry for each possible character. In the entry for each character, store the corresponding digit code. Then you can look up the code for a character without a complicated nest of if statements. (Recall that in C, as in Java and other languages, a character can be used as a small integer. That is particularly helpful for exactly this kind of application, where we might want to use a character value as an index into a table of data.)
  8. Be sure to check for errors like trying to open a nonexistent file to see if your error handling is working properly.
  9. Once you're done, read the instructions again to see if you overlooked anything.
  10. Reread the previous hint and obey.

Test Sequences:

The sequences below can be used to validate your trie against the supplied dictionary.txt file.

  1. 22737: acres, bards, barer, bares, barfs, baser, bases, caper, capes, cards, carer, cares, cases
  2. 46637: goner, goods, goofs, homer, homes, honer, hones, hoods, hoofs, inner
  3. 2273: acre, bard, bare, barf, base, cape, card, care, case
  4. 729: paw, pax, pay, raw, rax, ray, saw, sax, say
  5. 76737: popes, pores, poser, poses, roper, ropes, roses, sords, sorer, sores

Assessment

This homework is worth a total of 65 points.

There will be an autograder which evaluates whether your T9 implementation meets the interface specification. You should use the test suite you created during the previous homework to test your T9 functionality. You will also want to use Valgrind, similar to the way you used it in the debugging homework, to evaluate your memory management. You are encouraged to resubmit until you achieve a full score on the autograder.

This homework will also be evaluated manually, specifically to look at code clarity, style, and commenting. You should be sure to include your name and date at the top of the file. You should also be sure to use brief comments to describe the pupose of the module, and functionality of other sections.

Turning In

You will submit this homework to the Gradescope HW6: T9 Implementation, via Gitlab.

You will first update your Gitlab repository. Use git add and git commit to ensure that your updated t9_tests.c, and any required additional &ldquot;.txt&rdquot; files, are committed. This will be, at minimum, your t9_lib.c and Makefile.

These should be located in the hw6 folder, located at the top level of your repository. User git push to bring the origin/remote repository up-to-date.

Once you locate the Gitlab assignment you will tap the "GitLab" button on the bottom:

Picture of Gradescope interface highlighting the Gitlab button as opposed to the Upload button
Find your 374 repository in the list, and Submit Project.
Picture of Gradescope submission interface

Once you submit your code the autograder may take some time to run. You may resubmit for a higher grade, but your should always do as much testing as possible on your own platform before resubmitting.