In this assignment, you will create your own version of the Unix command wc, called wordcount. Your program will read one or more files and report the number of lines, words, and characters in each file. Additionally, wordcount should print the total number of lines across all files at the end. If an option (flag) is provided, print only the count that the option selects.

Your code should behave as follows:

Compile

Compile with the command gcc -g -Wall -std=c11 -o wordcount wordcount.c. Your program must compile and run without errors or warnings when compiled and executed on Calgary.

Remember to recompile each time you change your code; otherwise, your changes will not be reflected the next time you run the program.

Input Handling

The program requires one or more input files.

  • If there are no input files, print the usage message and exit with EXIT_FAILURE.
  • The program does not need to work with redirected input (i.e., input from stdin).
  • If an error occurs when opening or reading a file, the program should write an appropriate error message to stderr saying the file is skipped. The program should then continue processing any remaining files on the command line.
  • Tip: Refer to the fprintf function to send output to stderr.
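
The skip-on-error behavior described above can be sketched as follows. This is only an illustration, not a required design: open_or_skip is a hypothetical helper name, and the sketch assumes that fopen sets errno on failure (true on POSIX systems, though not guaranteed by the C standard), so that strerror can supply text such as "No such file or directory".

```c
#include <errno.h>
#include <stdio.h>
#include <string.h>

/*
 * Try to open path for reading. On failure, write a message to stderr and
 * return NULL so the caller can skip this file and continue with the next.
 * (open_or_skip is a hypothetical name used only for this sketch.)
 */
FILE *open_or_skip(const char *path) {
  FILE *fp = fopen(path, "r");
  if (fp == NULL) {
    /* Assumes fopen set errno; strerror turns it into readable text. */
    fprintf(stderr, "wordcount: %s: %s\n", path, strerror(errno));
  }
  return fp;
}
```

A caller would check the result for NULL before reading, and call fclose on every stream that was opened successfully.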

The program should also accept these three options: -l, -w, -c.

  • If an option is detected, the program will output only the number of lines, words, or characters respectively.
  • If an option is specified, the program will NOT print the total number of lines in all the files.
  • The program shouldn’t process more than one option. If more than one option is provided, only the first one should be activated.

To simplify the option handling you will assume that any options will come before the names of the input files. In other words, if you detect any input argument string that is not a valid option, you will assume that it, and any subsequent arguments, are file names.
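
One possible way to structure this option scan is sketched below. The names Mode, parse_option, and parse_args are hypothetical. The sketch also assumes that a second valid option appearing before the file names is silently ignored; the text above only says that the first option is the one activated.

```c
#include <stdbool.h>
#include <string.h>

/* Which count to print; MODE_ALL means no option was given. */
typedef enum { MODE_ALL, MODE_LINES, MODE_WORDS, MODE_CHARS } Mode;

/* Return true and set *mode if arg is exactly "-l", "-w", or "-c". */
static bool parse_option(const char *arg, Mode *mode) {
  if (strcmp(arg, "-l") == 0) {
    *mode = MODE_LINES;
  } else if (strcmp(arg, "-w") == 0) {
    *mode = MODE_WORDS;
  } else if (strcmp(arg, "-c") == 0) {
    *mode = MODE_CHARS;
  } else {
    return false;
  }
  return true;
}

/*
 * Scan the leading options in argv. Stores the first option seen in *mode
 * (MODE_ALL if there is none) and returns the index of the first file
 * name, which equals argc when no file names were given.
 */
int parse_args(int argc, char *argv[], Mode *mode) {
  int i = 1;
  Mode seen;
  *mode = MODE_ALL;
  while (i < argc && parse_option(argv[i], &seen)) {
    if (*mode == MODE_ALL) {
      *mode = seen;  /* assumption: later valid options are ignored */
    }
    i++;
  }
  return i;
}
```

With this approach, an argument such as -wc or -line is not an exact match for any option, so it and everything after it are treated as file names.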

Your program must be able to handle input lines containing up to 500 characters (including the terminating \0 byte). The behavior on files with longer lines is undefined and will not be evaluated.

Functionality

For each input file, calculate the number of lines, words, and characters.
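
As a rough sketch of this calculation, the function below reads a stream one line at a time and gathers all three counts in a single pass. The names Counts and count_stream are hypothetical, and the sketch assumes that words are separated only by spaces, tabs, and newlines and that no line is longer than the 500-character limit mentioned earlier.

```c
#include <stdbool.h>
#include <stdio.h>
#include <string.h>

#define MAXLINE 500  /* longest line handled, including the '\0' */

/* Line, word, and character counts for one file. */
typedef struct {
  int lines;
  int words;
  int chars;
} Counts;

/*
 * Read fp line by line and store its counts in *counts. A word is a run
 * of characters other than ' ', '\t', and '\n'; a line ends with '\n'.
 */
void count_stream(FILE *fp, Counts *counts) {
  char line[MAXLINE];    /* buffer for the current line */
  bool in_word = false;  /* true while scanning inside a word */

  counts->lines = 0;
  counts->words = 0;
  counts->chars = 0;
  while (fgets(line, MAXLINE, fp) != NULL) {
    size_t len = strlen(line);  /* includes the '\n' if one was read */
    counts->chars += (int)len;
    for (size_t i = 0; i < len; i++) {
      char ch = line[i];
      if (ch == ' ' || ch == '\t' || ch == '\n') {
        in_word = false;
        if (ch == '\n') {
          counts->lines++;
        }
      } else if (!in_word) {
        in_word = true;  /* first character of a new word */
        counts->words++;
      }
    }
  }
}
```

Passing a pointer to the Counts struct keeps the function free of global variables, and computing everything in one pass means the option only affects what is printed afterwards.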

Print a message with the results and the file name, similar to the output of wc. However, it should follow these additional formatting guidelines (you can find examples of messages below):

  • As always, if there are no input files, you should print this usage message: Usage: ./wordcount requires an input file.
  • When one of the input files doesn’t exist, the program should print the message wordcount: <name of file>: No such file or directory.
  • Use single space characters (' ') to separate numbers and file names.
  • Label each number (i.e., lines: x words: y characters: z)

Additionally, print the total number of lines for all files if no option is provided.
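
The formatting rules above could be implemented with a few small printing helpers like the ones sketched here. The function names are hypothetical, and taking the output stream as a parameter is a design choice made for this sketch so the helpers are easy to test; passing stdout gives the normal behavior.

```c
#include <stdio.h>

/* Print all three labeled counts for one file (no option given). */
void print_all(FILE *out, int lines, int words, int chars,
               const char *name) {
  fprintf(out, "lines: %d words: %d characters: %d %s\n",
          lines, words, chars, name);
}

/* Print a single labeled count, e.g. label "lines" for the -l option. */
void print_one(FILE *out, const char *label, int value, const char *name) {
  fprintf(out, "%s: %d %s\n", label, value, name);
}

/* Print the closing total, which is only shown when no option is given. */
void print_total(FILE *out, int total_lines) {
  fprintf(out, "Total Lines = %d\n", total_lines);
}
```

For example, print_all(stdout, 10, 28, 175, "hello.c") writes a line in the same labeled format as the sample output in the Example Run section.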

Tip

You can check against wc on any test file; the numbers should be the same. There may be some edge cases where there is a difference of 1 in line count or character count that is caused by the newline character at the end of the file. We will not test you on this special case. To make things easier, just end all of your files with a newline.

Exit

If the program terminates prematurely because of some error, it should print an appropriate error message to stderr and exit with an exit code of EXIT_FAILURE (defined in <stdlib.h> – see the description of the exit() function).

If the program terminates normally after attempting to open and process all of the files listed on the command line, it should terminate with an exit code of EXIT_SUCCESS. This is normally done by returning the value EXIT_SUCCESS as the int result of the main function.
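
A small sketch of how the two exit codes might be produced is shown below. The helper name check_usage and its parameters are hypothetical, and sending the usage message to stderr is an assumption based on the rule above about error messages; the specification does not say which stream the usage message uses.

```c
#include <stdio.h>
#include <stdlib.h>

/*
 * Return EXIT_FAILURE after printing the usage message if no file names
 * remain on the command line; otherwise return EXIT_SUCCESS.
 * first_file is the index in argv of the first file name.
 */
int check_usage(int argc, int first_file) {
  if (first_file >= argc) {
    /* Assumption: the usage message is treated as an error message. */
    fprintf(stderr, "Usage: ./wordcount requires an input file.\n");
    return EXIT_FAILURE;
  }
  return EXIT_SUCCESS;
}
```

In main, a result of EXIT_FAILURE from a check like this can be returned immediately, while the normal path returns EXIT_SUCCESS after all files have been attempted.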

Example Run

The hw3 folder includes the following files that you can use for testing these commands. You are also welcome to create your own files and test your wordcount program with them. When running the following commands, your program should match the following behavior exactly (this includes message formatting):

$ ./wordcount
Usage: ./wordcount requires an input file.
$ echo $?
1
$ ./wordcount -l
Usage: ./wordcount requires an input file.

$ ./wordcount -l hello.c
lines: 10 hello.c
$ ./wordcount -w shorttext
words: 13 shorttext
$ ./wordcount -c shorttext longtext
characters: 68 shorttext
characters: 1637 longtext

$ ./wordcount hello.c "NON FILE" shorttext
lines: 10 words: 28 characters: 175 hello.c
wordcount: NON FILE: No such file or directory
lines: 4 words: 13 characters: 68 shorttext
Total Lines = 14

$ ./wordcount -l -wc shorttext
wordcount: -wc: No such file or directory
lines: 4 shorttext
$ ./wordcount -line shorttext
wordcount: -line: No such file or directory
lines: 4 words: 13 characters: 68 shorttext
Total Lines = 4
$ ./wordcount shorttext -l
lines: 4 words: 13 characters: 68 shorttext
wordcount: -l: No such file or directory
Total Lines = 4

The autograder includes many more test cases, but these examples demonstrate what is defined by the specification above.

You can also use wc to check that the numbers output by wordcount are correct. Here, we see that wc produces the same numbers as wordcount does above.

$ wc hello.c "NON FILE" shorttext
 10  28 175 hello.c
wc: 'NON FILE': No such file or directory
  4  13  68 shorttext
 14  41 243 total

Technical Requirements

You should pay attention to the following guidelines for meeting performance expectations.

  1. Use standard C library functions where possible; do not reimplement operations available in the basic libraries. For instance, strncpy in <string.h> can be used to copy \0-terminated strings; you should not be writing loops to copy such strings one character at a time.
  2. You should use “safe” versions of file and string handling routines such as fgets and strncpy instead of routines like gets and strcpy. The safe functions allow specification of maximum buffer or array lengths and will not overrun adjacent memory if used properly.
  3. Since this program is likely relatively short, all of the functions should be in a single file called wordcount.c. You should arrange your code so that related functions are grouped together in a logical order in the file.
  4. Your program must be robust. It should not crash (segfault or otherwise) or produce meaningless or incorrect output regardless of the contents of command line parameters or input files (except, of course, you are not required to deal with files or string parameters with lines longer than the limits given above).
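
As an illustration of guideline 2, note that strncpy does not write a terminating \0 when the source string fills the whole buffer, so "used properly" usually means terminating the destination explicitly. The helper below is a minimal sketch; copy_string is a hypothetical name.

```c
#include <string.h>

/*
 * Copy src into dst, whose capacity is size bytes, always leaving dst
 * '\0'-terminated. At most size - 1 characters of src are kept.
 */
void copy_string(char *dst, const char *src, size_t size) {
  if (size == 0) {
    return;  /* no room even for the terminator */
  }
  strncpy(dst, src, size - 1);
  dst[size - 1] = '\0';  /* strncpy does not guarantee termination */
}
```

The same habit applies to fgets: pass the real size of the buffer so the function can never write past its end.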

Code Quality Requirements

As with any program you write, your code should be readable and understandable to anyone who knows C. In particular, for full credit your code must observe the following requirements.

  1. Divide your program into suitable functions, each of which does a single well-defined task. For example, there should almost certainly be a function that processes a single input file, which is called as many times as needed to process each of the files listed on the command line (and which, in turn, might call other functions to perform identifiable subtasks).

Caution

Your program most definitely may not consist of one huge main function that does everything.

However, it also should not consist of tiny functions that contain only isolated statements or code fragments; divide the program into coherent pieces instead.

  2. Be sure to include appropriate function prototypes/declarations near the beginning of the file so the actual function definitions can be in a logical sequence and related functions are grouped together.

  3. Comment sensibly, but not excessively. You should not use comments to repeat the obvious or explain how the C language works – assume that the reader knows C at least as well as you. Your code should, however, include the following minimum comments:

  • Every function must include a comment above it that explains what the function does (not how it does it), including the significance of all arguments and any effects on or use of global variables (to the extent that there are any).
  • Every significant variable must include a comment that is sufficient to understand what information is stored in the variable and how it is stored. In some cases, you may describe many variables in one comment.
  • Any code based on someone else’s work (such as Stack Overflow or other people) must include a comment for citation. There is space for this provided in the header comment template. Please be cautious of overreliance on Generative AI tools to write code for you.
  • In addition, there should be a comment at the top of the file giving basic identifying information, including your name, the date, and purpose of the file.

  4. Use appropriate names for variables and functions: nouns or noun phrases suggesting the contents of variables or the results of value-returning functions; verbs or verb phrases for void functions that perform an action without returning a value. Variables of local significance like loop counters, indices, or pointers should be given simple names like i or p, and often do not require further comments.

  5. Avoid global variables. Use parameters (particularly pointers) appropriately.

  6. You may use an appropriate #define MAXLINE directive to set the maximum line length mentioned above.

  7. No unnecessary computation. Don’t make unnecessary copies of large data structures; use pointers. (Copies of ints, pointers, and similar things are cheap; copies of arrays and large structs are expensive.) Don’t read the input by calling a library function to read each individual character. Read the input a line at a time (it costs just about the same to call an I/O function to read an entire line into a char array as it does to read a single character). Your code should be simple and clear, not complex and full of micro-optimizations that don’t matter.

  8. You should use the cpplint.py style checker, which is provided to you in the git repo:

  • Use ./cpplint.py --clint wordcount.c to review your code. If this fails, you must call python3 explicitly: python3 ./cpplint.py --clint wordcount.c.
  • There is more help for using this code on the CSE 333 page.
  • While this checker may flag a few things that you wish to leave as-is (notably, you may ignore warnings about strtok), most of the things it catches, including whitespace errors in the code, should be fixed. We will run this style checker on your code to check for any issues that should have been fixed. Use Ed or office hours if you have questions about particular clint warnings.

Tip

All reasonable programming text editors have commands or settings to use spaces instead of tabs at the beginning of lines, which is required by the style checker and is much more robust than having tabs in the code. For example, if you are an Emacs user, you can add the following line to the .emacs file in your home directory to ensure that Emacs translates leading tabs to spaces: (setq-default indent-tabs-mode nil).

Implementation Hints

  1. There are a lot of things to get right here; the job may seem overwhelming if you try to do all of it at once. But if you break it into small tasks, each of which can be done on its own, it should be quite manageable. For instance, figure out how to process a single file before you implement the logic to process all of the files on the command line. (Or vice versa, but start small and test before you move on.)

  2. Think before you code. You will ultimately get the job done faster, better, and with less pain if you spend some time to sketch your design (which functions are needed? what exactly do they do? what are the main data structures?) before you write detailed code. Start coding by writing function headings and heading comments and creating significant variables – and commenting those too. Then as you write detailed code and test it you will have your written design information in the comments to compare and check as you work on the code.

  3. I/O is relatively expensive, while storing one more integer is relatively inexpensive. As a result, it is likely a good idea to write one function that calculates all the potential output values in one go, and use the options to determine which ones to print to stdout.

  • You may need to do some additional research on File I/O using FILE objects. Here is a reference page for starters. The code from the lecture8 file_demo.c shows examples of using a few basic File I/O functions.

  • Make sure that every file stream you open is subsequently closed.

  4. While traversing the file and counting characters, words, and lines, it might be useful to know that words can be ended by these characters: a space “ ”, a tab “\t”, and a newline “\n”. Lines are ended by the character “\n”.

  5. Every time you add something new to your code, test it. Right Now! Immediately!! BEFORE YOU DO ANYTHING ELSE!!! It is much easier to find and fix problems if you can isolate the potential bug to a small section of code you just added or changed.

  • If you make your own test files, make sure there’s a newline at the end of the file, i.e., hit Enter again at the end of the file! Some applications are sensitive to the absence of a newline at the end of the file, so it’s best practice to include one.

  6. The standard C library contains many functions that you will find useful. In particular, look at the <stdio.h>, <string.h>, <ctype.h> and <stdlib.h> headers.

  7. strlen tells you how many characters are in a string, not counting the null terminator (\0).

  8. Make sure to use the compiler -Wall option. Don’t waste time searching for errors that the compiler or run-time tests could have caught for you.

  9. Be sure to test for errors like trying to open or read a nonexistent file to see if your error handling is working properly.

  10. git commit early and often! It’s nice to have a history of your program versions.

Tip

Once you’re done, read the instructions again to see if you overlooked anything.