CSE 303, Autumn 2009
Homework 3: digits

Due Thursday, October 29, 2009, 11:30 PM
100 points total

This assignment is inspired by a similar assignment used in CSE303 that was itself adapted from one by Professor Steve Wolfman of UBC.

This assignment has one objective: to help you get used to basic C programming.

You will write a C program digits.c to explore a phenomenon called Benford's Law, which describes a surprising pattern in the frequency of occurrence of the digits 1-9 as the first digits of natural data.
(For example, in the number 328905, the first digit is 3.) You might expect each digit to occur with equal frequency in arbitrary data. Indeed, in truly random data over appropriate ranges, each digit does appear with equal frequency.  However, a substantial amount of data from diverse sources does not exhibit a uniform distribution. If you are curious about the distribution, Benford's Law, and the process by which it was discovered, check out the Wikipedia page about it.
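
As a rough illustration of what "first digit" means here, the following sketch strips trailing digits by repeated division by 10. The function name first_digit and the use of int are assumptions made for this example, not requirements of the assignment.

```c
#include <assert.h>

/* Returns the leading (most significant) decimal digit of a positive integer.
   Hypothetical helper: the name and signature are illustrative only. */
int first_digit(int n) {
    assert(n > 0);      /* this sketch assumes positive input */
    while (n >= 10) {
        n /= 10;        /* drop the last digit until only one remains */
    }
    return n;
}
```

For example, first_digit(328905) evaluates to 3.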

Your program digits.c examines integer data, counts the first digits of those integers, and then outputs statistics. It must read its input from the console (standard input) using the scanf function. You will repeatedly read integers, one per line, until you see the value -1, at which point your program will output its statistics and exit. For full credit, you should appropriately use an array to help you count the first digits. The following is an example file enrollment.txt that shows the number of students enrolled at various major universities. (UW is the first data point with 28,570.) Notice that the file ends with -1 to signify the end of the input. No number starting with a 0, including the number 0 itself, can appear in the file.

28570
12176
5476
543
3490
24892
28619
2595
603
2527
1465
1858
-1
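
One possible shape for the reading loop is sketched below. It is only a sketch under stated assumptions: the function name count_first_digits is hypothetical, each input value is assumed to fit in an int, and long long is assumed to be wide enough for the counts, which may not satisfy every reading of the spec. The only address-of operator used is the one scanf itself requires.

```c
#include <assert.h>
#include <stdio.h>

/* Reads integers from standard input until -1 and tallies their first
   digits. counts[d] holds how many values began with digit d (1-9);
   counts[0] is unused. Returns the number of values read.
   Hypothetical helper: the name and the long long counters are assumptions. */
long long count_first_digits(long long counts[10]) {
    long long total = 0;
    int n;
    int d;

    for (d = 0; d < 10; d++) {
        counts[d] = 0;
    }
    while (scanf("%d", &n) == 1 && n != -1) {
        assert(n > 0);      /* this sketch assumes valid, positive input */
        while (n >= 10) {
            n /= 10;        /* reduce n to its first digit */
        }
        counts[n]++;
        total++;
    }
    return total;
}
```

Because the array is indexed directly by the digit, no chain of if statements is needed to decide which counter to increment.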

If your program were compiled into an executable program named digits and run with the data above (placed in a file and redirected to standard input), the output would be:
$ ./digits < enrollment.txt

Digit   Count   Percent   Histogram
    1       3    25.00%   ************
    2       5    41.67%   ********************
    3       1     8.33%   ****
    4       0     0.00%
    5       2    16.67%   ********
    6       1     8.33%   ****
    7       0     0.00%
    8       0     0.00%
    9       0     0.00%
TOTAL      12

You must match the above output format exactly (no, the horizontal lines are not included in the output). The columns above are separated by exactly 3 spaces. The 'Digit' column is exactly 5 characters wide. The 'Count' column is exactly 5 characters wide, with each count value right-aligned within the column. The 'Percent' column is exactly 7 characters wide, with each percentage value right-aligned within the column and displaying exactly 2 digits after the decimal point. The 'Histogram' column should show exactly one * star character for every full 2% in that row's percentage. For example, since the digit 2 above has 41.67%, it gets 20 stars. You may assume valid input: the input to your program will consist entirely of positive integers until the -1. You may not make any assumption about the number of integers to read. It could be very few, none at all, or a very large number -- yes, even larger than fits in an int or a long or a ....
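
The column widths above map naturally onto printf-style width specifiers. The sketch below builds one row in a character array so the result is easy to inspect. The names star_count and format_row are hypothetical, the 128-character buffer size is an assumption, and so are two behaviors the text above does not settle: a total of zero is shown as 0.00%, and rows with no stars carry no trailing spaces.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* One star per full 2% of the total, computed with integer arithmetic.
   Hypothetical helper: name and types are assumptions. */
int star_count(long long count, long long total) {
    assert(count >= 0 && count <= total);
    if (total == 0) {
        return 0;                       /* assumption: no data, no stars */
    }
    return (int)(count * 50 / total);   /* floor((count / total) * 100 / 2) */
}

/* Formats one table row into line[], which must hold at least 128 chars. */
void format_row(char line[], int digit, long long count, long long total) {
    double percent = (total == 0) ? 0.0 : 100.0 * count / total;
    int stars = star_count(count, total);
    int len;
    int i;

    /* %5d, %5lld and %6.2f%% give the 5-, 5- and 7-wide columns. */
    sprintf(line, "%5d   %5lld   %6.2f%%", digit, count, percent);
    len = (int)strlen(line);
    if (stars > 0) {
        for (i = 0; i < 3; i++) {
            line[len++] = ' ';          /* 3-space column separator */
        }
        for (i = 0; i < stars; i++) {
            line[len++] = '*';
        }
    }
    line[len] = '\0';
}
```

Computing the stars as count * 50 / total in integer arithmetic avoids rounding surprises that can arise when a floating-point percentage is divided by 2 and truncated.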

After you complete this program, you will write a variation of this program, digits2, which takes one argument (and still reads from standard input for the data) that indicates how many digits should be examined and analyzed for each integer in the input file.  That is, digits2 1 should be identical to digits, as the 1 says to analyze the first digit.  digits2 3, however, would analyze the first, second, and third digits of each input entry.  So, for example, the output for digits2 4 on the above input file would be:

1 Digit   Count   Percent   Histogram
      1       3    25.00%   ************
      2       5    41.67%   ********************
      3       1     8.33%   ****
      4       0     0.00%
      5       2    16.67%   ********
      6       1     8.33%   ****
      7       0     0.00%
      8       0     0.00%
      9       0     0.00%
    TOTAL      12

2 Digit   Count   Percent   Histogram
      0       1     8.33%   ****
      1       0     0.00%  
      2       1     8.33%   ****
      3       0     0.00%
      4       5    41.67%   ********************
      5       2    16.67%   ********
      6       0     0.00%
      7       0     0.00%
      8       3    25.00%   ************
      9       0     0.00%
    TOTAL      12

3 Digit   Count   Percent   Histogram
      0       0     0.00%  
      1       1     8.33%   ****
      2       1     8.33%   ****
      3       2    16.67%   ********
      4       0     0.00%  
      5       2    16.67%   ********
      6       2    16.67%   ********
      7       1     8.33%   ****
      8       1     8.33%   ****
      9       2    16.67%   ********
    TOTAL      12

4 Digit   Count   Percent   Histogram
      0       1    10.00%   *****
      1       1    10.00%   *****
      2       0     0.00%  
      3       0     0.00%  
      4       0     0.00%  
      5       2    20.00%   **********
      6       1    10.00%   *****
      7       3    30.00%   ***************
      8       1    10.00%   *****
      9       1    10.00%   *****
    TOTAL      10

Note a couple of things.  First, the digit 0 can appear anywhere except in the first position: the first table includes only digits 1-9, but the remaining tables include 0-9.  Second, if you are examining the nth digit (where 1<=n<=argument) and come across a number with fewer than n digits, that number is not counted. Because the above input file has two 3-digit numbers in it, there are only 10 instances of a "4 Digit" in the final table.  You can assume the argument is an integer between 1 and 9.
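
The rule about short numbers can be sketched as a helper that reports "no such digit" for positions past the end of the number. The name nth_digit and the -1 return convention are assumptions made for this illustration; other designs would work equally well.

```c
#include <assert.h>

/* Returns the digit at position pos (1 = leftmost) of a positive integer,
   or -1 when the number has fewer than pos digits.
   Hypothetical helper: name and return convention are illustrative only. */
int nth_digit(int n, int pos) {
    int length = 1;     /* number of decimal digits in n */
    int rest = n;
    int i;

    assert(n > 0 && pos >= 1);
    while (rest >= 10) {
        rest /= 10;
        length++;
    }
    if (pos > length) {
        return -1;      /* too short: this position is not counted */
    }
    for (i = 0; i < length - pos; i++) {
        n /= 10;        /* drop the digits to the right of position pos */
    }
    return n % 10;
}
```

With the sample data, nth_digit(3490, 4) is 0 and nth_digit(543, 4) is -1, so 543 adds nothing to the "4 Digit" table.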

Your program should produce no errors or warnings from gcc -Wall. You should not use pointers on this assignment. In terms of grading, most of the points will come from the correctness of your programs, although some points will also come from the style, design, and appropriate simplicity of your code. You should not use a more complex command or control structure when a simpler one would achieve the same result. You should reduce redundancy when reasonable.

Turn in your work via the dropbox as a single file hw3.tar.gz containing both digits.c and digits2.c.