Detecting fraudulent data, from the back and from the front

Part 1 due: at 11pm on Thursday, February 14.
Part 2 due: at 11pm on Thursday, February 21.
Submit via the A6 Turn-in Page and A7 Turn-in Page.

This assignment will give you practice with the basics of statistical testing.

In this assignment, you will try to detect fraud in a dataset in two different ways: from the back, by examining the least significant digits of the data (Part 1), and from the front, by examining the most significant digit (Part 2).

To get started, download and extract the homework6.zip file.

You will write a program in a new file, fraud_detection.py, that you will create. For some problems, you will also write answers in a file named answers.txt.


Coding style

A portion of your grade depends on use of good coding style. Code written in good style is easier to read and understand, is easier to modify, and is less likely to contain errors. You should comment clearly, create functions as needed, and minimize redundancy (Don't Repeat Yourself).

Your program's documentation should allow us to understand it. You can get some ideas on how to document your code from the starter code of previous assignments. Follow good docstring conventions. You do not have to follow these instructions to the letter.

Different parts of this assignment require similar, but not exactly identical, work. When you notice such repetition, you should refactor and generalize your code. That is, rather than writing two similar routines, you should write one routine that takes the place of most of both.

You should decompose functions into smaller helper functions. A useful rule of thumb is that if you cannot come up with a descriptive name for a function, then it may be too small or too large. If there is a good name for a portion of a function, then you might consider abstracting it out into a helper function.

We have not provided tests or exact results to check your program against. We encourage you to write your own tests and to use assert statements.

Important: Leave yourself some time to go back and refactor your code before you turn it in. Whenever you make a change to your program, ensure that it produces the same results as before. It is very possible to get a low grade on this assignment even if your program correctly performs all of the requested calculations.

Part 1: Detecting fraudulent data from the back

In this part of the assignment, you will look for fraud in election returns from the disputed 2009 Iranian presidential election. You will examine the least significant digits of the vote totals — the ones place and the tens place.

The ones place and the tens place don't affect who wins. They are essentially random noise, in the sense that in any real election, each value is equally likely. Another way to say this is that we expect the ones and tens digits to be uniformly distributed — that is, 10% of the digits should be “0”, 10% should be “1”, and so forth. If these digits are not uniformly distributed, then it is likely that the numbers were made up by a person rather than collected from ballot boxes. (People tend to be poor at making up truly random numbers.)

It is important to note that a non-uniform distribution does not necessarily mean that the data is fraudulent. A non-uniform distribution is a strong signal of fraud, but it is possible for a non-uniform distribution to arise naturally.

Getting Started

Create a file called fraud_detection.py for your Python program. As we have given you no starter code, it is up to you to create this program from scratch.

There are a few specific details that you must adhere to. The first is that your program's output should exactly match the following formatting, including capitalization and spacing (except where ___ is replaced by your answers).

2009 Iranian election MSE: ___
Quantity of MSEs larger than or equal to the 2009 Iranian election MSE: ___
Quantity of MSEs smaller than the 2009 Iranian election MSE: ___
2009 Iranian election null hypothesis rejection level p: ___
2008 United States election MSE: ___
Quantity of MSEs larger than or equal to the 2008 United States election MSE: ___
Quantity of MSEs smaller than the 2008 United States election MSE: ___
2008 United States election null hypothesis rejection level p: ___

Some of the problems request that you write functions that create plots and save them to files. When it runs, your program should generate and write these files (even though the files themselves do not appear in the printed output above).

We do ask for specific functions that take exact parameter formats and return exact output formats. You must preserve the names, parameters, and output of these functions. The functions that we ask for, each described in the problems below, are: extract_election_vote_counts, ones_and_tens_digit_histogram, plot_iranian_least_digits_histogram, plot_distribution_by_sample_size, mean_squared_error, calculate_mse_with_uniform, and compare_iranian_mse_to_samples.

Lastly, you should use a main function to organize the execution of code in your program. You may begin with the following template code, which goes at the bottom of your program, after the definitions of all of your functions.

# The code in this function is executed when this file is run as a Python program
def main():
    ...

if __name__ == "__main__":
    main()

Other than the if __name__ == "__main__" block shown above (which calls main), your program should not execute any code when it is loaded; that is, all statements should be inside a function, never at the top level.

Make sure to add import statements to gain access to tools and functions that are not included by default, such as matplotlib.pyplot or math. All import statements should be at the top of the file.

Problem 1: Read and clean Iranian election data

There were four candidates in the 2009 Iranian election: Ahmadinejad, Rezai, Karrubi, and Mousavi. The file election-iran-2009.csv contains data, reported by the Iranian government, for each of the 30 provinces. We are interested in the vote counts for each of these candidates. Thus, there are 120 numbers we care about in the file.

Write a function called extract_election_vote_counts that takes a filename and a list of names of columns to extract vote counts from. It should return a list of all of the vote counts from the named columns, across all rows, as integers. The order of the integers in the returned list does not matter. You may assume that the names that are passed in the list do exist as column names in the data file.

You may want to refer to String Methods and int() from the Python language documentation. Instead of using split, you should make use of csv.DictReader, which will make it easier to produce a clean solution.

>>> extract_election_vote_counts("election-iran-2009.csv", ["Ahmadinejad", "Rezai", "Karrubi", "Mousavi"])
[1131111, 16920, 7246, 837858, 623946, 12199, 21609, 656508, ...

You will notice that the data contains double-quotes and commas. It is common to receive data that is not formatted quite how you would like. It is important to be able to clean data before analysis. It is up to you to handle the input by removing these symbols from the data before converting them into numbers.
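
For example, a cleaning step along these lines would work. This is only a sketch, not the required solution; parse_vote_count and print_candidate_counts are illustrative names, not functions the assignment asks for.

import csv

def parse_vote_count(raw):
    # Remove any stray double-quotes and thousands-separator commas, then
    # convert the remaining digits to an integer, e.g. "1,131,111" -> 1131111.
    return int(raw.replace('"', '').replace(',', '').strip())

def print_candidate_counts(filename, column):
    # csv.DictReader yields one dictionary per row, keyed by column name.
    with open(filename) as f:
        for row in csv.DictReader(f):
            print(parse_vote_count(row[column]))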

Problem 2: Make a histogram

Write a function ones_and_tens_digit_histogram that takes as input a list of numbers and produces as output a list of 10 numbers. Each element of the result indicates the frequency with which that digit appeared in the ones place or the tens place in the input. Here is an example call and result:

>>> ones_and_tens_digit_histogram([0, 1, 1, 2, 3, 5, 8, 13, 21, 34, 55, 89, 144, 233, 377, 610, 987, 1597, 2584, 4181, 6765])
[0.21428571428571427, 0.14285714285714285, 0.047619047619047616, 0.11904761904761904, 0.09523809523809523, 0.09523809523809523, 0.023809523809523808, 0.09523809523809523, 0.11904761904761904, 0.047619047619047616]

In this example call, index 1 of the list contains 0.14285714285714285 because the digit 1 appears in 14.285714285714285% of the ones and tens digits of the given numbers.

In a number that is less than 10, such as 3, the tens place is implicitly zero. That is, 3 must be treated as 03. Your code should treat the tens digits of these values as zero.
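
One way to extract those two digits is with integer arithmetic; this is only a sketch, and a string-based approach works just as well.

def last_two_digits(n):
    # The ones digit, and the tens digit (which is implicitly 0 for n < 10).
    ones = n % 10
    tens = (n // 10) % 10
    return ones, tens

# last_two_digits(3) is (3, 0), since 3 is treated as 03;
# last_two_digits(1597) is (7, 9).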

Problem 3: Plot election data

Write a function called plot_iranian_least_digits_histogram that takes a histogram (as created by ones_and_tens_digit_histogram) and graphs the frequencies of the ones and tens digits for the Iranian election data. Save your plot to a file named iran-digits.png using pyplot.savefig. The function should return None. It is alright to have the name of the Iranian election file and the names of the Iranian candidates hard-coded as strings inside of this function.

>>> plot_iranian_least_digits_histogram(histogram)

The resultant plot should be identical to the following plot. Don't forget the x- and y-axis labels and the legend. Use pyplot.plot for the line itself. To create the legend at the top right corner, use pyplot.legend, and don't forget to use the label= optional argument to pyplot.plot.

[Figure: Histogram of last digits of Iran election results]

You may wish to reference the pyplot tutorial. As a hint (that is also discussed in the tutorial) be sure that the call to pyplot.savefig comes before any call to pyplot.show; if savefig comes after, the graph will be empty.
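
Putting those hints together, the skeleton of such a plotting function might look like the sketch below. The axis labels and line labels here are placeholders (assumptions, not required strings); match yours to the sample plot.

import matplotlib.pyplot as plt

def plot_digit_histogram(histogram, filename):
    # One line for the observed frequencies, one for the ideal uniform distribution.
    plt.plot(range(10), histogram, label="iran")
    plt.plot(range(10), [0.1] * 10, label="ideal")
    plt.xlabel("Digit")
    plt.ylabel("Frequency")
    plt.legend(loc="upper right")
    plt.savefig(filename)   # save first ...
    plt.show()              # ... then (optionally) display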

The Iran election data are rather different from the expected flat line at y = 0.1. Are these data different enough that we can conclude that the numbers are probably fake? You can't tell just by looking at the graphs we have created so far. We will show more principled, statistical ways of making this determination.

Problem 4: Smaller samples have more variation

With a small sample, the vagaries of random choice might lead to results that seem different than expected. As an example, suppose that you plotted a histogram of 20 randomly-chosen digits (10 random numbers, 2 digits per number):

[Figure: Histogram of 20 random digits]

That looks much worse than the Iran data, even though it is genuinely random! Of course, it would be incorrect to conclude from this experiment that the data for this plot is fraudulent and that the data for the Iranian election is genuine. Just because your observations do not seem to fit a hypothesis does not mean the hypothesis is false — it is very possible that you have not yet examined enough data to see the trend.

Write a function called plot_distribution_by_sample_size. This function creates 5 different-sized collections of random numbers. Then, it plots the digit histograms for each of those collections. Your function should save your plot as random-digits.png. The function should return None.

The graph should look like the figure above (note the title, legend, and x- and y-axis labels), but it should have 5 plots, for 10, 50, 100, 1000, and 10000 random numbers. Those plots should be in different colors, so that you can distinguish them, and should all be mentioned in the legend. (The legend will be so large that it may cover up some of the lines; that is OK.) Naturally, random variation will make your graph look different than this one (and it will differ from run to run of your program).

You will want to use random.randint to generate numbers in the range [0, 99], inclusive.
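
Building one collection of random numbers is then a one-liner; random_sample below is just an illustrative sketch.

import random

def random_sample(n):
    # n random integers in [0, 99]; each one contributes a ones digit and a tens digit.
    return [random.randint(0, 99) for _ in range(n)]

# For each size in [10, 50, 100, 1000, 10000], pass random_sample(size) to
# ones_and_tens_digit_histogram and plot the result as one labeled line.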

>>> plot_distribution_by_sample_size()

Your plot demonstrates that the more datapoints there are, the closer the result is to the ideal histogram. We must take the sample size into account when deciding whether a given sample is suspicious.

Problem 5: Comparing variation of samples

You can see visually that some graphs are closer to the ideal than others, but we would like a way to determine this computationally.

We would like a way to determine how similar two graphs are — and more specifically, we would like to determine whether the difference between graphs A and B is larger or smaller than the difference between graphs C and D. For this, we will define a distance metric. Given two graphs, it returns a number — a distance — that is 0 if the two graphs are identical, and is larger the more different two graphs are.

One common measure for the difference/distance between two datasets is the mean squared error. For each pair of corresponding datapoints, compute the difference between the two values, then square it. The overall distance measure is the sum of these squares. (In this assignment we take the sum of the squared differences rather than their mean, but we still call the result the MSE.)

The use of squares means that one really big difference between corresponding datapoints carries more weight than several small differences. It also means that the distance between A and B is the same as the distance between B and A. That is, (9 - 4)^2 is the same as (4 - 9)^2.

For example, suppose that you had the data that appears in the following table and plot:

x    f(x)    g(x)    h(x)
1     1       2       6
2     4       3       5
3     9       4       4

[Figure: Example for mean squared error computation]

The MSE difference between f and g is (1 - 2)^2 + (4 - 3)^2 + (9 - 4)^2 = 27.
The MSE difference between f and h is (1 - 6)^2 + (4 - 5)^2 + (9 - 4)^2 = 51.
The MSE difference between g and h is (2 - 6)^2 + (3 - 5)^2 + (4 - 4)^2 = 20.

The absolute values of the MSE are not interesting; it's only comparisons between them that are. These numbers show that g and h are the most similar, and f and h are the most different.

Write a function mean_squared_error that, given two lists of numbers, computes the mean squared error between the lists.

>>> mean_squared_error([1, 4, 9], [6, 5, 4])
51
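
A minimal sketch of this function, matching the sum-of-squared-differences definition used above (it assumes the two lists have the same length):

def mean_squared_error(xs, ys):
    # Sum of squared differences between corresponding elements;
    # for [1, 4, 9] and [6, 5, 4] this is 25 + 1 + 25 = 51.
    return sum((x - y) ** 2 for x, y in zip(xs, ys))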

Statistics background

The 120 datapoints from the 2009 Iranian election are a small sample of some hypothetical very large dataset. We don't know what that large dataset is, but we want to answer a question about it: does that dataset have uniformly-distributed ones and tens digits? If, just from looking at our small sample, we can determine that the large unknown dataset does not have uniformly-distributed ones and tens digits, then we can conclude that the observed sample is fraudulent (it came from some other source, such as some bureaucrat's imagination).

One sample can't conclusively prove anything about the underlying distribution. For example, there is a very small possibility that, by pure coincidence, a fair election might produce 120 numbers that all end with “11”. If we saw such a sample, in which every ones and tens digit is 1, we would be quite sure, but not 100% sure, that the data is fraudulent.

Our methodology is as follows: We take as an assumption that the observed sample (the Iranian election data) is not fraudulent — we call this the “null hypothesis”. Our question is whether we can, with high likelihood, reject that assumption. Our specific question is, “What is the likelihood that the Iranian election data is a sample of a large unknown dataset whose least significant digits are uniformly distributed?”

Problem 6: Comparing variation of samples

Augment your program with a function called calculate_mse_with_uniform that takes a histogram (as created by ones_and_tens_digit_histogram) and returns the mean squared error of the given histogram with the uniform distribution. Invoking calculate_mse_with_uniform with the Iranian election results histogram (for the ones and tens digits) should return the result 0.00739583333333, or approximately 0.007.

>>> calculate_mse_with_uniform(histogram)
0.00739583333333

This number on its own does not mean anything — we don't know whether it is unusually low, or unusually high, or about average. To find out, we need to compare it to similarly-sized sets.

Write a function called compare_iranian_mse_to_samples that takes the Iranian MSE (as computed by calculate_mse_with_uniform) and compares it to the MSE from the uniform distribution for each of 10,000 groups of random numbers, where each group is the same size as the Iranian election data (120 numbers). You will only use the last two digits of the random numbers.

Your function should determine where the passed-in MSE (for our sample of the 2009 Iranian election data, this is approximately 0.007) falls relative to the 10,000 computed MSEs. In other words, determine how many of the random MSEs are larger than or equal to the Iran MSE, and how many of the random MSEs are smaller than the Iran MSE. Print these values. This function should return None. With each run of your program, you should expect a slightly different outcome from this function call.

>>> compare_iranian_mse_to_samples(0.00739583333333)
Quantity of MSEs larger than or equal to the 2009 Iranian election MSE: ___
Quantity of MSEs smaller than the 2009 Iranian election MSE: ___
2009 Iranian election null hypothesis rejection level p: ___

Put the output from one run of this function in answers.txt.
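
The core of the comparison is a counting loop like the sketch below. It assumes, as is conventional, that the reported rejection level p is simply the fraction of random MSEs that are at least as large as the observed one; count_extreme_mses is an illustrative name, and ones_and_tens_digit_histogram and calculate_mse_with_uniform are your functions from Problems 2 and 6.

import random

def count_extreme_mses(observed_mse, sample_size, trials=10000):
    # Simulate `trials` groups of `sample_size` two-digit random numbers and count
    # how many of their MSEs (against the uniform distribution) reach the observed MSE.
    larger_or_equal = 0
    for _ in range(trials):
        sample = [random.randint(0, 99) for _ in range(sample_size)]
        mse = calculate_mse_with_uniform(ones_and_tens_digit_histogram(sample))
        if mse >= observed_mse:
            larger_or_equal += 1
    smaller = trials - larger_or_equal
    p = larger_or_equal / float(trials)   # estimated null hypothesis rejection level
    return larger_or_equal, smaller, p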

Interpreting statistical results

Below are some possibilities for the outcome of the previous question, each with an explanation of how to interpret such a result. Pay attention to the percentage of the random MSEs that the Iranian election MSE is greater than or equal to. For example, if only 5% or fewer of the 10,000 random MSEs are larger than or equal to the Iranian MSE, then genuine data would produce a deviation from uniform this large less than about 5% of the time, and it is conventional to reject the null hypothesis at that 5% significance level.

Suppose that you are only testing genuine, non-fraudulent elections. Then, 1 time in 20, the above procedure will cause you to cry foul and make an incorrect accusation of fraud. This is called a false positive, false alarm, or Type I error. False positives are an inevitable risk of statistics. If you run enough different statistical tests, then by pure chance, some of them will (seem to) yield an interesting result. You can reduce the chance of such an error by reducing the 5% threshold discussed above. Doing so makes your test less sensitive: your test will more often suffer a false negative, missed alarm, or Type II error — a situation where there really was an effect but you missed it. That is, there really was fraud but it was not detected.

In this assignment, you have computed approximate statistical confidence via simulation: generation of random data, then comparison. There are better, closed-form formulas for computing exact statistical confidence, but we do not want to burden you with sophisticated math. More importantly, this idea of performing many trials, and seeing how likely or unlikely the real data are, is at the heart of all statistics, and it is more important than understanding a set of formulas.

If you are interested, Wikipedia has more on hypothesis testing. Reading this is optional, however.

Problem 7: Interpret your results

Interpret your results in answers.txt, using the ideas and vocabulary from the Interpreting Statistical Results section. State whether the data suggest that the Iran election results were tampered with before being reported to the press. Briefly justify your answer.

Problem 8: Other datasets

We have provided you with another dataset, from the 2008 US presidential election. It appears in file election-us-2008.csv, and is taken from Wikipedia. Consider the null hypothesis that it was indeed generated according to a uniform distribution of ones and tens digits. That is, your null hypothesis is that this data follows the patterns of a genuine data set.

Update your program to include calculations for the United States 2008 presidential election in addition to the 2009 Iranian election. Use the following list of candidates:

    us_2008_candidates = ["Obama", "McCain", "Nader", "Barr", "Baldwin", "McKinney"]

Additionally, update your program to include all of the requested output as described in the Getting Started section. Make sure to refactor your code from previous solutions to be general enough to handle the 2008 United States Election as opposed to duplicating your code.

When a datum is missing (that is, an empty space in the .csv file), your calculation should ignore that datum. Do not transform it into a zero.

Do not include data for "other voters" in your calculations.

You do not need to produce graphs or plots for the US election — just the textual output.

In answers.txt, state whether you can reject that hypothesis, and with what confidence. Briefly justify your answer.

Submit part 1

Submit the following files via the A6 Turn-in Page: fraud_detection.py and answers.txt.

Furthermore, be sure that fraud_detection.py generates the following files upon execution: iran-digits.png (Problem 3) and random-digits.png (Problem 4).

Don't forget the collaboration and reflection sections, in a file named answers.txt, and the survey about how much time you spent on part 1 of this assignment.

Part 2: Detecting fraudulent data from the front

In this part of the assignment, you will look for fraud in geographical data (place populations) and in financial data. You will examine the most significant digit of the data — that is, the leftmost digit.

For Part 2, please use the same fraud_detection.py file that you used in part 1. Add new code where necessary, and submit that same file again at the end.

You are allowed to change your code from Part 1. However, your program must still satisfy all the requirements of Part 1. When you run your program, it must produce all the output required by Part 1, then all the output required by Part 2. You must abide by all the requirements of Part 1 regarding the number of parameters and specification/behavior of each function. One way to generalize is to create a helper function, copy the body of an existing function to the helper function, and make the original function's body be little more than a call to the helper function. Since you defined the helper function, you are allowed to give it any name, any number of parameters, and any specification that you like.

In this part of the assignment, the structure of your program is entirely up to you. Your program's external correctness will be graded based on the .png images your code generates, so you are free to define any functions with any names and parameters you would like. You will find, however, that a good function decomposition will make this assignment much easier.

We will still grade your code by hand, so make sure to practice good coding and commenting style.

The Benford's Law distribution

Suppose that you measure some naturally-occurring phenomenon, such as the land area of cities or lakes, or the price of stocks on the stock market. You can plot a histogram of how frequently each digit (1-9) appears in the most significant place.

Benford's Law states that for natural processes such as these, the probabilities of seeing each digit in the most significant place are shown in the table below (from Wikipedia). Let P(d) be the probability of the given digit being the first one in a measurement. Benford's Law also states the remarkable fact that all of these processes produce the same histogram!

d    P(d)
1    30.1%
2    17.6%
3    12.5%
4    9.7%
5    7.9%
6    6.7%
7    5.8%
8    5.1%
9    4.6%

Think about this: your measurements were made in some arbitrary units, such as square miles or dollars, but what if you made the measurements in some other units, such as acres or euros? Would you expect the histogram to change?

In fact, you would not expect the shape of the histogram to differ just because you changed the units of measurement. This is because the units are arbitrary and unrelated to the underlying phenomenon. If the shape of the histogram did change, that could indicate that the data were fraudulent. This approach has been used to detect fraud, particularly in accounting and economics, but also in science.

Data are called scale invariant if measuring them in any units yields similar results. Many natural processes produce scale-invariant data: for such processes, regardless of the units used, the histogram of leading digits will be essentially the same.

Benford's law only holds when the data has no natural limit or cutoff. For example, it would not hold for grades in a class (which are in the range 0% to 100% or 0.0 to 4.0) nor for people's height in centimeters (where almost every value would start with 1 or 2). If you are interested, Wikipedia has more information about Benford's law. Reading the Wikipedia article is optional, however.

Problem 9: Plotting Benford's distribution

The histogram of first digit values for a distribution obeying Benford's Law can be computed as P(d) = log10(1 + 1/d).
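
In code, the ideal Benford histogram is a nine-element list; benford_distribution below is only an illustrative sketch.

import math

def benford_distribution():
    # P(d) = log10(1 + 1/d) for d = 1 .. 9; the first entry is about 0.301,
    # matching the 30.1% shown for d = 1 in the table above.
    return [math.log10(1 + 1.0 / d) for d in range(1, 10)]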

Plot the values produced by evaluating the Benford's Law formula for d on the interval [1, 10). Your plot should look like this, including the same x- and y-axis labels and the same legend:

[Figure: Graph of Benford's Law first digits]

Use pyplot.plot for the line itself. (You may also find the pyplot tutorial useful.) You will also need to use Python's math.log10 function.

Save your plot as scale-invariance.png. You will turn it in later.

Problem 10: Sampling datapoints to fit Benford's law

In this part of the problem, you will create artificial data that obeys Benford's Law.

Here is one way to generate datapoints that obey Benford's Law:

  1. Pick a random number r uniformly in the range [0.0, 30.0). That is, the value is greater than or equal to 0, it is less than 30, and every value in that range is equally likely. Hint: use random.random or random.uniform.
  2. Compute e^r, where e is the base of the natural logarithms, or approximately 2.71828. Hint: use math.e.

Generate 1000 datapoints using the above technique.
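
Generating the whole collection might look like this sketch (benford_samples is an illustrative name):

import math
import random

def benford_samples(n):
    # Each sample is e^r, with r drawn uniformly from [0.0, 30.0).
    return [math.e ** random.uniform(0.0, 30.0) for _ in range(n)]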

On your graph from Problem 9, draw another line, labeled “1000 samples”, that plots the frequency of the most significant digits of your 1000 samples (where each sample is the result of calculating e^r). Don't use the pyplot.hist routine — just use pyplot.plot, as you did above. Don't create a new graph — modify the one you made in Problem 9.
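
To plot that line you need each sample's most significant digit and the frequency of each digit. One simple approach is sketched below; it assumes every value is at least 1 (which holds for e^r with r >= 0), and the helper names are only illustrative.

def first_digit(x):
    # Most significant decimal digit of a number >= 1,
    # e.g. first_digit(837858) is 8 and first_digit(2.7e12) is 2.
    return int(str(int(x))[0])

def first_digit_histogram(numbers):
    # Fraction of the numbers whose most significant digit is d, for d = 1 .. 9.
    counts = [0] * 9
    for x in numbers:
        counts[first_digit(x) - 1] += 1
    return [c / float(len(numbers)) for c in counts]

Plot the nine resulting frequencies with pyplot.plot, using the label= argument so the new line appears in the legend.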

Hint: You may find it helpful to write helper functions, much as you did in Part 1.

Problem 11: Scale invariance

In this problem, you will see that the scale-invariance property holds.

On your graph from Problems 9 and 10, draw another line, where each datapoint is computed as π × e^r. For the label in the legend, use the string “1000 samples, scaled by $\pi$”. (The funny “$\pi$” will show up as a nicely-formatted π in the graph.)

Compare this line with the one you drew in Problem 10. There are some differences due to random choices, but overall it demonstrates the scale-independence of the distribution you just created. It also demonstrates the scale-independence of the Benford distribution, since it is so similar to the one you just created. (It is possible to demonstrate the scale-independence of Benford's Law mathematically as well. You are welcome to try doing this, but it is not required.)

You now have a single plot with three functions graphed on it (from problems 9-11). Turn in this file as scale-invariance.png.

Problem 12: Population of U.S. cities

We wish to know whether the population of cities in the United States follows Benford's Law.

Your directory contains a file SUB-EST2009_ALL.csv with United States census data. The file is in “comma-separated values” format. You can parse this file the same way as you did in Problem 1.

Create a new plot like the one from Problem 9. It should have only the theoretical Benford's distribution, calculated as log10(1 + 1/d) for each digit d. Then plot on it a histogram of the frequency of each first digit in the data from the "POPCENSUS_2000" column of the file. Label it "US (all)".

Just like in Problem 1, you might run into unclean data. You should handle this data in the same way you did then.

If any city has population 0 in the 2000 census, you may ignore this city.
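
Reading that column might look like the sketch below; read_census_populations is an illustrative name, the comma and quote handling mirrors Problem 1, and skipping blank entries is an assumption consistent with the cleaning rules above.

import csv

def read_census_populations(filename):
    # Collect the POPCENSUS_2000 values, cleaning commas and quotes as in Problem 1,
    # skipping blank entries, and ignoring zero populations.
    populations = []
    with open(filename) as f:
        for row in csv.DictReader(f):
            raw = row["POPCENSUS_2000"].replace(",", "").replace('"', '').strip()
            if raw == "":
                continue
            value = int(raw)
            if value > 0:
                populations.append(value)
    return populations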

Save this plot as population-data.png. You will turn it in later.

Your graph now has two curves plotted on it. From the similarity of the two curves, you can see that the population of U.S. cities obeys Benford's Law.

Problem 13: Population of places from literature and pop culture

The file literature-population.txt contains the populations of various places that appear in literature and pop culture. We would like to know whether these populations are characteristic of real population counts, in the sense of obeying Benford's Law equally well.

On your graph from Problem 12, plot the data from the file literature-population.txt. Label this plot "Literature Places". Notice that the data are similar to, but not exactly the same as, the real dataset and the perfect Benford's-Law histogram.

Once again, note that this data may not be clean.

Are these data far enough from obeying Benford's Law that we can conclude that the numbers are fake? Or are they close enough to obeying Benford's Law that they are statistically indistinguishable from real data? You can't tell just by looking at the graphs we have created so far. We will show more principled, statistical ways of making this determination.

Your plot now has three lines plotted on it. Turn this plot in as population-data.png.

Problem 14: Smaller samples have more variation

The larger a sample is, the closer it is to the ideal distribution. With smaller samples, the vagaries of random choice might lead to results that seem different from what is expected. As an extreme example, suppose that you plotted a histogram of just the first 10 cities in the SUB-EST2009_ALL.csv file. (You don't have to do this for the assignment.) It would look like this:

[Figure: Plot of first digits of 10 cities]

This graph is rather different from the Benford's-Law histogram, but that does not necessarily mean that city populations do not obey Benford's Law — maybe you have not yet examined enough data to see the trend.

Create a new graph like the one from Problem 10. Add to it plots for 10, 50, 100, and 10000 randomly-selected values of r. In other words, where in Problem 10 you used 1000 samples, here you should additionally use 10, 50, 100, and 10000. Your final graph will plot six functions. You should label these functions "10 samples", "50 samples", and so on, just as in Problem 10.

Save your plot as benford-samples.png. Turn in this plot.

Notice that the larger the sample size, the closer the distribution of first digits approaches the theoretically perfect distribution. This demonstrates that the more datapoints there are, the closer the sample comes to the true distribution.

Statistics background

A distribution is a process that can generate arbitrarily many datapoints. A distribution obeys Benford's Law if, after choosing infinitely many points, the frequency of first digits is exactly P(d) = log10(1 + 1/d).

A sample is a finite set of measurements. We wish to know whether it is possible that these measurements were chosen from some distribution that obeys Benford's Law.

The populations of places from literature are a small sample — just a few dozen datapoints. If, just from looking at our small sample, we can determine that the unknown distribution they came from does not obey Benford's Law, then we can conclude that the observed sample is not a result of a natural process (in this case, it is a result of the authors' choices, not a natural process). We can conclude that because place populations from the United States and elsewhere in the real world do obey Benford's Law.

One sample can't conclusively prove anything about the underlying distribution. For example, there is a very small possibility that, by pure coincidence, we might randomly choose 100 numbers that all start with the digit 1. If we saw a sample of 100 datapoints whose first digits were all 1, we would be quite sure, but not 100% sure, that the underlying distribution does not obey Benford's Law. We might say that we are more than 99.9% sure that the data is fraudulent.

So, our question is, “What is the probability that the populations from literature are a sample of an unknown distribution that does not obey Benford's Law?” In other words, if we had to bet on whether the literature place populations are fraudulent, what odds would we give? We will determine a quantitative answer to this question.

We take as an assumption that the observed sample (the populations from literature) are not fraudulent — we call this the “null hypothesis”. Our question is whether we can reject that assumption. Rejecting the assumption means determining that the sample is fraudulent. By “fraudulent”, we mean that it was generated by some other process — such as human imagination — that is different than the natural process that generates real population data.

Problem 15: Comparing variation of samples

Compute the mean squared error (the MSE distance) between Benford's distribution and the histogram of first digits of populations from literature. You should obtain the result 0.00608941252081, or approximately 0.006.

This number on its own does not mean anything — we don't know whether it is unusually low, or unusually high, or about average. To find out, we need to compare it to similarly-sized sets.

Generate 10,000 sets, each of which contains population data from n US towns (n is the size of the literature dataset). Each datapoint in each set should be chosen at random from the POPCENSUS_2000 data. For each of these sets, compute its MSE distance from Benford's distribution.

Now determine how many of the US MSEs are larger than or equal to the literature MSE, and how many of the US MSEs are smaller than the literature MSE. Your program should print out these quantities, in the following format:

Comparison of US MSEs to literature MSE:
larger/equal: ___
smaller: ___

Also paste that output into your answers.txt file.

Hint: Your program should not open and parse the census file 10,000 or 560,000 times. Your program should only read the file once.
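
One workable structure, following that hint, is sketched below. It reuses mean_squared_error from Problem 5 along with the illustrative first_digit_histogram and benford_distribution helpers sketched earlier, and it samples with replacement via random.choice (sampling without replacement with random.sample would also be reasonable).

import random

def random_census_mses(populations, sample_size, trials=10000):
    # populations: the POPCENSUS_2000 values, read from the census file exactly once.
    # For each trial, draw sample_size towns at random and measure how far their
    # first-digit histogram is from Benford's distribution.
    benford = benford_distribution()
    mses = []
    for _ in range(trials):
        sample = [random.choice(populations) for _ in range(sample_size)]
        mses.append(mean_squared_error(first_digit_histogram(sample), benford))
    return mses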

Problem 16: Interpret your results

Interpret your results. Is there evidence that the literature place populations were artificially generated? Answer this in your answers.txt file. Hint: you may wish to use a similar approach to the one you used in Problem 7 (Interpret your results).

Submit part 2

You are almost done!

Look over your work to find ways to eliminate repetition in it. Then, refactor your code to eliminate that repetition. This is important when you complete each part, but especially important when you complete part 2. When turning in part 2, you should refactor throughout the code, which will probably include more refactoring in part 1. You will find that there is some similar code within each part that does not need to be duplicated, and you will find that there are also similarities across the two parts. You may want to restructure some of your part 1 code to make it easier for you to reuse in part 2.

Now look over your work and make sure you practiced good coding style.

At the bottom of your answers.txt file, in the “Collaboration” part, state which students or other people (besides the course staff) helped you with the assignment, or that no one did.

At the bottom of your answers.txt file, in the “Reflection” part, reflect on this assignment. What did you learn from this assignment? What do you wish you had known before you started? What would you do differently? What advice would you offer to future students?

Submit the following files via the A7 Turn-in Page: fraud_detection.py and answers.txt.

Furthermore, be sure that fraud_detection.py generates the following files upon execution: iran-digits.png and random-digits.png (from Part 1), scale-invariance.png (Problems 9-11), population-data.png (Problems 12-13), and benford-samples.png (Problem 14).

Answer a survey about how much time you spent on part 2 of this assignment.