Homework 6: Detecting Fraudulent Data

Due: at 11pm on Friday, Feb 24. 2017
Submit via the turnin page. (Survey)

Learning Objectives:

Advice from previous students about this assignment: 14wi 15sp

In this assignment, you will try to detect fraud in a dataset by examining the last digits of the numbers in the dataset. For example, for a datum such as 21063, you would examine the 3 and the 6.

To get started, download and extract the homework6.zip file.

You will write your program in a new file, fraud_detection.py, that you will create. For problems 7 and 8 you will also write answers into the provided answers.txt.

Contents

Coding style

A portion of your grade depends on use of good coding style. Code written in good style is easier to read and understand, is easier to modify, and is less likely to contain errors. You should comment clearly, create functions as needed, and minimize redundancy (Don't Repeat Yourself).

Your program's documentation should allow us to understand it. You can get some ideas on how to document your code from the starter code of previous assignments. Follow good docstring conventions (You do not have to follow these conventions to the letter).

Different parts of this assignment require similar, but not exactly identical, work. When you notice such repetition, you should refactor and generalize your code. That is, rather than writing two similar functions, you should write one function that takes the place of most of both.

You should decompose functions into smaller helper functions. One usual good rule of thumb is that if you cannot come up with a descriptive name for a function, then it may be too small or too large. If there is a good name for a portion of a function, then you might consider abstracting it out into a helper function.

We have not provided tests or exact results to check your program against. We encourage you to write your own tests and to use assert statements.

Important: Leave yourself some time to go back and refactor your code before you turn it in. Whenever you make a change to your program, ensure that it produces the same results as before. It is very possible to get a low grade on this assignment even if your program correctly executes the requested calculations.

Detecting fraudulent data

In this assignment, you will look for fraud in election returns from the disputed 2009 Iranian presidential election. You will examine the least significant digits of the vote totals — the ones place and the tens place.

The ones place and the tens place don't affect who wins. They are essentially random noise, in the sense that in any real election, each value is equally likely. Another way to say this is that we expect the the ones and tens digits to be uniformly distributed — that is, 10% of the digits should be “0”, 10% should be “1”, and so forth. If these digits are not uniformly distributed, then it is likely that the numbers were made up by a person rather than collected from ballot boxes. (People tend to be poor at making up truly random numbers.)

It is important to note that a non-uniform distribution does not necessarily mean that the data is fraudulent data. A non-uniform distribution is a great signal for fraudulent data, but it is possible for a non-uniform distribution to appear naturally.

Getting Started

Create a file called fraud_detection.py for your python program. As we have given you no starter code, it is up to you to create this program from scratch.

There are a few specific details that you must adhere to. The first is that your program's output should exactly match the following formatting, including capitalization and spacing (except where ___ is replaced by your answers).

2009 Iranian election MSE: ___
Quantity of MSEs larger than or equal to the 2009 Iranian election MSE: ___
Quantity of MSEs smaller than the 2009 Iranian election MSE: ___
2009 Iranian election null hypothesis rejection level p: ___

2008 United States election MSE: ___
Quantity of MSEs larger than or equal to the 2008 United States election MSE: ___
Quantity of MSEs smaller than the 2008 United States election MSE: ___
2008 United States election null hypothesis rejection level p: ___

Some of the problems request that you write functions that create plots and save them to files. Upon execution, your program should generate and write these files.

We do ask for specific functions that take exact parameter formats and return exact output formats. You must preserve the names, parameters, and output of these functions. DO NOT add more parameters to these functions. The functions that we ask for are as follows:

Note that you are STRONGLY ENCOURAGED to create additional functions as needed, but we require that the functions above be present with these EXACT names and parameters.

Lastly, you should use a main function to organize the execution of code in your program. You may begin with the following template code, which goes at the bottom of your program, after the definitions of all of your functions.

# The code in this function is executed when this file is run as a Python program
def main():
    ...

if __name__ == "__main__":
    main()

This if statement and the call to main inside of it should be the last line in your file, after all of the function definitions. Your program should not execute any code, other than the main function, when it is loaded; that is, all statements should be inside a function, never at the global level. It is o.k. to add testing functions to this file (although you may wish to put them in a separate file as was done for HW5) but be sure to comment out any calls to these functions you may have put in at the global level. Do not place any assert statements outside of functions - place them inside of test functions.

Make sure to add import statements to gain access to tools and functions that are not included by default, such as matplotlib.pyplot or math. All import statements should be at the top of the file, before any function definitions.

Problem 1: Read and clean Iranian election data

There were four candidates in the 2009 Iranian election: Ahmadinejad, Rezai, Karrubi, and Mousavi. The file election-iran-2009.csv contains data, reported by the Iranian government, for each of the 30 provinces. We are interested in the vote counts for each of these candidates. Thus, there are 120 numbers we care about in the file.

Write a function called extract_election_vote_counts that takes a filename and a list of column names and returns a list of integers that contains the values in those columns from every row. The order of the integers does not matter.

Here is a sample of what your function should return:

>>> extract_election_vote_counts("election-iran-2009.csv", ["Ahmadinejad", "Rezai", "Karrubi", "Mousavi"])

[1131111, 16920, 7246, 837858, 623946, 12199, 21609, 656508, ...


Hints:

You should make use of csv.DictReader , which will make it easy to read lines from a csv file. Remember to close any file you open in your program.

You will notice that the data contains double-quotes and commas. It is common to receive data that is not formatted quite how you would like for it to be. It is important to be able to clean data before analysis. It is up to you to handle the input by removing these symbols from the data before converting them into numbers.

To do this, you may want to refer to String Methods and int() from the Python language documentation. In particular, we recommend using the replace method to replace unwanted characters with the empty string ("").

Problem 2: Make a histogram

Write a function ones_and_tens_digit_histogram that takes as input a list of numbers and produces as output a list of 10 numbers. In the returned list, the value at index i is the frequency with which digit i appeared in the ones place OR the tens place in the input list. (We are pooling together all the digits found in the ones places and the tens places in the input list.)

For example, the element at index 4 of the list might be 0.1 if the value 4 appears in 10% of the ones or tens digits of the given numbers.

In the input, in a number that is less than 10, such as 3, the tens place is implicitly zero. That is, 3 must be treated as 03. Your code should treat the tens digits of these values as zero.

Here are some example calls to the function in the interpreter and their results:

>>> ones_and_tens_digit_histogram([127, 426, 28, 9, 90])

[0.2, 0.0, 0.3, 0.0, 0.0, 0.0, 0.1, 0.1, 0.1, 0.2]

>>> ones_and_tens_digit_histogram([0, 1, 1, 2, 3, 5, 8, 13, 21, 34, 55, 89, 144, 233, 377, 610, 987, 1597, 2584, 4181, 6765])

[0.21428571428571427, 0.14285714285714285, 0.047619047619047616, 0.11904761904761904, 0.09523809523809523, 0.09523809523809523, 0.023809523809523808, 0.09523809523809523, 0.11904761904761904, 0.047619047619047616]


Hints:

To get the ones and tens digits of a number, we recommend that you review how the % and / operators work on integers. For example, what is the result of 21 % 10? Why? What is the result of 21 / 10? Why? What about 3 / 10 and 0 % 10?

In order to calculate an average for digit i, you will need to know (a) the number of times i appears in the ones or tens digit — for the numerator — and (b) the total number of digits — for the denominator. What should the denominator be? Make sure your averages aren't off by a factor of two.

Problem 3: Plot election data

Write a function called plot_iranian_least_digits_histogram that takes a histogram (as created by ones_and_tens_digit_histogram) and graphs the frequencies of the ones and tens digits for the Iranian election data. Save your plot to a file named iran-digits.png using plt.savefig. The function should return None. It is all right to have the name of the output file and labels for the graph hard-coded as strings inside of this function.

Calling:

>>> plot_iranian_least_digits_histogram(histogram)

on the histogram created from the numbers in election-iran-2009.csv should save a plot called iran-digits.png that is identical to the following plot. Don't forget the x- and y-axis labels and the legend (although the entries in the legend can come in any order, and the colors of the lines may be different). Use plt.plot for the line itself (don't forget to use the label= optional argument). To create the legend at the top right corner, use plt.legend.

Histogram of last digits of Iran election results

The Ideal line above represents the "uniform distribution" (where each digit occurs with equal frequency — that is, utterly randomly). In order to make this line, you will need a list of length 10 in which every element is 0.1.

The Iran election data are rather different from the expected flat line at y = 0.1. Are these data different enough that we can conclude that the numbers are probably fake? No — you can't tell just by looking at the graphs we have created so far. We will show more principled, statistical ways of making this determination.

Hints:

You may wish to reference the pyplot tutorial. As a hint (that is also discussed in the tutorial) be sure that the call to plt.savefig comes before any call to plt.show. If plt.savefig comes after plt.show, then the graph will be empty if you run your program from the command line (as we will be doing when we test them).

To use pyplot, add the line import matplotlib.pyplot as plt to the top of your file. Then you can call plotting methods as plt.plot, plt.show, plt.savefig, and others.

Problem 4: Smaller samples have more variation

With a small sample, the vagaries of random choice might lead to results that seem different than expected. As an example, suppose that you plotted a histogram of 20 randomly-chosen digits (10 random numbers, 2 digits per number):

Histogram of 20 random digits

That looks much worse than the Iran data, even though it is genuinely random! Of course, it would be incorrect to conclude from this experiment that the data for this plot is fraudulent and that the data for the Iranian election is genuine. Just because your observations do not seem to fit a hypothesis does not mean the hypothesis is false — it is very possible that you have not yet examined enough data to see the trend.

Write a function called plot_distribution_by_sample_size. This function creates five different collections (one of size 10, another of size 50, then 100, 1000, and 10,000) of random numbers where every element in the collection is a different random number x such that 0 <= x < 100 (Note: 100 is not included here, why not?). Then the function plots the digit histograms for each of those collections on one graph. Your function should save your plot as random-digits.png. The function should return None.

Calling:

>>> plot_distribution_by_sample_size()

should produce a graph similar to the figure above (Don't forget the title, legend, and x- and y-axis labels), HOWEVER it should instead have 5 lines, one for each of the samples (10, 50, 100, 1000, and 10,000), plus a line for the Ideal (so a total of 6 lines). Those lines should be in different colors, so that you can distinguish them, and should all be mentioned in the legend. (The legend will be so large that it may cover up some of the lines; that is OK.) Naturally, random variation will make your graph look different than this one (and it will differ from run to run of your program). It is fine if the range of the y axis varies from run to run.

Your plot demonstrates that the more datapoints there are, the closer the result is to the ideal histogram. We must take the sample size into account when deciding whether a given sample is suspicious.

Hints:

You will want to use random.randint to generate numbers in the range [0, 99], inclusive. Make sure to add import random to the top of the file, too.

You may notice that when running your program in Canopy, if you have created other plots earlier in the program or if you do not close the plotting window after you run the program, successive plots will be drawn to the same window. To avoid this, you may want to put in a call to plt.clf() at the beginning of your function.

Similarly, if you are running from the command line and the plot diagram that pops up (and pauses your program) annoys you, you may decide to comment out the calls to show earlier in your program. This is fine, but make sure to replace those calls with calls to clf. If you don't do this, you will see that the lines from the previous plot(s) are appearing on your current plot.

Problem 5: Comparing variation of samples

In the previous problem, you can visually see that some lines we have plotted are closer to the Ideal line than others. Rather than judging the closeness of two lines with our eyeballs, we would like a way to computationally determine how similar two lines are.

Specifically, we would like to determine whether the difference between lines A and B is larger or smaller than the difference between lines C and D. For this, we will define a distance metric. Given two lines, it returns a number — a distance — that is 0 if the two lines are identical, and is larger the more different the two lines are.

One common measure for the difference/distance between two datasets is the mean squared error. MSE is computed as follows:

The use of squares means that one really big difference among corresponding datapoints yields greater weight than several small differences. It also means that the distance between A and B is the same as the distance between B and A. That is, (9 - 4)2 is the same as (4 - 9)2.

For example, suppose that you had the data that appears in the following table and plot:

x f(x) g(x) h(x)
1 1 2 6
2 4 3 5
3 9 4 4

Example for mean squared error computation

The MSE difference between f and g is ((1-2)2 + (4-3)2 + (9-4)2) / 3 = 9.
The MSE difference between f and h is ((1-6)2 + (4-5)2 + (9-4)2) / 3 = 17.
The MSE difference between g and h is ((2-6)2 + (3-5)2 + (4-4)2) / 3 = 6.66666666667.

The actual values of the MSE (in this case: 9, 17 and 6.66666666667) are not interesting; it's only comparisons between them that are. For example, these numbers show that g and h are the most similar (have the smallest MSE), and f and h are the most different (have the largest MSE).

Write a function mean_squared_error that, given two lists of numbers, returns the mean squared error between the lists. Here is an example call to that function in the interpreter and its result:

>>> mean_squared_error([1, 4, 9], [6, 5, 4])
17

Statistics background

The 120 datapoints from the 2009 Iranian election were generated by some process. That process could have been (a) an actual valid election or (b) some other process, like a bureaucrat making up 120 numbers. If process (a) was used, then we might expect the data points to have uniformly-distributed ones and tens digits. Although we also know that any one set of 120 data points could look suspicious (e.g. all 120 numbers end with “77”) yet still be data from a valid election - there is a very small possibility that, by pure coincidence, a fair election could have produced these results.

If you ran your code for problem 4 several times you saw how a histogram of a sample of 10 *actual* random numbers often did not look much like the "Ideal" histogram. 100 *actual* random numbers looked more "Ideal", and 1,000 looked even more "Ideal". Results from an Iranian election with these candidates and these provinces will only ever consist of 120 data points. We shouldn't expect a histogram of 120 points to look as "Ideal" as a histogram made from 1,000 points. But what we are really wondering is, does the histogram for the reported results of the 2009 Iranian election look like a histogram we would get for a set of 120 randomly generated data points?

While we could potentially use a statistical formula to calculate how likely it is that the results of the 2009 Iranian election occured that way by chance, we will use a different approach. Instead, we will generate 10,000 samples of 120 randomly generated data points and see how the 2009 Iranian election results compare to those 10,000 samples. As in problem 4, we expect that a larger number of samples will lead to less variation (so we will use 10,000 sets of 120 numbers rather than say just 10 sets of 120 numbers). In particular we will see how the 2009 Iranian election results compare to the ideal distribution and also how the 10,000 actual random samples compare to the ideal distribution. We talk more below in problem 6 and 7 about how we will interpret this comparison.

Our methodology is as follows: We take as an assumption that the observed sample (the Iranian election data) is not fraudulent (they are the actual, unadulterated vote counts from the election) — we call this the “null hypothesis”. Our question is whether we can, with high likelihood, reject that assumption. Our specific question is, “What is the likelihood that the Iranian election data is a sample of a large unknown dataset whose least significant digits are uniformly distributed?”

Problem 6: Comparing variation of samples

Implement the function calculate_mse_with_uniform that takes a histogram (as created by ones_and_tens_digit_histogram) and returns the mean squared error of the given histogram with the uniform distribution. Invoking calculate_mse_with_uniform with the the Iranian election results histogram (for the ones and tens digits) should return the result 0.000739583333333, or approximately 0.0007. Calling the function in the interpreter you would see the following:

>>> calculate_mse_with_uniform(histogram)
0.000739583333333

Of course, this number on its own does not mean anything — we don't know whether it is unusually low, or unusually high, or about average. What MSE would 120 randomly generated data points have?

Write a function called compare_iranian_mse_to_samples that takes two inputs: the Iranian MSE (as computed by calculate_mse_with_uniform) and the number of data points in the Iranian dataset (how can you get this information? Remember that for the Iranian dataset, this should be 120 — however, do not hard-code 120 into your program.). This function will build 10,000 groups of random numbers, where each group is the same size as the Iranian election data (120 numbers), and compute the MSE with the uniform distribution for each of these groups. It will then compare each of these 10,000 MSEs to the Iranian MSE that was passed into the function as a parameter.

In particular, determine how many of the 10,000 random MSEs are larger than or equal to the Iran MSE (for our sample of the 2009 Iranian election data, the MSE is ~0.0007) , and how many of the 10,000 random MSEs are smaller than the Iran MSE. Print these values. This function should return None. With each run of your program, you should expect a slightly different outcome from this function call. Calling the function in the interpreter you would see the following (See Problem 7 below for more about the p value.):

>>> compare_iranian_mse_to_samples(0.000739583333333, 120)
Quantity of MSEs larger than or equal to the 2009 Iranian election MSE: ___
Quantity of MSEs smaller than the 2009 Iranian election MSE: ___
2009 Iranian election null hypothesis rejection level p: ___
Hints:

Some things you are asked to do in this problem are very similar to things you've done before. (For example, you've already written code to produce a random sample of some size and compute its histogram.) Rather than copying that old code into the solution for this problem (which is the wrong approach), you should move that old code into a function that you can use both in the old problem and this one. You will lose points if we see code duplicated multiple times.

It is not okay to hard-code the value 120 into your program. Hard-coding a value such as this one makes your code less flexible — in this case, if you hard-code 120, then your analysis won't be valid on any dataset other than this one.

Problem 7: Interpret your results

Interpreting statistical results

Below are some possibilities for the outcome of the previous question. They each include an explanation of how to interpret such a result. You should pay attention to the % of random MSEs that are smaller than the Iranian election MSE.

Remember that the MSEs in this function and discussed below are a measure of how different the ones_and_tens_digit_histogram for that dataset were from the ideal, uniform distribution. The greater the MSE, the more different from the ideal, and therefore the more likely the data are fraudulent.

Also, remember that the null hypothesis is that the Iranian election data are not fraudulent. We are looking to see if the Iranian election was different enough from a random distribution that we can "reject the null hypothesis" (conclude the data are fraudulent).

An interesting note about the phrase "statistically significant": suppose that you are only testing genuine, non-fraudulent elections. Then, 5% of the time, the above procedure will cause you to cry foul and make an incorrect accusation of fraud. This is called a false positive or false alarm. False positives are an inevitable risk of statistics. If you run enough different statistical tests, then by pure chance, some of them will (seem to) yield an interesting result (relevant XKCD cartoon). You can reduce the chance of such an error by reducing the 5% threshold discussed above. Doing so makes your test less sensitive; your test will more often suffer a false negative or failed alarm — a situation where there really was an effect but you missed it. That is, there really was fraud but it was not detected.

In this assignment, you have computed approximated statistical confidence via simulation: generation of random data, then comparison. This idea of performing many trials, and seeing how likely or unlikely the real data are, is at the heart of all statistics, and it is more fundamental than memorizing a set of complex formulas. This approach can be applied in many situations, including those when you do not have an explicit formula.

Interpret your results and write your answers in answers.txt. State whether the data suggest that the Iranian election results were fraudulent. Then justify your answer by giving p and comparing it to the threshhold for statistical significance, p=0.05.

Problem 8: Other datasets

We have provided you with another dataset, from the 2008 US presidential election. It appears in the file election-us-2008.csv, and is taken from Wikipedia. Consider the null hypothesis that it is a genuine dataset (and therefore should have a uniform distribution of ones and tens digits).

Update your program to include calculations for the United States 2008 presidential election in addition to the 2009 Iranian election. Use the following list of candidates:

    us_2008_candidates  = ["Obama", "McCain", "Nader", "Barr", "Baldwin", "McKinney"]

At this point you should update your program to include all of the requested output as described in the Getting Started section. However, you do not need to produce graphs or plots for the US election — just the textual output.

You should modify your program so that there is no duplication between compare_iranian_mse_to_samples and compare_us_mse_to_samples.

In answers.txt, state whether you can reject that hypothesis, and with what confidence. Briefly justify your answer. (Not to give away the answer, but if your results tell you to reject the null hypothesis and conclude that the US election was fraudulent, then there's probably a bug in your code!)

Hints:

When a datum is missing (that is, an empty space in the .csv file), your calculation should ignore that datum. Do not transform it into a zero.

The best way to remove duplication between compare_iranian_mse_to_samples and compare_us_mse_to_samples is to move your logic into a new, third function that takes several arguments (and has a useful docstring!) and have both of these functions call the third function with arguments specific to either the US or Iran.

Do not include data for "other voters" in your calculations.

Remember that, unlike the Iranian dataset, the US dataset is of some length other than 120. Make sure to use the correct length!

Submit your work

You are almost done!

Now you are done!