Linguistic analysis

Due: at 11pm on Thursday, February 28.
Submit via the A8 Turn-in Page.

In this assignment, you will learn about linguistic analysis of texts to determine authenticity. You will also get practice structuring your own program — we do not make any requirements regarding what functions you create. However, we expect you to implement functions to reduce the redundancy of your program, and document the specification of each of your functions using docstrings.

This homework may appear shorter than previous assignments, however we expect it to be just as challenging. We offer our typical advice: start early, test and run your code as frequently as possible, and think carefully about algorithms before writing them in Python. Your brain is the most valuable programming tool you possess - think before you write code, and remember to approach debugging with a scientific mindset.

If you have any clarification questions about this specification, look on the class discussion board to see if your question has been answered. If it has not been answered, post it yourself!

If you have a specific question about your implementation or design, send an email to the staff mailing list.

Contents:

Introduction
Reading and cleaning the scripts
Comparing vocabulary (single words/1-grams)
Comparing phrases (pairs of words/2-grams)
- Bigram output
Verifying your solution
Submit your work

Introduction

You can compare two texts to see how they are similar or different in vocabulary, word usage, punctuation, and more. This sort of analysis has been used to investigate the authorship of the Federalist Papers, Shakespeare's writings, Primary Colors, and other literary works. In this assignment, you will use linguistic analysis to investigate modern TV shows set in the 1960s, such as Mad Men. Your objective is to determine whether a show uses language authentically (in the same way that language was actually used in the 1960s), or whether it makes mistakes that suggest that the show was created more recently.

In particular, you will search for prochronisms in TV and movie scripts. A prochronism is use of a not-yet-existing cultural artifact, such as if in 1996 someone were to dance Gangnam Style. (By contrast, an anachronism is use of an obsolete cultural artifact, such as dancing the Macarena today.)

One type of prochronism is is using a word that did not yet exist, such as “iPhone”. Another is using a phrase that did not yet exist, such as “defining moment”, which was coined in the 1970s. Sometimes the latter type is more revealing, since we hope that scriptwriters are smart enough to avoid the first type. You will determine the frequency of two-word phrases in scripts written in the 1960s and set in the 1960s, and you will compare these to the frequency of two-word phrases in scripts written in the 21st century and set in the 1960s. If a phrase is much more common in the 21st-century scripts, then it is probably a prochronism.

This assignment was inspired by the Prochronisms blog by Benjamin Schmidt, and you will use some of the same datasets. (We thank him for sharing his data.) However, your analysis will be both more simplistic and more automated than that reported on the blog.

Reading and cleaning the scripts

To begin, download and extract the homework8.zip file.

Reading the scripts

We have provided you with authentic 1960s data, in directory 1960s: a corpus (a structured set of text) of 5 movie scripts and 115 episodes of The Twilight Zone. This data is assumed to be representative of speech from the 1960s.

We have also provided you with 4 collections of movie and TV scripts written recently, but set in the 1960s. The directory 21st-century contains subdirectories Mad_Men, Pan_Am, The_Kennedys, and X-Men_First_Class, each of which contains one or more scripts from a movie or TV show. You will individually compare each of these modern corpora (plural of “corpus”) to the 1960 corpus.

Your program should be named prochronisms.py, and should take two arguments on the command line, each of which is a path to a directory. The first directory contains the authentic corpus, and the second directory contains the test corpus. Your program should read and analyze the text from all the files within these directories. For example, you might invoke your program like this:

python prochronisms.py 1960s 21st-century/Mad_Men

If your program is not passed exactly 2 command line arguments, print a single line indicating what command-line parameters your program accepts, and exit.

You may assume:

Both command-line arguments passed to your program are valid paths.
Each directory passed to your program only contains script data files, and has no subdirectories.

Hint: You will need to learn how to process command-line arguments in Python. Use sys.argv.

Hint: You will need to learn how to determine all the files that a directory contains. Use os.listdir.

Hint: You will need to learn how to join file paths together. Use the os.path.join function. For example, os.path.join("mydir", "subdir") evaluates to 'mydir/subdir' on Linux/Mac and evaluates to 'mydir\\subdir' (which is a 12-character string) on Windows. If you use os.path.join, then your code will run on any operating system. If you hard-code a Windows directory separator \ or a Linux/Mac directory separator /, then your code will only run on that operating system and will break on other operating systems (including possibly the one that we use to grade your assignment).

For more information on interacting with files in Python, see this handout.

Extracting subtitle text

Each file in the directories contains subtitles in SubRip file format. Open one of the files in a text editor to see an example of SubRip's formatting. Each section consists of:

Subtitle number [an integer]
Start time --> End time
Text of subtitle (one or more lines)
[a blank line]

Our analysis will only use subtitle text, and will ignore all other information in the files. Design and implement an algorithm for extracting subtitle text from SubRip files. You may assume:

Any line in a SubRip file that contains only numeric characters ('0', '1', ... '9') does not contain subtitle text.
Any line that contains the string '-->' does not contain subtitle text.
Blank lines do not contain subtitle text.
Any line that does not meet the previous three criteria contains subtitle text, and should be included in your analysis.

Note: Don't use the pysrt package when manipulating SubRip files. This package is not documented, nor is it installed in your Python distribution. Instead, write your own code to extract subtitle text from SubRip files.

Cleaning subtitle text

The raw subtitle text contains words or formatting that we do not wish to include in our analysis. You should remove any and all occurrences of the following strings from the subtitle text, before further processing:

<i>
</i>
<font color=#00FF00>
<font color=#00FFFF>
<font color="#00ff00">
<font color="#ff0000">
</font>
#
-
(
)
\xe2\x99\xaa
www.AllSubs.org
http://DJJ.HOME.SAPO.PT/
Downloaded
Shared
Sync
www.addic7ed.com
n17t01
"

The last entry above indicates that you should remove all double-quote characters from the text.

We recommend that you remove the entries in the order given. In particular, if you remove # first, that will convert <font color=#00FF00> into <font color=00FF00>. At that point, trying to remove <font color=#00FF00> would have no effect, and you would have to do something more complicated to complete the cleaning process.

Do not clean the data further. For example, do not remove other punctuation, and do not change capitalization.

Splitting into words

Use Python's split() function without any arguments when splitting subtitle text into words. This will split on any sequence of whitespace characters. For the purposes of this assignment, we define a “word” to be any single element produced by split() when called on subtitle text. Thus, a “word” may include punctuation, capital letters, numbers, or any other assortment of characters that appears in a subtitle. However, a word will never contain a whitespace character.

Important: Be careful when determining whether a word is contained in a corpus. For example, note that the word 'turtles' is not contained in the corpus 'I like turtles.'. However, the word 'turtles.' is.

Comparing vocabulary (single words/1-grams)

Once you have read and cleaned the subtitle text, you will compare the words (also known as “one-grams” or “1-grams”) that appear in the authentic corpus with the words appearing in the test corpus.

Your program should print:

the 50 most frequent words (1-grams) in the test corpus that do not appear in the authentic corpus, and
the 50 most frequent words (1-grams) in the authentic corpus that do not appear in the test corpus.

Print the words in order from most to least frequent, or in lexicographic order if they are equally common.

A note on performance: The Python "in" operator is substantially faster when used with a set or dictionary, as compared to a list. We strongly recommend that you use sets or dicts, not lists, when testing whether a word is contained in a corpus. For example:

very_large_list = range(50000)
very_large_set = set(very_large_list)

# This executes slowly:
for x in range(50000):
  assert (x in very_large_list)

# This executes quickly:
for x in range(50000):
  assert (x in very_large_set)

Your program output should exactly follow the format below, but with actual 1-grams and frequencies:

1-grams: Test vs. Authentic
(('Trudy',), 30)
(('Heinz',), 22)
(("Hilton's",), 22)
... Insert more data here ...
(('Jaguar',), 18)


1-grams: Authentic vs. Test
(('creator',), 58)
(('unlock',), 58)
(('Rod',), 29)
... Insert more data here ...
(('santa',), 23)

Note that each row contains a string representation of a tuple of two elements; the first being a tuple of words, the second a frequency. We recommend you construct and print this tuple in your program, to avoid having to implement the output formatting yourself.

Words that appear in the test corpus but not in the authentic corpus might be considered suspicious. If you examine these words, you may notice that there are plausible explanation for most of them; some are proper names, others are vocabulary specific to the TV shows under analysis. This is an indication that ranking solely by frequency is unlikely to uncover suspicious text.

Augment your program so that it computes, for each word in each corpus, that word's normalized frequency. The normalized frequency of a word is equal to the number of times the word occurs in a corpus, divided by the total number of words in the corpus (including duplicate words). For example, in a corpus of 10,000 words, if a word appears 42 times, its normalized frequency would be .0042.

Now, for each word that appears in both corpora, compute the ratio of its test-corpus normalized frequency to its authentic-corpus normalized frequency. Use this ratio to rank 1-grams, much like you did for frequency earlier.

For example, suppose the word "Dog" appears 20 times in the test corpus of 1000 words, and 250 times in the authentic corpus of 10000. The test-corpus normalized frequency of "Dog" is 20/1000, or 0.02. The authentic-corpus normalized frequency of "Dog" is 250/10000, or 0.025. The ratio of test-corpus normalized frequency to authentic-corpus normalized frequency is 0.02/0.025, which equals 0.8.

Your program should print (in addition to the output specified earlier):

the 50 words with the highest ratio. These are words that were used commonly in the 21st-century scripts, but were extremely rare (but did get used) in 1960.
the sum of those 50 ratios
the 50 words with the lowest ratio. These are words that were used commonly in 1960, but were rare in the 21st-century scripts.
the sum of those 50 ratios

Your program output should exactly follow the format below, but with actual 1-grams, ratios and sums:

1-gram Ratios
Largest ratios:
(('Draper,',), 34.0)
(('Draper.',), 21.750000000000004)
... Insert more data here ...
(('Los',), 14.0)
Sum: 79.41027486717723

Smallest ratios:
(('lady,',), 0.145454545454545456)
(('ought',), 0.145454545454545456)
... Insert more data here ...
(('zone.',), 0.002424858757062147)
Sum: 9.14542487835206219

Compare the output from running your program on the Mad Men, Pan Am, The Kennedys, and X-Men First Class scripts. Based on your vocabulary analysis results, rank these four scripts from most realistic to least realistic, in terms of capturing 1960s language. In file answers.txt, write your ranking and a brief explanation (no more than one paragraph) justifying your ranking.

Comparing phrases (pairs of words/2-grams)

Even if all the words used in a text are characteristic of its setting, it is possible that some words are used together in uncharacteristic ways. Next, you will perform a more sophisticated linguistic analysis, on pairs of words (also called 2-grams or bigrams) that are used immediately after one another.

For example, this sentence contains the following 2-grams:

[('For', 'example,'),
 ('example,', 'this'),
 ('this', 'sentence'),
 ('sentence', 'contains'),
 ('contains', 'the'),
 ('the', 'following'),
 ('following', '2-grams:')]

Another example is the phrase “need to”, which is overused in the TV drama Downton Abbey:

Characters in Downton Abbey say “I must” 24 times, three times as often as they say “I need to.” Books from the period, on the other hand, say “I must” three hundred times as often; going by the printed literature, the Abbey's residents should “need to” do something about once every ten seasons, not once an episode.

Repeat the analysis that you performed for single words, but now using bigrams. There are some important details that your implementation will need to address:

Bigrams do not span between files. For example, if file1.srt ends with the line “The End.”, and file2.srt begins with the line “Hello World!”, the bigram ('End.', 'Hello') should not be included in your analysis.
Bigrams span between subtitle text lines in the same file. For example, the subtitle text lines 'This is not an exercise.' and 'Not an exercise?' appear one after the other in the file Dr.Strangelove.srt. The bigram spanning these lines, ('exercise.', 'Not'), should be included in your analysis.
Your program should discard every bigram that contains a word that appears in only one corpus. For example, suppose we have two corpora, c1 and c2. Each consists of a single line, represented in Python below:

c1 = "We all live in a yellow submarine."
c2 = "A live yellow submarine lifeform in a sea."

c1 and c2 are transformed into the following sequences of bigrams:

c1_bigrams = [('We', 'all'), ('all', 'live'), ('live', 'in'), ('in', 'a'),
              ('a', 'yellow'), ('yellow', 'submarine.')]

c2_bigrams = [('A', 'live'), ('live', 'yellow'), ('yellow', 'submarine'),
              ('submarine', 'lifeform'), ('lifeform', 'in'), ('in', 'a'),
              ('a', 'sea.')]

Consider the bigram ('live', 'yellow') from c2. Because the words 'live' and 'yellow' both appear in c1, this bigram should be included in your analysis. Conversely, the bigram ('yellow', 'submarine.') from c1 should be excluded from your analysis, because the word 'submarine.' does not appear in c2. Note that 'submarine' (without a period) does appear in c2.

Here are the bigrams from each corpus that should be included in your analysis:

c1_analysis = [('live', 'in'), ('in', 'a'), ('a', 'yellow')]

c2_analysis = [('live', 'yellow'), ('in', 'a')]

Hint: Earlier, you analyzed the set of words appearing in each corpus. It may be helpful to re-use this data when deciding what bigrams to exclude.

Bigram output

Your program should print, (in addition to the output specified earlier):

the 50 most frequent bigrams in the test corpus that do not appear in the authentic corpus, and
the 50 most frequent bigrams in the authentic corpus that do not appear in the test corpus.

Print the bigrams in order from most to least frequent, or if they are equally common, in lexicographical order by the first word in the bigram, then the second. Note that this is the default sorting order of tuples containing strings in Python.

Your program output should follow the format below (but the below data is made up, and your program will print actual 2-grams and frequencies):

2-gram Frequency: Test vs. Authentic
(('Previously', 'on'), 31)
(('I', 'supposed'), 20)
... Insert more data here ...
(('Don', 'and'), 14)

2-gram Frequency: Authentic vs. Test
(('twilight', 'zone.'), 289)
(('dimension', 'of'), 95)
... Insert more data here ...
(('sight', 'and'), 77)

When calculating the normalized frequency of a bigram, divide by the number of (non-discarded) bigrams, not by the number of words.

Your program should print (in addition to the output specified earlier):

the 50 bigrams with the highest ratio of test-corpus normalized frequency to authentic-corpus normalized frequency.
the sum of the 50 ratios.
the 50 bigrams with the lowest ratio.
the sum of the 50 ratios.

Your program output should exactly follow the format below, but with actual 2-grams, ratios and sums:

2-gram Ratios
Largest ratio:
(("That's", 'true.'), 27.0)
(("I'm", 'supposed'), 24.999999999999996)
... Insert more data here ...
(('the', 'train.'), 16.0)
Total:  1047.1477666666666

Smallest ratio:
(("it's", 'only'), 0.05)
(('you', 'see?'), 0.05)
... Insert more data here ...
(('a', 'most'), 0.02777777777777778)
Total:  2.04881766666662

Compare the output from running your program on the Mad Men, Pan Am, The Kennedys, and X-Men First Class scripts. Based on the bigram analysis results (only), rank these four scripts from most realistic to least realistic, in terms of capturing 1960s language. In file answers.txt, write your ranking and a brief explanation (no more than one paragraph) justifying your ranking.

Compare the vocabulary analysis to the bigram analysis. Were the results consistent? Why or why not? What sort of mistakes is each one most effective at discovering? Which one do you think is more accurate? In file answers.txt, write a brief explanation (no more than two paragraphs, but one will suffice).

Finally, suggest three ways in which the analysis could be improved. These are not improvements to your code, but improvements to the specification that your code implements. For each change, give one sentence explaining the change, one sentence stating how the change might improve the analysis, and one sentence stating how the change might degrade the analysis. Be sure to be concrete about how you would implement the change; don't just state a goal such as “remove non-useful words” without explaining how you would do that or what a “non-useful word” is.

Here is an example of an acceptable answer (you should suggest three changes that are different from this one):

Standardize spelling, by lowercasing every word and removing all punctuation. This would ensure that occurrences of a word at the beginning, middle, and end of a sentence or clause would be treated uniformly, rather than creating different words with different capitalization or punctuation. A disadvantage is that this would create bigrams across sentences, which are not really co-occurring uses of the two words.

Verifying your solution

To help you validate the correctness of your implementation, we have provided you with the expected output when your program is run on the following data:

python prochronisms.py 21st-century/Mad_Men/ 21st-century/Pan_Am/

The expected output can be found here. You may find this online text comparison tool useful when determining whether your output matches ours.

Your program output must exactly match the expected output above. Producing this output does not necessarily mean your program is free of bugs, and is not a substitute for thoroughly testing and validating your code. However, it is a good indication that you are on the right track.

Submit your work

You are almost done!

Look over your work to ensure that you used good coding style: functions as appropriate, no unnecessary repetition, clear documentation, appropriate use of global constants, etc.

Write the entire output of your program when comparing Mad Men with the 1960s corpus to the file output.txt. This can be easily done by executing the following command at the command-line shell:

python prochronisms.py 1960s 21st-century/Mad_Men > output.txt

At the bottom of your answers.txt file, in the “Collaboration” part, state which students or other people (besides the course staff) helped you with the assignment, or that no one did.

At the bottom of your answers.txt file, in the “Reflection” part, reflect on this assignment. What did you learn from this assignment? What do you wish you had known before you started? What would you do differently? What advice would you offer to future students?

Answer a survey about how much time you spent on this assignment.

Submit the following files via the A8 Turn-in Page.

prochronisms.py
answers.txt
output.txt