Assignment 7: AI vs Fake News
CSE 415: Introduction to Artificial Intelligence
The University of Washington, Seattle, Autumn 2017
Overview:

In this assignment, you'll explore the topic of text classification using AI techniques. The assignment integrates several important AI topics: natural language processing (NLP), machine learning (through classifier training), and probabilistic inference (through the Naive Bayes technique). This is an individual assignment (no partnerships).

First, you'll set up your tools and build a classifier that distinguishes textual messages about medicine from textual messages about autos. This is a fairly straightforward exercise that helps you become familiar with the tools.

After that, you'll try to solve the Fake News classification problem. It's a tough problem, and your job is not necessarily to solve it, but to determine empirically which of several standard methods seem to do better or worse on it. As the online resources suggest, this is an information-processing problem of huge importance.

This second part of the assignment is open-ended. You must use at least two different classification techniques and apply them either to the fake news dataset used in resource 2 of Part B below or to another dataset that you find yourself. As with other wicked problems, the fake news problem can be difficult to formulate, let alone solve. Some approaches are based only on the "message" part of data items, whereas others use metadata in addition to the message. Metadata may include IP addresses within the headers of messages, dates, senders' or authors' names, etc.

Part A is due Wednesday, November 29 via Catalyst CollectIt at 11:59 PM.

Part B is due Wednesday, December 6 via Catalyst CollectIt at 11:59 PM.
 

PART A: Basics of Text Classification

Your main resources for Part A are:

  1. Anaconda. (Download the Python 3.6 version.)
  2. scikit-learn. (Use Anaconda to download it.) The scikit-learn software provides access to the 20newsgroups dataset.
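As a concrete starting point (a minimal sketch, not required code), the medicine and autos messages correspond to the 'sci.med' and 'rec.autos' categories of the 20newsgroups dataset, which scikit-learn can fetch directly:

    # A minimal sketch: fetch only the medicine and autos newsgroups.
    from sklearn.datasets import fetch_20newsgroups

    categories = ['sci.med', 'rec.autos']
    train = fetch_20newsgroups(subset='train', categories=categories,
                               shuffle=True, random_state=42)
    test = fetch_20newsgroups(subset='test', categories=categories,
                              shuffle=True, random_state=42)
    print(len(train.data), 'training messages;', len(test.data), 'test messages')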

What to Turn In For Part A:

For Part A, you'll turn in a Jupyter notebook that includes a confusion matrix for medicine- and automobile-related messages as classified by a Naive Bayes classifier.
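One possible shape for that notebook (a hedged sketch, assuming the 'train' and 'test' objects loaded above) chains a count vectorizer, a TF-IDF transform, and scikit-learn's MultinomialNB, then prints the confusion matrix:

    # Sketch of the Part A pipeline: token counts -> TF-IDF -> Naive Bayes.
    from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
    from sklearn.naive_bayes import MultinomialNB
    from sklearn.pipeline import Pipeline
    from sklearn.metrics import confusion_matrix

    pipeline = Pipeline([
        ('counts', CountVectorizer()),   # raw token counts per message
        ('tfidf', TfidfTransformer()),   # reweight by inverse document frequency
        ('nb', MultinomialNB()),         # Naive Bayes over the weighted counts
    ])
    pipeline.fit(train.data, train.target)
    predicted = pipeline.predict(test.data)
    # Rows are true classes, columns are predicted classes.
    print(confusion_matrix(test.target, predicted))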

Part A is due Wednesday night, November 29.
 

PART B: Distinguishing News from Fake News

Your main resources for Part B are (in addition to the Part A resources):

  1. "How can Machine Learning and AI Help Solving the Fake News Problem?" at https://miguelmalvarez.com/2017/03/23/how-can-machine-learning-and-ai-help-solving-the-fake-news-problem/
  2. The online tutorial: "Detecting Fake News with Scikit-Learn" at https://www.datacamp.com/community/tutorials/scikit-learn-fake-news
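To give a flavor of what resource 2 walks through, here is a hedged sketch; the file name fake_or_real_news.csv and its 'text' and 'label' columns are assumptions about that tutorial's dataset, and the classifier choice is illustrative:

    # A hedged sketch in the spirit of the DataCamp tutorial (resource 2).
    import pandas as pd
    from sklearn.model_selection import train_test_split
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.linear_model import PassiveAggressiveClassifier
    from sklearn.metrics import accuracy_score

    df = pd.read_csv('fake_or_real_news.csv')   # assumed dataset file
    X_train, X_test, y_train, y_test = train_test_split(
        df['text'], df['label'], test_size=0.33, random_state=53)

    tfidf = TfidfVectorizer(stop_words='english', max_df=0.7)
    clf = PassiveAggressiveClassifier(max_iter=50)
    clf.fit(tfidf.fit_transform(X_train), y_train)
    pred = clf.predict(tfidf.transform(X_test))
    print('accuracy:', accuracy_score(y_test, pred))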

What to Turn In For Part B:

For Part B, you'll mainly turn in a report (e.g., in Word or PDF form) that describes your approach to detecting fake news articles and the results you obtained. Then, as an appendix, you'll include a Jupyter notebook that records the steps you took to investigate the issue. The report itself should include the following elements, under the same numbered headings:

Cover page that includes the title, course and assignment numbers, University of Washington, Seattle, Autumn Quarter, and your name.
1. Introduction
 Describe what the fake news problem is, in general.
2. Formulation
Your formulation of the fake news problem in terms of training data,
test data, and the kinds of information included in the data
(news articles vs. tweets, email, metadata and what kind,
advertisements, etc.)
3. Techniques Used
Name the classification methods you used.
In a paragraph, explain how the first method works.
In a second paragraph, explain how the second method works.
Compare the two methods in terms of characteristics such as
training time, ways to control overfitting, and the form of
their main data structures.
4. Training and Testing Data Used
If you used a standard dataset, name it and describe it, including
where its data comes from, the form of each item, how it was labeled,
its size in number of examples, and the average size in characters
or bytes of each example.
5. Experiments
For each classifier, how did you split the data into training and
testing sets? How many runs were there? (A sketch of one possible
setup appears after this outline.)
6. Results
Give a numeric table that includes the numbers of training and
test examples, the classification scores of each method on each
group (true positives, true negatives, false positives, false
negatives), and F scores.
7. Discussion
What are the main challenges of the fake news problem?
Which classifier seemed better or best in your experiments?
What additional information, features, or techniques can you 
imagine that would allow classifiers to do better on it?
8. Personal Retrospective
What did you learn doing this assignment? (a paragraph of 
at least 3 sentences and up to half a page)
9. References
List all the resources you used significantly, including those
mentioned in the assignment as well as others you found, either
online or elsewhere.
Appendix A
Your Jupyter notebook, as a text (or text with pictures) document
included as an extra section of the report.
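As promised in section 5 above, here is a hedged sketch of one possible experimental setup for sections 5 and 6; the classifier pair, the split size, and the placeholder 'texts' and 'labels' variables (standing for whatever dataset you chose in section 4) are illustrative, not required:

    # Illustrative sketch for Experiments/Results: compare two classifiers
    # on one split and extract the counts and F scores the table asks for.
    from sklearn.model_selection import train_test_split
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.naive_bayes import MultinomialNB
    from sklearn.svm import LinearSVC
    from sklearn.metrics import confusion_matrix, f1_score

    # 'texts' and 'labels' are placeholders for your chosen dataset.
    X_train, X_test, y_train, y_test = train_test_split(
        texts, labels, test_size=0.25, random_state=0)
    tfidf = TfidfVectorizer(stop_words='english')
    Xtr, Xte = tfidf.fit_transform(X_train), tfidf.transform(X_test)

    for name, clf in [('Naive Bayes', MultinomialNB()),
                      ('Linear SVM', LinearSVC())]:
        clf.fit(Xtr, y_train)
        pred = clf.predict(Xte)
        # For two classes (in sorted label order), ravel() yields TN, FP, FN, TP.
        tn, fp, fn, tp = confusion_matrix(y_test, pred).ravel()
        print(name, '- TP:', tp, 'TN:', tn, 'FP:', fp, 'FN:', fn,
              '- F (macro):', f1_score(y_test, pred, average='macro'))

Repeating the split with different random_state values gives multiple runs, whose scores you can average for the results table.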

Part B is due Wednesday night, December 6.
 

Updates and Corrections:

The due date for Part B was corrected from Dec. 5 to Dec. 6 (Wednesday night). If necessary, additional updates and corrections will be posted here and/or mentioned in class, in GoPost, or via the mailing list.
 

Feedback Survey

TBA