Assignment 7: AI vs Fake News
CSE 415: Introduction to Artificial Intelligence, The University of Washington, Seattle, Autumn 2017
Overview:
In this assignment, you'll explore the topic of text classification using AI techniques. The assignment integrates several important AI topics: natural language processing (NLP), machine learning (through classifier training), and probabilistic inference (through the Naive Bayes technique). This is an individual-work assignment (no partnerships).

First, you'll set up your tools and build a classifier that distinguishes textual messages about medicine from textual messages about autos. This is a fairly straightforward exercise that helps you become familiar with the tools.

After that, you'll try to solve the fake news classification problem. It's a tough problem, and your job is not necessarily to solve it, but to determine empirically which among some standard methods do better or worse on it. As the online resources suggest, this is an information-processing problem of huge importance.
This second part of the assignment is open-ended. You must use at least
two different classification techniques and apply them to either
the fake news dataset listed in resource b below or another
dataset that you find yourself. As with other wicked problems,
the fake news problem can be difficult to formulate, let alone solve.
Some approaches are based only on the "message" part of data items,
whereas others use metadata in addition to the message. Metadata
may include IP addresses within the headers of messages, dates,
senders' or authors' names, etc.
Part A is due Wednesday, November 29 via Catalyst CollectIt at 11:59 PM.
Part B is due Wednesday, December 6 via Catalyst CollectIt at 11:59 PM.
PART A:
Basics of Text Classification
Your main resources for Part A are:
What to Turn In For Part A:
For Part A, you'll turn in a Jupyter notebook that includes a confusion matrix for medicine-related and automobile-related messages as classified by a Naive Bayes classifier.
Part A is due Wednesday night, November 29.
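To make the Part A task concrete, here is a minimal sketch of a multinomial Naive Bayes classifier with Laplace (add-one) smoothing and a small confusion matrix. The tiny corpus, labels, and function names below are all made-up stand-ins for the actual medicine/autos messages and tools, not part of the assignment's resources.

```python
import math
from collections import Counter

# Hypothetical stand-in for the medicine/autos training messages.
train = [
    ("the doctor prescribed a new medicine for the patient", "med"),
    ("patients with symptoms should see a doctor for treatment", "med"),
    ("the clinic offers treatment and medicine to every patient", "med"),
    ("the engine in this car needs new brakes and tires", "auto"),
    ("i replaced the brakes and changed the engine oil", "auto"),
    ("this car dealer sells tires engines and oil", "auto"),
]
test = [
    ("the doctor gave the patient medicine", "med"),
    ("my car engine needs oil and new tires", "auto"),
]

def train_nb(data):
    """Return class log-priors and Laplace-smoothed word log-likelihoods."""
    class_counts = Counter(label for _, label in data)
    word_counts = {c: Counter() for c in class_counts}
    for text, label in data:
        word_counts[label].update(text.split())
    vocab = {w for c in word_counts for w in word_counts[c]}
    priors = {c: math.log(n / len(data)) for c, n in class_counts.items()}
    loglik = {}
    for c, counts in word_counts.items():
        total = sum(counts.values()) + len(vocab)   # add-one smoothing
        loglik[c] = {w: math.log((counts[w] + 1) / total) for w in vocab}
    return priors, loglik, vocab

def classify(text, priors, loglik, vocab):
    """Pick the class maximizing log P(class) + sum of word log-likelihoods."""
    scores = {}
    for c in priors:
        s = priors[c]
        for w in text.split():
            if w in vocab:          # ignore words never seen in training
                s += loglik[c][w]
        scores[c] = s
    return max(scores, key=scores.get)

priors, loglik, vocab = train_nb(train)
labels = sorted(priors)
# Confusion matrix: (true label, predicted label) -> count.
matrix = {(t, p): 0 for t in labels for p in labels}
for text, truth in test:
    matrix[(truth, classify(text, priors, loglik, vocab))] += 1
print(matrix)
```

On this toy data both test messages land on the diagonal of the matrix; with the real newsgroup-style messages you would build the same table from a much larger test set.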
PART B:
Distinguishing News from Fake News
Your main resources for Part B are (in addition to the Part A resources):
What to Turn In For Part B:
For Part B, You'll mainly turn in a report (e.g., in Word or PDF form) that describes your approach to detecting fake news articles and what results you got. Then as an appendix, you'll include a Jupyter notebook that records the steps you took to investigate the issue. The report itself should include these elements, and using the same numbered headings as follows: Cover page that includes title, course and assignment numbers, University of Washington, Seattle, Autumn Quarter, and your name. 1. Introduction Describe what the fake news problem is, in general. 2. Formulation Your formulation of the fake news problem in terms of training data, test data, what kinds of information are included in the data (news articles, vs Twitter tweets, email, metadata and what kind, advertisements, etc.) 3. Techniques Used Name the classification methods you used. Explain the first method; in a paragraph describe how it works. Explain the second method; in a paragraph describe how it works, too. Compare the two methods in terms of characteristics such as training time, ways to control overfitting, and how the main data structures compare in form. 4. Training and Testing Data Used If you used a standard database, name it and describe it, including where its data comes from, the form of each item, how if was labeled, its size in number of examples, and average size in number of characters or bytes for each example. 5. Experiments For each classifier, how did you split the data into training and testing sets? How many runs were there? 6. Results Give a numeric table that includes numbers of training and test examples, classification scores on each group by each method (true positives, true negatives, false positives, false negatives), and F scores. 7. Discussion What are the main challenges of the fake news problem? Which classifier seemed better or best in your experiments? 
What additional information, features, or techniques can you imagine that would allow classifiers to do better on it? 8. Personal Retrospective What did you learn doing this assignment? (a paragraph of at least 3 sentences and up to half a page) 9. References List all the resources you used significantly, including those mentioned in the assignment as well as others you found, either online or elsewhere. Appendix A Your Jupyter notebook, as a text (or text with pictures) document included as an extra section of the report.
Part B is due Wednesday night, December 6.
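Sections 5 and 6 ask for train/test splits and for F scores computed from true/false positive and negative counts. As a quick reference, here is a hedged sketch of both; the function names, the seed, and the example counts are made up for illustration and are not prescribed by the assignment.

```python
import random

def train_test_split(items, test_fraction=0.2, seed=0):
    """Shuffle a labeled dataset and split off a test portion (hypothetical helper)."""
    items = list(items)
    random.Random(seed).shuffle(items)   # fixed seed makes runs repeatable
    cut = int(len(items) * (1 - test_fraction))
    return items[:cut], items[cut:]

def prf1(tp, fp, fn):
    """Precision, recall, and F1 from confusion-matrix counts."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1

# Hypothetical counts from one fake-news test run:
precision, recall, f1 = prf1(tp=40, fp=10, fn=20)

train_set, test_set = train_test_split(range(100), test_fraction=0.2)
```

Reporting F1 alongside the raw counts is useful here because fake/real news datasets are often imbalanced, and F1 penalizes a classifier that simply predicts the majority class.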
Updates and Corrections:
The due date for Part B was corrected from Dec. 5 to Dec. 6 (Wednesday night).
If necessary, additional updates and corrections will be posted here and/or mentioned in class, in GoPost,
or via the mailing list.
Feedback Survey
TBA