CSE 454 Project Specification: WikiTruthiness
Group members
- Katherine Baker
- David Koenig
- Aaron Miller
- Cullen Walsh
Our problem and our goal
We would like to examine how contentious different parts of Wikipedia
articles are. Wikipedia lets users view the full edit history of any
article on the site, and we are interested in identifying which
paragraphs (or possibly sentences) have seen the most reversions, edit
wars, or other indicators that a particular piece of information is
contentious.
Components of our project
Here is a list of all the components we will need to implement, in
approximate dependency order: unless otherwise noted, a task depends on
the previously listed tasks to function fully, although it may be
developed to some extent before the earlier parts are finished.
- Set up a database to store our data, and design the schema we are
going to use. This will likely happen concurrently with the scraper
(see the next point), as we cannot finalize the schema until we know
what data is being stored. (A rough schema sketch appears after this
list.)
- We need to build a scraper, the component that imports data from
Wikipedia. It will have an API that takes as input a page to import.
It will then make requests to the Wikipedia API to get the entire
history of the page, and will store that history in the database so
that we can analyze it later (likely doing some processing along the
way). This depends upon the presence of the database and the schema,
although developing this piece will likely drive much of the schema
design. (See the request-loop sketch after this list.)
- We will likely need to compute the graph of a page's history: which
paragraphs survived an edit (were not changed by it), which were
modified by the edit, and which were added or deleted by it. This
depends on the previous points, as we will need some sample data to
test against. (A paragraph-diffing sketch appears after this list.)
- We need a task queue that continuously reads names of articles to
retrieve and then calls the scraper on each of them. It will integrate
with the user interface, which knows which pages users are requesting,
and it may also take pages emitted by scraper runs. (This may be
useful, for example, for prefetching articles linked from pages being
viewed by users.) This lightly depends on the scraper, to determine
what interfaces we should call and provide. (A worker-queue sketch
appears after this list.)
- We will need to develop a user interface able to show an article and
some of its edit history. This will likely resemble viewing a
Wikipedia article directly, but will use color or other indicators to
show the contentious parts of the article (e.g. a paragraph with a
recent edit war would be a brighter red, and the red would be dimmer
if the edit war were less recent or did not last very long). This
depends on the graph-building functionality as well as the scraper
queue (as users will need to be able to ask on the fly for a
particular page). (A color-mapping sketch appears after this list.)
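To make the database point above concrete, here is a minimal sketch of
the kind of schema we have in mind, shown with SQLite purely for
illustration; the table and column choices are placeholders, and the
real schema will be driven by the scraper work.

```python
import sqlite3

# Illustrative schema only: one table of pages and one table of revisions.
SCHEMA = """
CREATE TABLE IF NOT EXISTS pages (
    page_id INTEGER PRIMARY KEY,
    title   TEXT UNIQUE NOT NULL
);

CREATE TABLE IF NOT EXISTS revisions (
    rev_id    INTEGER PRIMARY KEY,
    page_id   INTEGER NOT NULL REFERENCES pages(page_id),
    parent_id INTEGER,       -- previous revision of the same page, if any
    timestamp TEXT NOT NULL, -- ISO 8601 timestamp from the Wikipedia API
    editor    TEXT,
    comment   TEXT,
    content   TEXT           -- full wikitext of this revision
);
"""

def init_db(path="wikitruthiness.db"):
    conn = sqlite3.connect(path)
    conn.executescript(SCHEMA)
    conn.commit()
    return conn
```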
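The scraper will be built around Wikipedia's revision history API. The
sketch below shows the request loop we expect to use, assuming the
public MediaWiki `action=query` / `prop=revisions` endpoint; the batch
size and the exact fields we keep are placeholders.

```python
import requests

API_URL = "https://en.wikipedia.org/w/api.php"

def fetch_history(title, batch_size=50):
    """Yield revisions of `title`, newest first, following API continuation."""
    params = {
        "action": "query",
        "format": "json",
        "formatversion": "2",
        "prop": "revisions",
        "titles": title,
        "rvprop": "ids|timestamp|user|comment|content",
        "rvslots": "main",
        "rvlimit": str(batch_size),
    }
    while True:
        data = requests.get(API_URL, params=params, timeout=30).json()
        for page in data["query"]["pages"]:
            for rev in page.get("revisions", []):
                yield {
                    "rev_id": rev["revid"],
                    "parent_id": rev.get("parentid"),
                    "timestamp": rev["timestamp"],
                    "editor": rev.get("user"),
                    "comment": rev.get("comment", ""),
                    "content": rev["slots"]["main"].get("content", ""),
                }
        if "continue" not in data:   # no more batches to fetch
            break
        params.update(data["continue"])
```

Each yielded revision maps naturally onto a row of the revisions table
sketched above.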
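For the page-history graph, the core operation is comparing consecutive
revisions paragraph by paragraph. Here is a minimal sketch using
Python's standard difflib, treating blank lines as paragraph boundaries
(a simplification we will likely need to refine for real wikitext).

```python
import difflib

def split_paragraphs(wikitext):
    """Split wikitext into paragraphs on blank lines (a rough approximation)."""
    return [p.strip() for p in wikitext.split("\n\n") if p.strip()]

def classify_edit(old_text, new_text):
    """Compare two consecutive revisions paragraph by paragraph.

    Returns (tag, old_paragraphs, new_paragraphs) tuples, where tag is
    'equal' (the paragraphs survived), 'replace' (modified), 'delete',
    or 'insert' (added).
    """
    old_paras = split_paragraphs(old_text)
    new_paras = split_paragraphs(new_text)
    matcher = difflib.SequenceMatcher(a=old_paras, b=new_paras, autojunk=False)
    return [(tag, old_paras[i1:i2], new_paras[j1:j2])
            for tag, i1, i2, j1, j2 in matcher.get_opcodes()]
```

Chaining these comparisons across a page's full history gives the
survived/modified/added/deleted edges of the graph; repeated flips of
the same paragraph are the reversion signal we care about.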
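The task queue can start out as a simple in-process worker. The sketch
below assumes the scraper exposes a `scrape_page(title)` callable (a
name we are inventing here) that optionally returns linked titles worth
prefetching.

```python
import queue
import threading

def start_scrape_worker(scrape_page):
    """Run a background worker that scrapes article titles as they are queued."""
    tasks = queue.Queue()
    seen = set()  # best-effort dedup; a lock or a DB check may be needed later

    def enqueue(title):
        if title not in seen:
            seen.add(title)
            tasks.put(title)

    def worker():
        while True:
            title = tasks.get()
            try:
                # The scraper may hand back linked articles to prefetch.
                for linked in scrape_page(title) or []:
                    enqueue(linked)
            finally:
                tasks.task_done()

    threading.Thread(target=worker, daemon=True).start()
    return enqueue  # the UI calls this when a user requests a page
```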
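For the UI's red-intensity indicator, something along these lines could
map a paragraph's most recent edit war to a color; the decay constant
and saturation point are arbitrary placeholders to be tuned once we
have real data.

```python
import math
import time

def contention_color(war_end_epoch, war_duration_days, now=None):
    """Map a paragraph's most recent edit war to a CSS rgb() color.

    Brightness decays exponentially with time since the war ended
    (30-day time constant) and scales with how long the war lasted,
    saturating at two weeks. Both constants are placeholders.
    """
    now = time.time() if now is None else now
    days_since = max(0.0, (now - war_end_epoch) / 86400.0)
    recency = math.exp(-days_since / 30.0)
    severity = min(1.0, war_duration_days / 14.0)
    intensity = int(255 * recency * severity)
    return f"rgb({intensity}, 0, 0)"
```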
Expected schedule
Here is what we expect to have at each of the main milestones:
Milestone 1: 29 October
- Specification is complete
- Database is set up, although we may still make some schema changes
at this point
- Scraper is nearing completion
Milestone 2: 17 November
- Scraper task queue should be close to done
- Most algorithmic focus at this point is on building the graph
that represents a page
- UI development is progressing
- Start developing experiments
Code Complete: 3 December
- Experiment specifications should be done at this point; start
performing experiments
- Start writing final report
- Start developing final presentation
Presentation: 14 December
- Experiments and documentation all finished and turned in
Use of machine learning
We will need many examples of pages that are common targets of edit
wars. Most of these pages should be easy to find: Wikipedia keeps
manual lists of articles over which arbitration is taking place, and
also maintains a list of its lamest edit wars, which we can enter
manually.
Measurement of success
As one measurement, we will hold out a selection of our labeled
machine-learning data as a validation set for evaluating our results.
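A minimal sketch of the hold-out step, assuming our labeled examples
are simply a list of page titles:

```python
import random

def split_labeled_pages(titles, validation_fraction=0.2, seed=0):
    """Hold out a fraction of the labeled page titles as a validation set."""
    titles = list(titles)
    random.Random(seed).shuffle(titles)
    cut = int(len(titles) * (1 - validation_fraction))
    return titles[:cut], titles[cut:]  # (training set, validation set)
```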
In addition, we would like to ensure that the application is
sufficiently easy to use, based on interviews with users. In particular,
there should be a few users surveyed who are not from technical majors
(engineering, math, hard sciences, and the like).
We would also like to ensure that the performance of our application
is adequate. Requests should be handled within the normal timeframe of
a web request (ten seconds in the absolute worst case). The application
should make use of optimistic prefetching where appropriate.