CSE 454 Project Specification: WikiTruthiness
Group members
- Katherine Baker
- David Koenig
- Aaron Miller
- Cullen Walsh
Our problem and our goal
We would like to examine how contentious different parts of Wikipedia
articles are. Wikipedia lets users view the full edit history of any
article on the site, and we are interested in identifying which
paragraphs (or possibly sentences) have seen the most reversions, edit
wars, or other indicators that a particular piece of information is
contentious.
Components of our project
Here is a list of all the components we will need to implement, in
approximate dependency order: unless otherwise noted, a task depends on
the previously listed tasks to function fully, although it may be
developed to some extent before the earlier parts are finished.
- Set up a database to store our data, and design the schema we are
going to use. This will likely happen concurrently with the scraper
(see the next point), as we cannot finalize the schema until we know
what data is being stored. (A rough schema sketch appears after this
list.)
- We need to build a scraper, the component that imports data from
Wikipedia. It will have an API that takes as input a page to import.
It will then make requests to the Wikipedia API to get the entire
history of the page, and will store that history in the database so
that we can analyze it later (likely doing some processing along the
way). This depends upon the presence of the database and the schema,
although developing this piece will likely drive much of the schema
design. (See the request-loop sketch after this list.)
- We will likely need to compute the graph of a page's history: which
paragraphs survived an edit (were not changed by it), which were
modified by the edit, and which were added or deleted by it. This
depends on the previous points, as we will need some sample data to
test against. (A paragraph-diffing sketch appears after this list.)
- We need a task queue that continuously reads names of articles to
retrieve and then calls the scraper on each of them. It will integrate
with the user interface, which knows which pages users are requesting,
and it may also take pages emitted by scraper runs. (This may be
useful, for example, for prefetching articles linked from pages being
viewed by users.) This lightly depends on the scraper, to determine
what interfaces we should call and provide. (A worker-queue sketch
appears after this list.)
- We will need to develop a user interface able to show an article and
some of its edit history. This will likely resemble viewing a
Wikipedia article directly, but will use color or other indicators to
show the contentious parts of the article (e.g. a paragraph with a
recent edit war would be a brighter red, and the red would be dimmer
if the edit war were less recent or did not last very long). This
depends on the graph-building functionality as well as the scraper
queue (as users will need to be able to ask on the fly for a
particular page). (A color-mapping sketch appears after this list.)
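To make the database point above concrete, here is a minimal sketch of
the kind of schema we have in mind, shown with SQLite purely for
illustration; the table and column choices are placeholders, and the
real schema will be driven by the scraper work.

```python
import sqlite3

# Illustrative schema only: one table of pages and one table of revisions.
SCHEMA = """
CREATE TABLE IF NOT EXISTS pages (
    page_id INTEGER PRIMARY KEY,
    title   TEXT UNIQUE NOT NULL
);

CREATE TABLE IF NOT EXISTS revisions (
    rev_id    INTEGER PRIMARY KEY,
    page_id   INTEGER NOT NULL REFERENCES pages(page_id),
    parent_id INTEGER,       -- previous revision of the same page, if any
    timestamp TEXT NOT NULL, -- ISO 8601 timestamp from the Wikipedia API
    editor    TEXT,
    comment   TEXT,
    content   TEXT           -- full wikitext of this revision
);
"""

def init_db(path="wikitruthiness.db"):
    conn = sqlite3.connect(path)
    conn.executescript(SCHEMA)
    conn.commit()
    return conn
```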
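The scraper will be built around Wikipedia's revision history API. The
sketch below shows the request loop we expect to use, assuming the
public MediaWiki `action=query` / `prop=revisions` endpoint; the batch
size and the exact fields we keep are placeholders.

```python
import requests

API_URL = "https://en.wikipedia.org/w/api.php"

def fetch_history(title, batch_size=50):
    """Yield revisions of `title`, newest first, following API continuation."""
    params = {
        "action": "query",
        "format": "json",
        "formatversion": "2",
        "prop": "revisions",
        "titles": title,
        "rvprop": "ids|timestamp|user|comment|content",
        "rvslots": "main",
        "rvlimit": str(batch_size),
    }
    while True:
        data = requests.get(API_URL, params=params, timeout=30).json()
        for page in data["query"]["pages"]:
            for rev in page.get("revisions", []):
                yield {
                    "rev_id": rev["revid"],
                    "parent_id": rev.get("parentid"),
                    "timestamp": rev["timestamp"],
                    "editor": rev.get("user"),
                    "comment": rev.get("comment", ""),
                    "content": rev["slots"]["main"].get("content", ""),
                }
        if "continue" not in data:   # no more batches to fetch
            break
        params.update(data["continue"])
```

Each yielded revision maps naturally onto a row of the revisions table
sketched above.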
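For the page-history graph, the core operation is comparing consecutive
revisions paragraph by paragraph. Here is a minimal sketch using
Python's standard difflib, treating blank lines as paragraph boundaries
(a simplification we will likely need to refine for real wikitext).

```python
import difflib

def split_paragraphs(wikitext):
    """Split wikitext into paragraphs on blank lines (a rough approximation)."""
    return [p.strip() for p in wikitext.split("\n\n") if p.strip()]

def classify_edit(old_text, new_text):
    """Compare two consecutive revisions paragraph by paragraph.

    Returns (tag, old_paragraphs, new_paragraphs) tuples, where tag is
    'equal' (the paragraphs survived), 'replace' (modified), 'delete',
    or 'insert' (added).
    """
    old_paras = split_paragraphs(old_text)
    new_paras = split_paragraphs(new_text)
    matcher = difflib.SequenceMatcher(a=old_paras, b=new_paras, autojunk=False)
    return [(tag, old_paras[i1:i2], new_paras[j1:j2])
            for tag, i1, i2, j1, j2 in matcher.get_opcodes()]
```

Chaining these comparisons across a page's full history gives the
survived/modified/added/deleted edges of the graph; repeated flips of
the same paragraph are the reversion signal we care about.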
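The task queue can start out as a simple in-process worker. The sketch
below assumes the scraper exposes a `scrape_page(title)` callable (a
name we are inventing here) that optionally returns linked titles worth
prefetching.

```python
import queue
import threading

def start_scrape_worker(scrape_page):
    """Run a background worker that scrapes article titles as they are queued."""
    tasks = queue.Queue()
    seen = set()  # best-effort dedup; a lock or a DB check may be needed later

    def enqueue(title):
        if title not in seen:
            seen.add(title)
            tasks.put(title)

    def worker():
        while True:
            title = tasks.get()
            try:
                # The scraper may hand back linked articles to prefetch.
                for linked in scrape_page(title) or []:
                    enqueue(linked)
            finally:
                tasks.task_done()

    threading.Thread(target=worker, daemon=True).start()
    return enqueue  # the UI calls this when a user requests a page
```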
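For the UI's red-intensity indicator, something along these lines could
map a paragraph's most recent edit war to a color; the decay constant
and saturation point are arbitrary placeholders to be tuned once we
have real data.

```python
import math
import time

def contention_color(war_end_epoch, war_duration_days, now=None):
    """Map a paragraph's most recent edit war to a CSS rgb() color.

    Brightness decays exponentially with time since the war ended
    (30-day time constant) and scales with how long the war lasted,
    saturating at two weeks. Both constants are placeholders.
    """
    now = time.time() if now is None else now
    days_since = max(0.0, (now - war_end_epoch) / 86400.0)
    recency = math.exp(-days_since / 30.0)
    severity = min(1.0, war_duration_days / 14.0)
    intensity = int(255 * recency * severity)
    return f"rgb({intensity}, 0, 0)"
```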
Expected schedule
Here is what we expect to have at each of the main milestones:
Milestone 1: 29 October
- Specification is complete
- Database is set up, although we may still make some schema changes
at this point
- Scraper is nearing completion
Milestone 2: 17 November
- Scraper task queue should be close to done
- Most algorithmic focus at this point is on building the graph
that represents a page
- UI development is progressing
- Start developing experiments
Code Complete: 3 December
- Experiment specifications should be done at this point; start
performing experiments
- Start writing final report
- Start developing final presentation
Presentation: 14 December
- Experiments and documentation all finished and turned in
Use of machine learning
We will need many examples of pages that are common targets of edit
wars. Most of these pages should be easy to find: Wikipedia keeps
manual lists of articles over which arbitration is taking place, and
also maintains a list of its lamest edit wars, which we can enter
manually.
Measurement of success
As one measurement, we will hold out a selection of our labeled
machine-learning data as a validation set for evaluating our results.
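A minimal sketch of the hold-out step, assuming our labeled examples
are simply a list of page titles:

```python
import random

def split_labeled_pages(titles, validation_fraction=0.2, seed=0):
    """Hold out a fraction of the labeled page titles as a validation set."""
    titles = list(titles)
    random.Random(seed).shuffle(titles)
    cut = int(len(titles) * (1 - validation_fraction))
    return titles[:cut], titles[cut:]  # (training set, validation set)
```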
In addition, we would like to ensure that the application is
sufficiently easy to use, based on interviews with users. In particular,
there should be a few users surveyed who are not from technical majors
(engineering, math, hard sciences, and the like).
We would also like to ensure that the performance of our application
is adequate. Requests should be handled within the normal timeframe of
a web request (ten seconds in the absolute worst case). The application
should make use of optimistic prefetching where appropriate.