Philippe Sekine
Final Project Individual Proposal

Project: There's no such thing as bad publicity.

With all the news coverage of the upcoming presidential election, I thought it would be interesting to track the coverage of the candidates. The project will attempt to track the number of times each candidate is mentioned in online articles or, perhaps more usefully, the number of articles written about each candidate. We could also partition this data in different ways to gain further insight into the election and the news sources. For example, we could use the time of publication to determine whether news coverage has gone up or down for different candidates throughout the year. Alternatively, we could look at the articles written by just one news source and see whether it publishes significantly more articles about certain candidates.

The most difficult part of the project will be gathering a large data set from news sites. I believe that Nutch could be used to crawl the web and gather the source code for various sites, and we would ideally like to restrict the crawl to certain news sources (CNN, newyorktimes.com, MSNBC?, FOXnews.com?...). If these sites already compartmentalize their election coverage, the crawl could be easier.

There is a concern that if we look only at last names, we could pull in unrelated articles, and we will have to figure out a way to deal with that. There are other concerns, such as pronoun use, which could lead to an underrepresentation of the number of times a candidate is mentioned. However, journalists tend to avoid excessive pronouns anyway, since they often have no control over layout and page breaks, so I doubt it will lead to large inaccuracies.
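
As a rough sketch of the counting step, the Python below assumes the crawl has already been reduced to plain-text articles, each with a source and a publication date. Everything named here is hypothetical and only for illustration: the candidate names, the article dictionary layout, the sample data, and the functions `count_mentions` and `summarize`. The full-name check is just one possible way to handle the last-name ambiguity described above, not a settled design.

```python
import re
from collections import Counter, defaultdict
from datetime import date

# Hypothetical candidates as (first name, last name). Placeholders, not real data.
CANDIDATES = [("Alex", "Smith"), ("Jordan", "Rivera")]


def count_mentions(text):
    """Return {last_name: mention count} for one article's text."""
    counts = {}
    for first, last in CANDIDATES:
        # Assumed heuristic: only trust bare last-name matches when the
        # candidate's full name appears at least once in the same article.
        full_name = re.compile(rf"\b{re.escape(first)}\s+{re.escape(last)}\b")
        if not full_name.search(text):
            continue
        last_only = re.compile(rf"\b{re.escape(last)}\b")
        counts[last] = len(last_only.findall(text))
    return counts


def summarize(articles):
    """Aggregate total mentions and article counts, split by month and source."""
    mentions = Counter()            # total mentions per candidate
    article_counts = Counter()      # number of articles mentioning each candidate
    by_month = defaultdict(Counter)   # "YYYY-MM" -> articles per candidate
    by_source = defaultdict(Counter)  # source -> articles per candidate
    for article in articles:
        month = article["published"].strftime("%Y-%m")
        for last, n in count_mentions(article["text"]).items():
            mentions[last] += n
            article_counts[last] += 1
            by_month[month][last] += 1
            by_source[article["source"]][last] += 1
    return mentions, article_counts, by_month, by_source


# Invented sample data, only to exercise the functions above.
SAMPLE_ARTICLES = [
    {"source": "site-a", "published": date(2001, 1, 15),
     "text": "Alex Smith spoke today. Smith said turnout matters."},
    {"source": "site-a", "published": date(2001, 2, 3),
     "text": "Jordan Rivera and Alex Smith debated. Rivera led early."},
    {"source": "site-b", "published": date(2001, 2, 9),
     "text": "Local blacksmith Bob Smith opened a shop."},
]
```

Calling `summarize(SAMPLE_ARTICLES)` returns the four tallies. The third sample article contributes nothing, because "Bob Smith" never appears alongside the full name of a tracked candidate; that is the kind of unrelated article the last-name concern is about. The heuristic would still miss articles that use only a last name throughout, so it trades some recall for precision.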