At this point in the quarter, it's time to begin thinking about what to do for your final projects. While we encourage your own ideas, those of you who are stuck, or who are looking for a way to connect with faculty members in the department, may find the following ideas helpful.
Some faculty members have provided suggestions for project ideas. They are posted below. If you are interested in working on one of these ideas, let Aaron know which one you'd like to work on (via email) before 5:00 pm on Thursday 4/19 (not Monday 4/23 as stated earlier). At that point, we'll aggregate the students interested in each idea into a group and have you set up a meeting with the faculty members to develop your formal proposals. Teams of students who want to work on the same project are encouraged, but if you don't have teammates they can be arranged as well.
The projects are listed below in no particular order.
Journal paper to BibTeX citation index:
Basically, my office used to fill up with xeroxes of journal papers that I couldn't keep track of. Now my hard drive is filling up with PDFs of journal papers that I can't keep track of. I don't think I'm alone. So my problem in a nutshell is: given a PDF file of a journal article downloaded directly from the publisher, return a citation to it in BibTeX, EndNote, or another organized format.
One could try parsing the PDF and identifying the author, journal, etc., but there are 10,000 formats. A reverse approach seems easier:
A couple of pragmatic problems: some journals, e.g. Science, are starting to wrap gunk around their PDFs, including a header page, "downloaded 3/17/07", and even advertising, so some tailored PDF parsing may be needed in these cases. I don't think this is widespread yet, so it could be ignored initially, but it needs investigation to see how widespread it is. Also, journals' robots.txt files may well request that you not suck out all their PDFs. Given that the result doesn't duplicate or redistribute their content (but augments it), I don't think it would be a problem, but maybe I'm naive.
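As a rough illustration of the robots.txt concern above, here is a minimal sketch of checking a journal's rules before fetching a PDF. It uses Python's standard urllib.robotparser module; the sample rules, the "citebot" user agent, and the journal.example URLs are made up for illustration and do not describe any real publisher.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt contents; a real crawler would download the file
# from the journal's own site.
SAMPLE_ROBOTS = """\
User-agent: *
Disallow: /pdf/
"""


def allowed_to_fetch(robots_txt, user_agent, url):
    """Return True if the given robots.txt text permits user_agent to fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    for url in ("http://journal.example/pdf/123.pdf",
                "http://journal.example/abstract/123"):
        print(url, allowed_to_fetch(SAMPLE_ROBOTS, "citebot", url))
```

A project would still need to decide what to do when a fetch is disallowed, for example working only from PDFs the user has already downloaded.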
Automatic Resolution of Software Documentation to Version Numbers
Problem: When I use search engines like Google to look for information about some software package (e.g., Linux itself), what I probably want to know about is the latest version of the software. What I typically get are pages of information that give no indication what version they're talking about. Moreover, there's no easy way (that I know of) to even tell when the page was last modified, as a simple manual filter to see if it could possibly be about the package version I'm interested in. The standard Google page ranking algorithm makes this especially bad, since I want new pages and what I tend to get are old pages that have accumulated links to them. (Adding the version I want to the list of search keys isn't very useful -- most page authors don't seem to bother to say what version they're talking about, and the semantics of versioning mean I don't really want an exact match.)
Implementation suggestions: Here are some solution approaches that come to mind. I presented the problem as one applying to code, but it's easy to see that there is a more general context. I think all the suggestions could apply to either the more general context or else could be customized to specific other contexts.
Solution Approach 1: Just give me an option to inject recency into the page ranking algorithm. In fact, even just 'most recently modified first' would be helpful.
How to get recency information? There used to be some kind of last-modified option in HTTP. I don't know if it's still there, or if it's ever used. Another way is to diff the results of successive crawls -- did the page change? I'm guessing that wouldn't be too convenient for us, because we probably have only a single crawl.
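For what it's worth, Last-Modified is a standard HTTP response header, though servers are not required to send it and dynamic pages often omit it. The sketch below assumes the crawl has stored each page's response headers as a plain dict keyed by the exact name "Last-Modified", and shows one way to sort results 'most recently modified first'. The function names and sample pages are hypothetical.

```python
from datetime import datetime, timezone
from email.utils import parsedate_to_datetime

# Sentinel used to push pages with no usable date to the end of the ranking.
OLDEST = datetime.min.replace(tzinfo=timezone.utc)


def parse_last_modified(headers):
    """Return the Last-Modified time as an aware datetime, or None if unusable."""
    value = headers.get("Last-Modified")
    if not value:
        return None
    try:
        parsed = parsedate_to_datetime(value)
    except (TypeError, ValueError):
        # Older Python versions raise TypeError on bad input, newer ones ValueError.
        return None
    if parsed.tzinfo is None:
        # Assumption: treat a date without a usable zone as UTC.
        parsed = parsed.replace(tzinfo=timezone.utc)
    return parsed


def rank_by_recency(pages):
    """pages: list of (url, headers) pairs. Newest first, undated pages last."""
    return sorted(pages,
                  key=lambda page: parse_last_modified(page[1]) or OLDEST,
                  reverse=True)


if __name__ == "__main__":
    crawl = [
        ("http://example.org/a", {"Last-Modified": "Sat, 17 Mar 2007 10:00:00 GMT"}),
        ("http://example.org/b", {}),
        ("http://example.org/c", {"Last-Modified": "Thu, 19 Apr 2007 17:00:00 GMT"}),
    ]
    for url, _ in rank_by_recency(crawl):
        print(url)
```

In practice recency would probably be blended into the existing ranking score rather than used as the only sort key.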
Solution Approach 2: Do some text analysis to try to find version numbers in the page content, and give me some query syntax to talk about version number (in some sensible way -- "more recent than", "exactly", etc.)
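One possible starting point for Approach 2 is sketched below: a regular expression pulls dotted numbers out of the page text, and a small comparison function supports queries such as "more recent than" or "exactly". The pattern is a crude heuristic (it will also pick up things like section numbers), and the comparator names and zero-padding rule are assumptions made for this sketch, not a proposed query syntax.

```python
import operator
import re

# Heuristic: 2 to 4 dot-separated integers, not embedded in a longer dotted run.
VERSION_RE = re.compile(r"(?<![\d.])(\d+(?:\.\d+){1,3})(?!\.?\d)")

# Hypothetical query operators.
COMPARATORS = {
    "exactly": operator.eq,
    "at_least": operator.ge,
    "more_recent_than": operator.gt,
    "older_than": operator.lt,
}


def parse_version(text):
    """Turn '2.6.20' into (2, 6, 20)."""
    return tuple(int(part) for part in text.split("."))


def find_versions(page_text):
    """Return every version-like number found in the text, in order."""
    return [parse_version(match) for match in VERSION_RE.findall(page_text)]


def _pad(a, b):
    # Pad the shorter tuple with zeros so that 2.6 compares equal to 2.6.0.
    width = max(len(a), len(b))
    return a + (0,) * (width - len(a)), b + (0,) * (width - len(b))


def page_matches(page_text, comparator, target):
    """True if any version mentioned on the page satisfies the comparison."""
    compare = COMPARATORS[comparator]
    wanted = parse_version(target)
    for found in find_versions(page_text):
        left, right = _pad(found, wanted)
        if compare(left, right):
            return True
    return False


if __name__ == "__main__":
    print(find_versions("Tested on kernel 2.6.20 and v2.4."))
    print(page_matches("Works with release 2.6.20", "more_recent_than", "2.4"))
```

Plain tuple comparison does not handle suffixes such as "rc1" or "beta"; a real project would need a policy for those.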
Solution Approach 3: Use analysis of multiple pages to try to infer what version a particular page must be about by its content other than explicit version number. For instance, if a page with no version number info mentions interface foo(bool,int), use mention of foo(bool,int) on other pages that do have version number info to infer the earliest version this page could be about.
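Here is one reading of Approach 3 as a sketch. Pages that do state a version are used to record the earliest version at which each interface signature was seen; an unversioned page is then assumed to be about a version no earlier than the newest of the signatures it mentions. The signature pattern is a rough heuristic, and all names and sample pages are hypothetical.

```python
import re

# Heuristic: an identifier immediately followed by a parenthesized argument list.
SIGNATURE_RE = re.compile(r"\b[A-Za-z_]\w*\([^()]*\)")


def signatures(text):
    """Return the set of signatures in text, with whitespace removed."""
    return {re.sub(r"\s+", "", match) for match in SIGNATURE_RE.findall(text)}


def earliest_seen(versioned_pages):
    """versioned_pages: list of (version_tuple, text).

    Maps each signature to the lowest version of any page mentioning it.
    """
    first = {}
    for version, text in versioned_pages:
        for sig in signatures(text):
            if sig not in first or version < first[sig]:
                first[sig] = version
    return first


def earliest_possible_version(text, first):
    """Lower bound on the version an unversioned page could be about, or None."""
    known = [first[sig] for sig in signatures(text) if sig in first]
    return max(known) if known else None


if __name__ == "__main__":
    # Made-up pages with known versions.
    versioned = [
        ((1, 0), "Call foo(int) to start."),
        ((2, 0), "New: foo(bool, int) replaces foo(int)."),
        ((3, 0), "foo(bool,int) and bar() are supported."),
    ]
    first = earliest_seen(versioned)
    print(earliest_possible_version("Pass a flag: foo(bool, int)", first))
```

The bound is only as good as the coverage of the versioned pages: if the earliest page mentioning a signature was never crawled, the inferred version will be too high.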
Solution Approach 4: Use other information that might be commonly available, e.g., change logs, or even checkouts of the source itself, to try to relate a page's contents to a version number.
Table-Based Information Search
How about collecting and indexing all the HTML tables on the Web, or at least a million or so?
This would likely be something similar to a "Google Images" style idea, where it tries to use local context to impart meaning to the tables it finds.
Implementation suggestions: You can use text in the table as well as text surrounding the table to handle the search. Imagine one M/R phase that extracts the tables from the text, then another that builds the index based on items contained in the extracted tables.
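The two phases might look roughly like the following sketch, written as plain Python functions rather than against any particular MapReduce framework. It assumes reasonably well-formed HTML, skips nested tables, and indexes only cell text (not the surrounding text mentioned above). All function names and the sample page are hypothetical.

```python
from collections import defaultdict
from html.parser import HTMLParser


class TableExtractor(HTMLParser):
    """Collects each top-level <table> as a list of rows of cell strings."""

    def __init__(self):
        super().__init__()
        self.tables = []
        self._depth = 0      # nesting depth of <table> tags
        self._row = None     # cells of the row being read, if any
        self._cell = None    # text fragments of the cell being read, if any

    def handle_starttag(self, tag, attrs):
        if tag == "table":
            self._depth += 1
            if self._depth == 1:
                self.tables.append([])
        elif self._depth == 1 and tag == "tr":
            self._row = []
        elif self._depth == 1 and tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_endtag(self, tag):
        if tag == "table" and self._depth > 0:
            self._depth -= 1
        elif self._depth == 1 and tag in ("td", "th") and self._cell is not None:
            self._row.append(" ".join("".join(self._cell).split()))
            self._cell = None
        elif self._depth == 1 and tag == "tr" and self._row is not None:
            self.tables[-1].append(self._row)
            self._row = None
            self._cell = None

    def handle_data(self, data):
        if self._depth == 1 and self._cell is not None:
            self._cell.append(data)


def map_extract_tables(url, html):
    """Phase 1 map: emit ((url, table_number), rows) for each table on the page."""
    parser = TableExtractor()
    parser.feed(html)
    parser.close()
    for number, rows in enumerate(parser.tables):
        yield (url, number), rows


def map_index_terms(table_id, rows):
    """Phase 2 map: emit (term, table_id) for every word in every cell."""
    for row in rows:
        for cell in row:
            for term in cell.lower().split():
                yield term, table_id


def reduce_postings(pairs):
    """Phase 2 reduce: group table ids by term into an inverted index."""
    index = defaultdict(set)
    for term, table_id in pairs:
        index[term].add(table_id)
    return index


def build_index(pages):
    """Run both phases locally over (url, html) pairs."""
    pairs = []
    for url, html in pages:
        for table_id, rows in map_extract_tables(url, html):
            pairs.extend(map_index_terms(table_id, rows))
    return reduce_postings(pairs)


if __name__ == "__main__":
    page = ("<table><tr><th>Item</th><th>Count</th></tr>"
            "<tr><td>apples</td><td>3</td></tr></table>")
    print(dict(build_index([("http://example.org/a", page)])))
```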
You'd also need some kind of classifier that distinguishes content-tables from layout-tables. I don't think that would be too hard to write - a simple Naive Bayes classifier would probably work.
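To make the classifier suggestion concrete, here is a minimal Bernoulli Naive Bayes sketch over a few boolean features of a table. The features, the 30-character threshold, and the six training examples are invented purely for illustration; a real classifier would need hand-labeled tables from the crawl and probably better features.

```python
import math
from collections import defaultdict


def table_features(rows, has_th, has_nested_table):
    """Hypothetical boolean features for one table (rows = list of cell lists)."""
    cells = [cell for row in rows for cell in row]
    widths = {len(row) for row in rows}
    return {
        "has_th": has_th,
        "nested": has_nested_table,
        "rectangular": len(rows) >= 2 and len(widths) == 1 and min(widths) >= 2,
        "short_cells": bool(cells) and sum(len(c) for c in cells) / len(cells) < 30,
    }


class BernoulliNB:
    """Naive Bayes over boolean features with add-one (Laplace) smoothing."""

    def fit(self, examples):
        # examples: list of (features_dict, label)
        self.labels = sorted({label for _, label in examples})
        self.features = sorted({f for feats, _ in examples for f in feats})
        self.label_count = {label: 0 for label in self.labels}
        self.true_count = {label: defaultdict(int) for label in self.labels}
        for feats, label in examples:
            self.label_count[label] += 1
            for f in self.features:
                if feats.get(f, False):
                    self.true_count[label][f] += 1
        self.total = len(examples)
        return self

    def predict(self, feats):
        scores = {}
        for label in self.labels:
            n = self.label_count[label]
            score = math.log(n / self.total)  # log prior
            for f in self.features:
                p_true = (self.true_count[label][f] + 1) / (n + 2)
                score += math.log(p_true if feats.get(f, False) else 1 - p_true)
            scores[label] = score
        return max(scores, key=scores.get)


def f(has_th, nested, rectangular, short_cells):
    return {"has_th": has_th, "nested": nested,
            "rectangular": rectangular, "short_cells": short_cells}


# Toy training set, made up for this sketch.
TRAINING = [
    (f(True, False, True, True), "content"),
    (f(True, False, True, True), "content"),
    (f(False, False, True, True), "content"),
    (f(False, True, False, False), "layout"),
    (f(False, True, False, False), "layout"),
    (f(False, False, False, False), "layout"),
]

if __name__ == "__main__":
    model = BernoulliNB().fit(TRAINING)
    data_table = table_features([["Item", "Count"], ["apples", "3"]], True, False)
    print(model.predict(data_table))
```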
Relevant Course Source Code File Search
Here are a couple of searches I could have used some machine help on recently:
The sources consist of a collection of web directories plus directories on a shared NT file system plus some local files on my machine. Even something that could find all the code fragments that resemble a sample would be useful (matches similar to the MOSS system that we use to detect cheating).
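A very rough sketch of "find code fragments that resemble a sample" follows. It compares sets of character k-grams after stripping whitespace and case, and scores overlap with Jaccard similarity. This is a simplified stand-in chosen for brevity, not the algorithm MOSS itself uses, and the file names, contents, and threshold are made up.

```python
import re


def normalize(code):
    """Lowercase and drop all whitespace so formatting differences don't matter."""
    return re.sub(r"\s+", "", code.lower())


def kgrams(code, k=5):
    """Set of all length-k substrings of the normalized code."""
    text = normalize(code)
    return {text[i:i + k] for i in range(max(len(text) - k + 1, 0))}


def similarity(a, b, k=5):
    """Jaccard similarity of the two k-gram sets, between 0.0 and 1.0."""
    grams_a, grams_b = kgrams(a, k), kgrams(b, k)
    if not grams_a or not grams_b:
        return 0.0
    return len(grams_a & grams_b) / len(grams_a | grams_b)


def find_similar(sample, files, threshold=0.5, k=5):
    """files: dict of name -> text. Returns (score, name) pairs, best first."""
    scored = [(similarity(sample, text, k), name) for name, text in files.items()]
    return sorted([pair for pair in scored if pair[0] >= threshold], reverse=True)


if __name__ == "__main__":
    files = {
        "Sum.java": "for (int i = 0; i < n; i++) { sum += a[i]; }",
        "area.c": "return x * y;",
    }
    print(find_similar("for(int i=0;i<n;i++){sum+=a[i];}", files))
```

Comparing every sample against every file is quadratic, so at scale the k-grams (or hashes of them) would more likely go into an inverted index built as a cluster job.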
Not sure this counts as big-scale cluster computing, but it does seem to be beyond what Windows desktop search or other local searches can do.
This would be best if it could also search through PowerPoint files, as well as Word, PDF, HTML, .java, .c, .cpp, etc.
We've got a lot of old files and web sites from the intro courses going back at least 10 years that you could probably use as a source. We might need to scrub them to be sure they don't contain grades, student id numbers, or other confidential stuff, but that ought to be doable.
Ubiquitous Computing Data Mining
Gaetano and a post-doc have a bunch of ideas about how they could come up with some projects with large data sets from our sensor board work. There are a couple of interesting things in: