CSE574 – Project Ideas
- Visual query interface to StruQL
One of the main problems Strudel users will encounter is the need to
write long queries in a text-based query language. Developing a visual
interface to StruQL raises several interesting questions. What is the
right metaphor to use in such an interface? Can ideas be borrowed from
interfaces to traditional query languages (e.g., 'query by example'
for SQL)? Can such an interface help guide the user in creating the
site and in enforcing constraints? Such an interface is very likely to
be incorporated into Strudel and used widely.
- Restructuring web-sites created with Strudel
Etzioni and Perkowitz discuss ``Adaptive web sites''. One of their
first efforts (an AAAI-98 submission) builds index pages based on
analyzing users' visits to web-sites. This project would involve
studying the same problem, but assuming the web-site is built with
Strudel, so that the explicit representation of the site's structure
can be leveraged to create interesting restructurings.
- Modeling state in Strudel
Thus far, the model of the data in Strudel is very static. That is,
the site-graph models data, but not behavior. How can we extend the
ideas of Strudel to incorporate state? State may be required for
building web-sites that perform transactions with users, or web-sites
whose behavior depends on the browsing pattern of the user.
Incorporating forms is a simple version of this problem, on which
some work is already being done in Strudel. This project would extend
that work.
- Migration of web-sites to Strudel
This project is a little less clearly defined. One of the main
problems of Strudel is that one must create a model of a web-site, and
then build it. However, many web-sites are built in an ad-hoc way,
with no model in mind. The question to explore in this project is how
we can support a web-site builder in evolving a model for a web-site
and migrating existing parts of a web-site into such a model. This is
a very hard problem, but solutions would be very useful.
- Learning Web Site Relations
See the discussion (pages 13-15) in the Craven paper (3.6) for a list of
ideas.
- Build a web-page classification system
Using the Bayesian scheme of the Freitag paper (3.5) or one of the
learning algorithms described in the Syskill and Webert paper (5.1),
build a machine-learning system that outputs a classifier for a class
of web pages. Train it on examples from different parts of the Yahoo!
taxonomy and measure its performance.
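As a concrete starting point, one common reading of such a Bayesian
scheme is a multinomial naive Bayes learner over a bag-of-words
representation of each page. The sketch below (the class name,
tokenization, and smoothing choices are my own illustration, not taken
from the papers) shows the core of such a system:

```python
import math
import re
from collections import Counter, defaultdict

def tokenize(html):
    # Crudely strip tags and lowercase; a real system would parse HTML.
    text = re.sub(r"<[^>]+>", " ", html)
    return re.findall(r"[a-z]+", text.lower())

class NaiveBayesPageClassifier:
    """Multinomial naive Bayes over bag-of-words page representations."""

    def train(self, examples):
        # examples: list of (html, label) pairs
        self.class_counts = Counter()
        self.word_counts = defaultdict(Counter)
        self.vocab = set()
        for html, label in examples:
            self.class_counts[label] += 1
            for w in tokenize(html):
                self.word_counts[label][w] += 1
                self.vocab.add(w)

    def classify(self, html):
        total = sum(self.class_counts.values())
        best, best_score = None, float("-inf")
        for label, count in self.class_counts.items():
            # log P(label) + sum of log P(word | label), Laplace-smoothed
            score = math.log(count / total)
            n = sum(self.word_counts[label].values())
            for w in tokenize(html):
                score += math.log((self.word_counts[label][w] + 1) /
                                  (n + len(self.vocab)))
            if score > best_score:
                best, best_score = label, score
        return best
```

Training data could be gathered by crawling two or more Yahoo!
categories and labeling each page with the category it was found under.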
- Empirical study of query containment:
Query containment algorithms have so far been studied exclusively
from a theoretical perspective. However, in many more 'applied'
papers, one sees references to such algorithms. People believe that
techniques for query containment can be useful in many contexts:
optimization, information integration, determining query independence
of updates, inferring completeness of answers in the presence of
incomplete information, semantic caching, maintaining physical data
independence, view update problems, etc.
The gap between the theoretical study of these techniques and their
implementation exists largely because no one has performed an
empirical study of containment algorithms. Such a study could have
considerable impact and yield a highly useful tool.
What is needed is to implement different query containment
algorithms and study their performance. One can start with
studying the simple cases of conjunctive queries and unions thereof,
and then add interpreted predicates (e.g., less than).
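For the conjunctive-query case, containment can be decided by
searching for a containment mapping (a homomorphism from the
containing query into the contained one). A brute-force sketch, using
a query representation of my own devising (head tuple plus a list of
(predicate, arguments) atoms, with capitalized terms treated as
variables):

```python
from itertools import product

def is_var(t):
    # Convention of this sketch: capitalized terms are variables.
    return t[0].isupper()

def contained_in(q2, q1):
    """True iff q2 is contained in q1, i.e. there is a containment
    mapping sending q1's variables to q2's terms such that q1's head
    maps onto q2's head and every atom of q1 maps into q2's body."""
    head1, body1 = q1
    head2, body2 = q2
    vars1 = sorted({t for _, args in body1 for t in args if is_var(t)} |
                   {t for t in head1 if is_var(t)})
    terms2 = [t for _, args in body2 for t in args] + list(head2)
    atoms2 = {(p, tuple(args)) for p, args in body2}
    for images in product(terms2, repeat=len(vars1)):
        m = dict(zip(vars1, images))
        subst = lambda t: m.get(t, t)
        if tuple(subst(t) for t in head1) != tuple(head2):
            continue
        if all((p, tuple(subst(t) for t in args)) in atoms2
               for p, args in body1):
            return True
    return False
```

Even this tiny case hints at why an empirical study is interesting:
the problem is NP-complete in general, yet the mappings arising for
real queries may be easy to find in practice.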
The project is likely to also lead to new algorithms for query
containment, motivated by practical considerations. Also see the next idea:
- Empirical study of algorithms for rewriting queries using views:
The same as the previous project idea, except for rewriting queries using
views. Here there are additional dimensions, such as whether we are looking
for rewritings of the query that are equivalent to the original query, or
ones that are maximally-contained in the original query. Another
interesting question is the presence of completeness information about the
views. This project can also feed off the first project, though it is
possible to make the underlying containment algorithms a parameter in this
study. Furthermore, a good implementation of rewriting queries using views
can lead to an improved information integration system, improving Razor and
IM.
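To make the extra dimensions concrete, the first phase of one family
of rewriting algorithms (a bucket-style construction; the
representation below is my own sketch, not code from Razor or IM)
simply collects, for each query subgoal, the views that could cover
it, and takes the cross product of the buckets as the set of candidate
rewritings to be validated by a containment check:

```python
from itertools import product

def buckets(query_body, views):
    # For each query subgoal, collect the views whose definitions
    # mention the same predicate.
    return [[name for name, (head, body) in views.items()
             if any(p == pred for p, _ in body)]
            for pred, _ in query_body]

def candidate_rewritings(query_body, views):
    # Every choice of one view per bucket is a candidate conjunctive
    # rewriting; each must then be checked for containment in the query.
    return list(product(*buckets(query_body, views)))
```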
- Using probabilistic information in data integration:
In the projects we have discussed thus far, contents of data sources
were described qualitatively. For example, the mapping may be
specified by a set of rules defining the mediated schema in terms of
the data sources (as in Tsimmis), or as a set of queries defining the
source contents as queries over the mediated schema (as in Razor and
IM). However, suppose we are trying to find a paper on some topic in
computer science, and we have a set of sources, each described by some
area of CS it covers. Since few topics in CS are completely disjoint
from one another, all the systems described so far would have to
search *all* the possible sources. We would like to do better than
that: at the very least, to order the access to the sources. For
example, if we are searching for papers on data-integration, we
should search the sources covering AI and DB before considering those
on operating systems or architecture.
In a recent paper, Florescu et al. proposed a framework and algorithm
for adding probabilistic information to data integration. In
particular, not only does your domain model (i.e., mediated schema)
contain topics in CS, but also information about the overlap between
areas in the field. Hence, the data integration system can order the
access to the sources.
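The ordering step itself is simple once such probabilities are in
hand. A minimal sketch (the overlap table and its interpretation are
assumptions for illustration, not the actual model of Florescu et
al.):

```python
def order_sources(query_topic, sources, overlap):
    """Rank sources by the estimated probability that a source covering
    a given area contains papers on the query topic.

    sources: {source_name: area_it_covers}
    overlap: {area: {topic: estimated P(topic | area)}}
    """
    def relevance(src):
        return overlap.get(sources[src], {}).get(query_topic, 0.0)
    return sorted(sources, key=relevance, reverse=True)
```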
The project would consist of building a search engine for papers in
CS (or any other field, or, for that matter, any other domain of
interest) based on the ideas in that paper. As usual, trying to
implement ideas from a paper leads to finding many holes and
additional research problems. At the very worst, this can lead to the
best paper search engine on the web!
- Recursive planning:
Initially motivated by the [Kwok and Weld, 96] paper, [Duschka and
Levy,97] propose an algorithm for recursive planning for information
gathering. This algorithm is now used in Razor [Friedman and Weld,
97].
These works considered the idea of recursive planning solely in the
context of data integration. The idea of this project is to take a
step back and try to formalize a more general framework of recursive
planning, in the wider context of AI planning. Some questions to ask
are: (1) How can we generalize the formalization of the AI planning
problem to recursive planning? (2) How can we extend planning
algorithms to create recursive plans? (3) What are other examples of
domains in which recursive planning can yield more succinct plans?
(4) How do we distinguish recursive planning from the more general,
hopeless problem of automatic programming?
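As a tiny illustration of question (3), consider reachability over a
flight-information source, a standard example in this literature: a
recursive plan of two rules covers paths of every length, whereas a
non-recursive plan needs one conjunct per path length and is unbounded.
A naive fixpoint evaluation of such a plan (the domain and rule names
are illustrative):

```python
def reachable(flights):
    """Naively evaluate the recursive plan
         reach(X, Y) :- flight(X, Y).
         reach(X, Y) :- flight(X, Z), reach(Z, Y).
       to a fixpoint over a set of flight(from, to) tuples."""
    reach = set(flights)
    while True:
        new = {(x, y) for (x, z) in flights
                      for (z2, y) in reach if z == z2}
        if new <= reach:
            return reach
        reach |= new
```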
- Wrapper Expressiveness
Investigate the relative expressiveness
of different wrapper languages. How do the tree parsing codes (without
arbitrary PERL, of course) of Cohen's SPIRAL system (paper 1.3) compare to
HLRT and the other languages considered by Kushmerick (paper 3.3)? Could
one learn Cohen's representation?
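To get a feel for the HLRT class, here is a small interpreter for
HLRT wrappers in the spirit of Kushmerick's definition (a head
delimiter, a tail delimiter, and one left/right delimiter pair per
attribute); the toy page in the test is my own:

```python
def hlrt_extract(page, h, t, lr_pairs):
    """Extract tuples using an HLRT wrapper: skip past the head
    delimiter h, then repeatedly cut out one attribute per (left,
    right) delimiter pair, stopping at the tail delimiter t."""
    tuples = []
    i = page.index(h) + len(h)
    end = page.index(t, i)
    while True:
        j = page.find(lr_pairs[0][0], i)
        if j == -1 or j > end:
            return tuples
        row = []
        for left, right in lr_pairs:
            i = page.index(left, i) + len(left)
            k = page.index(right, i)
            row.append(page[i:k])
            i = k + len(right)
        tuples.append(tuple(row))
```

On a page listing country codes inside a `<ul>`, with names in
`<b>...</b>` and codes in `<i>...</i>`, this wrapper extracts the
(name, code) tuples while skipping bold text in the page header.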
- Query Routing
Use WordNet or some other technique (a Bayes network?) to automate
and improve query routing for a system like the Search Broker.
- AdBot
Build a system that automatically filters out unwanted
advertising banners. Use machine learning to learn a classifier that
recognizes ad banners and filters them out of a source page. The
input to the learner would be a web page; the training examples would
be the images on the page, labeled by whether or not they represent
unwanted ads. The attributes
used by the learner could include features such as the URL in the HREF
enclosing the IMG tag, the SRC of the IMG, the width and height of the
image, maybe the location on the page. (Suggestion by Tessa Lau).
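The feature-extraction pass could be sketched with the standard HTML
parser; the feature set mirrors the attributes listed above (the class
name and feature encoding are my own, not part of the suggestion):

```python
from html.parser import HTMLParser

class ImageFeatureExtractor(HTMLParser):
    """Collect, for each IMG on a page, candidate learning features:
    the SRC, the width/height attributes, and the URL of any enclosing
    HREF."""

    def __init__(self):
        super().__init__()
        self.href_stack = []
        self.features = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "a":
            self.href_stack.append(attrs.get("href"))
        elif tag == "img":
            self.features.append({
                "src": attrs.get("src"),
                "width": attrs.get("width"),
                "height": attrs.get("height"),
                "href": self.href_stack[-1] if self.href_stack else None,
            })

    def handle_endtag(self, tag):
        if tag == "a" and self.href_stack:
            self.href_stack.pop()
```

Each feature dictionary would then be converted into a training vector
for whichever learner is used.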