CSE574 – Project Ideas
- Visual query interface to StruQL
One of the main problems Strudel users will encounter is the need to
write long queries in a text-based query language. Developing a visual
interface to StruQL raises several interesting questions. What is the
right metaphor to use in such an interface? Can ideas be borrowed from
interfaces to traditional query languages (e.g., 'query by example'
for SQL)? Can such an interface help guide the user in creating the
site and in enforcing constraints? Such an interface is very likely to
be incorporated into Strudel and used widely.
- Restructuring web-sites created with Strudel
Etzioni and Perkowitz discuss ``Adaptive web sites''. One of their
first efforts (an AAAI-98 submission) builds index pages based on
analyzing users' visits to web-sites. This project would involve
studying the same problem, but assuming the web-site is built with
Strudel, so that the explicit representation of the site's structure
can be leveraged to create interesting restructurings.
- Modeling state in Strudel
Thus far, the model of the data in Strudel is very static. That is,
the site-graph models data, but not behavior. How can we extend the
ideas of Strudel to incorporate state? State may be required for
building web-sites that perform transactions with users, or web-sites
whose behavior depends on the browsing pattern of the user.
Incorporating forms is a simple version of this problem, on which
some work is already being done in Strudel. This project would extend
that work.
- Migration of web-sites to Strudel
This project is a little less clearly defined. One of the main
problems of Strudel is that one must create a model of a web-site, and
then build it. However, many web-sites are built in an ad-hoc way,
with no model in mind. The question to explore in this project is how
we can support a web-site builder in evolving a model for a web-site
and migrating existing parts of a web-site into such a model. This is
a very hard problem, but solutions would be very useful.
- Learning Web Site Relations
See the discussion (pages 13-15) in the Craven paper (3.6) for a list of
ideas.
- Build a web-page classification system
Using the Bayesian scheme of the Freitag paper (3.5) or one of the
learning algorithms described in the Syskill and Webert paper (5.1),
build a machine-learning system that outputs a classifier for a class
of web pages. Train it on examples from different parts of the Yahoo!
taxonomy and measure its performance.
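As a concrete starting point, one common reading of such a Bayesian
scheme is a multinomial naive Bayes learner over a bag-of-words
representation of each page. The sketch below (the class name,
tokenization, and smoothing choices are my own illustration, not taken
from the papers) shows the core of such a system:

```python
import math
import re
from collections import Counter, defaultdict

def tokenize(html):
    # Crudely strip tags and lowercase; a real system would parse HTML.
    text = re.sub(r"<[^>]+>", " ", html)
    return re.findall(r"[a-z]+", text.lower())

class NaiveBayesPageClassifier:
    """Multinomial naive Bayes over bag-of-words page representations."""

    def train(self, examples):
        # examples: list of (html, label) pairs
        self.class_counts = Counter()
        self.word_counts = defaultdict(Counter)
        self.vocab = set()
        for html, label in examples:
            self.class_counts[label] += 1
            for w in tokenize(html):
                self.word_counts[label][w] += 1
                self.vocab.add(w)

    def classify(self, html):
        total = sum(self.class_counts.values())
        best, best_score = None, float("-inf")
        for label, count in self.class_counts.items():
            # log P(label) + sum of log P(word | label), Laplace-smoothed
            score = math.log(count / total)
            n = sum(self.word_counts[label].values())
            for w in tokenize(html):
                score += math.log((self.word_counts[label][w] + 1) /
                                  (n + len(self.vocab)))
            if score > best_score:
                best, best_score = label, score
        return best
```

Training data could be gathered by crawling two or more Yahoo!
categories and labeling each page with the category it was found under.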
- Empirical study of query containment:
Query containment algorithms have so far been studied exclusively
from a theoretical perspective. However, in many more 'applied'
papers, one sees references to such algorithms. People believe that
techniques for query containment can be useful in many contexts:
optimization, information integration, determining query independence
of updates, inferring completeness of answers in the presence of
incomplete information, semantic caching, maintaining physical data
independence, view update problems, etc.
The gap between the theoretical study of these techniques and their
implementation exists largely because no one has performed an
empirical study of containment algorithms. Such a study could have
considerable impact and yield a highly useful tool.
What is needed is to implement different query containment
algorithms and study their performance. One can start with
studying the simple cases of conjunctive queries and unions thereof,
and then add interpreted predicates (e.g., less than).
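For the conjunctive-query case, containment can be decided by
searching for a containment mapping (a homomorphism from the
containing query into the contained one). A brute-force sketch, using
a query representation of my own devising (head tuple plus a list of
(predicate, arguments) atoms, with capitalized terms treated as
variables):

```python
from itertools import product

def is_var(t):
    # Convention of this sketch: capitalized terms are variables.
    return t[0].isupper()

def contained_in(q2, q1):
    """True iff q2 is contained in q1, i.e. there is a containment
    mapping sending q1's variables to q2's terms such that q1's head
    maps onto q2's head and every atom of q1 maps into q2's body."""
    head1, body1 = q1
    head2, body2 = q2
    vars1 = sorted({t for _, args in body1 for t in args if is_var(t)} |
                   {t for t in head1 if is_var(t)})
    terms2 = [t for _, args in body2 for t in args] + list(head2)
    atoms2 = {(p, tuple(args)) for p, args in body2}
    for images in product(terms2, repeat=len(vars1)):
        m = dict(zip(vars1, images))
        subst = lambda t: m.get(t, t)
        if tuple(subst(t) for t in head1) != tuple(head2):
            continue
        if all((p, tuple(subst(t) for t in args)) in atoms2
               for p, args in body1):
            return True
    return False
```

Even this tiny case hints at why an empirical study is interesting:
the problem is NP-complete in general, yet the mappings arising for
real queries may be easy to find in practice.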
The project is likely to also lead to new algorithms for query
containment, motivated by practical considerations. Also see the next idea:
- Empirical study of algorithms for rewriting queries using views:
The same as the previous project idea, except for rewriting queries using
views. Here there are additional dimensions, such as whether we are looking
for rewritings of the query that are equivalent to the original query, or
ones that are maximally-contained in the original query. Another
interesting question is the presence of completeness information about the
views. This project can also feed off the first project, though it is
possible to make the underlying containment algorithms a parameter in this
study. Furthermore, a good implementation of rewriting queries using views
can lead to an improved information integration system, improving Razor and
IM.
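To make the extra dimensions concrete, the first phase of one family
of rewriting algorithms (a bucket-style construction; the
representation below is my own sketch, not code from Razor or IM)
simply collects, for each query subgoal, the views that could cover
it, and takes the cross product of the buckets as the set of candidate
rewritings to be validated by a containment check:

```python
from itertools import product

def buckets(query_body, views):
    # For each query subgoal, collect the views whose definitions
    # mention the same predicate.
    return [[name for name, (head, body) in views.items()
             if any(p == pred for p, _ in body)]
            for pred, _ in query_body]

def candidate_rewritings(query_body, views):
    # Every choice of one view per bucket is a candidate conjunctive
    # rewriting; each must then be checked for containment in the query.
    return list(product(*buckets(query_body, views)))
```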
- Using probabilistic information in data integration:
In the projects we have discussed thus far, contents of data sources
were described qualitatively. For example, the mapping may be
specified by a set of rules defining the mediated schema in terms of
the data sources (as in Tsimmis), or as a set of queries defining the
source contents as queries over the mediated schema (as in Razor and
IM). However, suppose we are trying to find a paper on some topic in
computer science, and we have a set of sources, each described by some
area of CS it covers. Since few topics in CS are completely disjoint
from one another, all the systems described so far would have to
search *all* the possible sources. We would like to do better than
that: at the very least, to order the access to the sources. For
example, if we are searching for papers on data-integration, we
should search the sources covering AI and DB before considering those
on operating systems or architecture.
In a recent paper, Florescu et al. proposed a framework and algorithm
for adding probabilistic information to data integration. In
particular, not only does your domain model (i.e., mediated schema)
contain topics in CS, but also information about the overlap between
areas in the field. Hence, the data integration system can order the
access to the sources.
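The ordering step itself is simple once such probabilities are in
hand. A minimal sketch (the overlap table and its interpretation are
assumptions for illustration, not the actual model of Florescu et
al.):

```python
def order_sources(query_topic, sources, overlap):
    """Rank sources by the estimated probability that a source covering
    a given area contains papers on the query topic.

    sources: {source_name: area_it_covers}
    overlap: {area: {topic: estimated P(topic | area)}}
    """
    def relevance(src):
        return overlap.get(sources[src], {}).get(query_topic, 0.0)
    return sorted(sources, key=relevance, reverse=True)
```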
The project would consist of building a search engine for papers in
CS (or any other field, or, for that matter, any other domain of
interest) based on the ideas in that paper. As usual, trying to
implement ideas from a paper leads to finding many holes and
additional research problems. At the very worst, this can lead to the
best paper search engine on the web!
- Recursive planning:
Initially motivated by the [Kwok and Weld, 96] paper, [Duschka and
Levy,97] propose an algorithm for recursive planning for information
gathering. This algorithm is now used in Razor [Friedman and Weld,
97].
These works considered the idea of recursive planning solely in the
context of data integration. The idea of this project is to take a
step back and try to formalize a more general framework of recursive
planning, in the wider context of AI planning. Some questions to ask
are: (1) How can we generalize the formalization of the AI planning
problem to recursive planning? (2) How can we extend planning
algorithms to create recursive plans? (3) What are other examples of
domains in which recursive planning can yield more succinct plans?
(4) How do we distinguish recursive planning from the more general,
hopeless problem of automatic programming?
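As a tiny illustration of question (3), consider reachability over a
flight-information source, a standard example in this literature: a
recursive plan of two rules covers paths of every length, whereas a
non-recursive plan needs one conjunct per path length and is unbounded.
A naive fixpoint evaluation of such a plan (the domain and rule names
are illustrative):

```python
def reachable(flights):
    """Naively evaluate the recursive plan
         reach(X, Y) :- flight(X, Y).
         reach(X, Y) :- flight(X, Z), reach(Z, Y).
       to a fixpoint over a set of flight(from, to) tuples."""
    reach = set(flights)
    while True:
        new = {(x, y) for (x, z) in flights
                      for (z2, y) in reach if z == z2}
        if new <= reach:
            return reach
        reach |= new
```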
- Wrapper Expressiveness
Investigate the relative expressiveness
of different wrapper languages. How do the tree parsing codes (without
arbitrary PERL, of course) of Cohen's SPIRAL system (paper 1.3) compare to
HLRT and the other languages considered by Kushmerick (paper 3.3)? Could
one learn Cohen's representation?
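To get a feel for the HLRT class, here is a small interpreter for
HLRT wrappers in the spirit of Kushmerick's definition (a head
delimiter, a tail delimiter, and one left/right delimiter pair per
attribute); the toy page in the test is my own:

```python
def hlrt_extract(page, h, t, lr_pairs):
    """Extract tuples using an HLRT wrapper: skip past the head
    delimiter h, then repeatedly cut out one attribute per (left,
    right) delimiter pair, stopping at the tail delimiter t."""
    tuples = []
    i = page.index(h) + len(h)
    end = page.index(t, i)
    while True:
        j = page.find(lr_pairs[0][0], i)
        if j == -1 or j > end:
            return tuples
        row = []
        for left, right in lr_pairs:
            i = page.index(left, i) + len(left)
            k = page.index(right, i)
            row.append(page[i:k])
            i = k + len(right)
        tuples.append(tuple(row))
```

On a page listing country codes inside a `<ul>`, with names in
`<b>...</b>` and codes in `<i>...</i>`, this wrapper extracts the
(name, code) tuples while skipping bold text in the page header.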
- Query Routing
Use WordNet or some other technique (a Bayes network?) to automate
and improve query routing for a system like the Search Broker.
- AdBot
Build a system that automatically filters out unwanted
advertising banners. Use machine learning to learn a classifier that
recognizes ad banners and filters them out of a source page. The
input to the learner would be a web page; the training examples would
be the images on the page, labeled by whether or not they represent
unwanted ads. The attributes
used by the learner could include features such as the URL in the HREF
enclosing the IMG tag, the SRC of the IMG, the width and height of the
image, maybe the location on the page. (Suggestion by Tessa Lau).
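The feature-extraction pass could be sketched with the standard HTML
parser; the feature set mirrors the attributes listed above (the class
name and feature encoding are my own, not part of the suggestion):

```python
from html.parser import HTMLParser

class ImageFeatureExtractor(HTMLParser):
    """Collect, for each IMG on a page, candidate learning features:
    the SRC, the width/height attributes, and the URL of any enclosing
    HREF."""

    def __init__(self):
        super().__init__()
        self.href_stack = []
        self.features = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "a":
            self.href_stack.append(attrs.get("href"))
        elif tag == "img":
            self.features.append({
                "src": attrs.get("src"),
                "width": attrs.get("width"),
                "height": attrs.get("height"),
                "href": self.href_stack[-1] if self.href_stack else None,
            })

    def handle_endtag(self, tag):
        if tag == "a" and self.href_stack:
            self.href_stack.pop()
```

Each feature dictionary would then be converted into a training vector
for whichever learner is used.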