The Strudelizer:
Extracting Semi-structured Data from Web Pages
Brian Michalowski
Steve Wolfman
Introduction
Large web sites are difficult to maintain. They
contain both data and layouts that need to be kept consistent, and maintaining
this consistency often requires a lot of tedious manual editing. The Strudel
system, created at AT&T, is designed to alleviate these problems by
storing the data in a database and using queries and templates to create
the finished HTML pages. This makes the process of keeping data and layouts
consistent much easier, and allows web masters to easily create multiple
views of the data to tailor portions of the web site to different visitors.
While Strudel makes web site management easier,
many users may be reluctant to use Strudel because of the large amount
of time they would need to invest. Some of this time would be spent learning
to interact with Strudel, a problem that well-designed interfaces could
alleviate. A potentially more severe problem is the time it would take
to convert an existing Web site over to Strudel. Even for modest web sites
containing hundreds of pages, converting to Strudel would be time-consuming
and tedious. A system that could help with this conversion process would
greatly reduce the start-up costs of using Strudel, allowing the user to
reap the benefits of Strudel much sooner.
This document describes the Strudelizer, a
system designed to ease the transition from a traditional web site to a
Strudel-based web site. We present a design for the system, and suggest
future work.
Design principles
Aid the user
We believe that the Strudelizer would be more
effective as a computer-aided system, performing repetitive tasks
for the user when possible but continually asking the user for feedback,
than as a fully automated system. A fully automatic system would be ideal
if all of the information it needed were contained in the web pages and
if it were 100% accurate, but we don't believe that either criterion is
feasible. Many pages do not contain labels for all of the data within them.
For example, a web page at the Department of Computer Science and Engineering
may list the name "Dan Weld" without mentioning that he is a faculty member.
It is expected that the visitor to the site will infer that he is a faculty
member. A fully automatic system would not be able to recognize this.
More importantly, a computer-automated system
would only be helpful if it worked 100% of the time. However, even if all
the necessary information were present in each page, achieving this level
of accuracy seems unlikely given the huge amount of variety across web
sites. If the system did not function perfectly, the user would need to
go through the entire site checking for errors, a process that would be
nearly as time-consuming as the original task of converting the web site
over to Strudel. For this reason, we propose a computer-aided system.
This approach has the added benefit of allowing
the user to develop a mental model of the web site as he or she interacts
with the Strudelizer. After this process the user will be much better prepared
to write queries and templates to create the new web site than a user who
was totally excluded from the Strudelizing process.
Take advantage of structure in HTML
The HTML structure of a web page can reveal
a great deal about the structure of the data it contains, a fact that is
very useful for extracting data from a page. We designed a pattern
schema, closely reflecting the HTML parse tree of a page, that can express
recurring patterns in the HTML code.
A pattern is a hierarchical collection of pattern-nodes,
where a pattern-node represents a single node of the parse tree (like a
UL node for an unordered list). If the pattern-node matches a node of the
parse tree, the pattern matcher will then try to match the pattern-node's
subtree with the children of the node in the parse tree. Each pattern-node
also contains information about whether it represents a DDL node or a DDL
field (or neither). Even if a pattern-node does not represent data, its
children may well contain data (e.g., a TR node may not represent data,
but its children usually do).
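The pattern schema described above can be sketched in code. This is a minimal illustration, not the Strudelizer's actual implementation; the names (ParseNode, PatternNode, match) and the simplified one-to-one child matching are assumptions made for clarity.

```python
class ParseNode:
    """One node of an HTML parse tree: a tag, its children, and any text."""
    def __init__(self, tag, children=None, text=""):
        self.tag = tag
        self.children = children or []
        self.text = text

class PatternNode:
    """One node of a pattern. `label` names the data field this node
    represents; None means the node carries no data itself, though its
    children may (e.g., a TR node whose TD children hold the data)."""
    def __init__(self, tag, label=None, children=None):
        self.tag = tag
        self.label = label
        self.children = children or []

def match(pattern, node, out):
    """Try to match `pattern` against `node`, collecting labeled data
    into the dict `out`. Returns True if the match succeeds."""
    if pattern.tag != node.tag:
        return False
    if pattern.label is not None:
        out.setdefault(pattern.label, []).append(node.text)
    if len(pattern.children) > len(node.children):
        return False
    # Recurse: match the pattern's subtree against the node's children.
    return all(match(p, c, out)
               for p, c in zip(pattern.children, node.children))
```

For example, a UL pattern whose LI children are labeled "faculty" would, when matched against a faculty listing, collect each list item's text under that label.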
These patterns are flexible in that they can
match any HTML structure. Moreover, they can be constructed easily by asking
the user at each level of the parse tree whether that object is a data
block, whether it should be expanded further, and if it should be a data
block what label that data should get. If the current object is a
data block, it represents Strudel data and its pattern-node will use its
name to classify the data. If it should be expanded further, pattern-nodes
will be built recursively for its children.
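The recursive construction just described might look like the following sketch, where `ask` stands in for the interactive prompt (a hypothetical callback, introduced here only for illustration) and patterns are built as nested dictionaries.

```python
def build_pattern(node, ask):
    """Build a pattern by walking the parse tree and asking the user at
    each level. `node` is a dict with "tag" and "children" keys; `ask`
    is a callback returning one of:
      ("data", label)  - this node is a data block with that label
      ("expand", None) - recurse into the node's children
      ("skip", None)   - ignore this subtree entirely
    """
    answer, label = ask(node)
    if answer == "data":
        return {"tag": node["tag"], "label": label, "children": []}
    if answer == "expand":
        kids = [build_pattern(c, ask) for c in node["children"]]
        return {"tag": node["tag"], "label": None,
                "children": [k for k in kids if k is not None]}
    return None  # skipped subtree contributes nothing to the pattern
```

In practice `ask` would be wired to the Strudelizer's user interface; here it could just as well be a scripted function for testing.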
An important feature of this format is
that it allows the Strudelizer to reuse patterns. The Strudelizer can attempt
to apply a previously learned or pre-programmed pattern before bothering
the user. If the pattern matches, the Strudelizer can ask the user to verify
the extracted data, or if the user has enough confidence in the Strudelizer,
it can enter the data without verification.
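The reuse strategy amounts to a simple control loop: try stored patterns first, and fall back to the interactive mode only when none apply. A sketch, with `known_patterns`, `verify`, and `ask_user` as hypothetical interfaces assumed for illustration:

```python
def strudelize_node(node, known_patterns, verify, ask_user):
    """Try previously learned or pre-programmed patterns before
    bothering the user. Each pattern is a callable returning extracted
    data (a dict) or None if it does not match; `verify` lets the user
    confirm extracted data; `ask_user` runs the interactive fallback."""
    for pattern in known_patterns:
        data = pattern(node)
        if data is not None and verify(node, data):
            return data
    # No stored pattern matched: fall back to asking the user.
    return ask_user(node)
```

A trusting user could supply a `verify` that always returns True, entering matched data without confirmation as the text suggests.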
Take advantage of structure in text blocks
Blocks of text may contain data that is not delimited
by HTML tags. For example, a professor's research interests may be presented
as a comma-delimited list, and phone numbers appear only in certain patterns.
The Strudelizer should be able to accurately extract this information while
prompting the user for help as infrequently as possible. We have a couple
of ideas about how this would work. One approach would be to allow the
user to write regular expressions indicating how this information should
be parsed. This would require the user to know some programming concepts,
but is simple and would give the user complete control over the data extraction
process. A more intelligent system might ask the user to demonstrate how
to extract the data for one particular example, and would then infer how
to handle the other examples. If this approach worked, it would make the
conversion much easier on the user.
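The regular-expression approach can be made concrete with a small sketch. The particular patterns below (a comma-delimited interest list, two common U.S. phone formats) are illustrative assumptions, not part of the Strudelizer design; in the proposed system the user would supply such expressions.

```python
import re

# Hypothetical user-supplied pattern for U.S.-style phone numbers,
# e.g. "(206) 555-1234" or "206-555-1234".
PHONE_RE = re.compile(r"\(\d{3}\)\s*\d{3}-\d{4}|\d{3}-\d{3}-\d{4}")

def extract_interests(text):
    """Split a comma-delimited list of research interests."""
    return [item.strip() for item in text.split(",") if item.strip()]

def extract_phones(text):
    """Pull out every phone number matching the patterns above."""
    return PHONE_RE.findall(text)
```

The demonstration-based alternative would replace these hand-written expressions with ones inferred from a single user-labeled example.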
Future Work
The Strudelizer is still in its infancy, and is
more a design proposal than an implemented system. However, even beyond
implementing the system as described, there is ample room for future
work:
Interface Issues
Since the Strudelizer is intended to make the
conversion of a web site to Strudel as easy as possible, the Strudelizer
itself should be as easy to use as possible. (Especially since the Strudelizer
could conceivably be an interface to Strudel.) It is not clear what the
granularity of interaction should be between the Strudelizer and the user.
For example, asking the user for help at every node in the parse tree would
give the user control of the conversion process, but would be time-consuming
and would probably bog down the user more than necessary. On the other
hand, one could envision a Strudelizer language in which the user could
express patterns that he or she wanted the Strudelizer to interpret, an
approach which would be more time-consuming for the user initially but
greatly speed up the conversion process.
It is also not clear how to display the Strudelizer's
navigation through a document. Possibilities include outputting the first
few lines of the current section of HTML code that is being examined, and
highlighting in a browser window the section of the page being examined.
The latter choice is more intuitive, but may be difficult to implement.
Functionality
The system proposed so far covers only some of
the tasks that a full-fledged Strudelizer would need to do. For example,
there are still many issues involved in determining how free text that
does not fit any regular expression should be entered into Strudel. Should
a paragraph be considered a data item? Should each sentence be an individual
data item? How should the links in text and the labels for those links
be treated? These issues will clearly need to be resolved before
the Strudelizer is ready for mass consumption.
So far we have only discussed how to populate a Strudel
database with information from a web site. The full-fledged Strudelizer
should also be able to automatically create queries and templates based
on the web pages it has seen, so that the user can recreate the original
web site with little effort. While creating these queries and templates
is an imposing task, it may be feasible, since throughout this process
the user has expressed patterns describing the relationship between the
text and the data.
Evaluation
Finally, we need some way of gauging the effectiveness
of the Strudelizer. Since this is going to be a computer-aided system and
not fully automatic, the measure will need to be based more on subjective
factors such as user satisfaction than on accuracy or efficiency.
Conclusions
We have presented a design for a system to ease
the transition from a traditional to a Strudel-based Web site. We have
described a pattern-matching system to remove much of the redundancy from
this task, and have listed several issues that will need to be resolved
as well as room for future work. The next step will be to implement
these design ideas and gauge their effectiveness, eventually trying them
out by converting an actual Web site.