The Strudelizer:
Extracting Semi-structured Data from Web Pages
Brian Michalowski
Steve Wolfman
Introduction
Large web sites are difficult to maintain. They
contain both data and layouts that need to be kept consistent, and maintaining
this consistency often requires a lot of tedious manual editing. The Strudel
system, created at AT&T, is designed to alleviate these problems by
storing the data in a database and using queries and templates to create
the finished HTML pages. This makes the process of keeping data and layouts
consistent much easier, and allows web masters to easily create multiple
views of the data to tailor portions of the web site to different visitors.
While Strudel makes web site management easier,
many users may be reluctant to use Strudel because of the large amount
of time they would need to invest. Some of this time would be spent learning
to interact with Strudel, a problem that well-designed interfaces could
alleviate. A potentially more severe problem is the time it would take
to convert an existing Web site over to Strudel. Even for modest web sites
containing hundreds of pages, converting to Strudel would be time-consuming
and tedious. A system that could help with this conversion process would
greatly reduce the start-up costs of using Strudel, allowing the user to
reap the benefits of Strudel much sooner.
This document describes the Strudelizer, a
system designed to ease the transition from a traditional web site to a
Strudel-based web site. We present a design for the system, and suggest
future work.
Design principles
Aid the user
We believe that the Strudelizer would be more
effective as a computer-aided system, performing repetitive tasks
for the user when possible but continually asking the user for feedback,
than as a fully automated system. A fully automatic system would be ideal
if all of the information it needed were contained in the web pages and
if it were 100% accurate, but we don't believe that either criterion is
feasible. Many pages do not contain labels for all of the data within them.
For example, a web page at the Department of Computer Science and Engineering
may list the name "Dan Weld" without mentioning that he is a faculty member.
It is expected that the visitor to the site will infer that he is a faculty
member. A fully automatic system would not be able to recognize this.
More importantly, a computer-automated system
would only be helpful if it worked 100% of the time. However, even if all
the necessary information were present in each page, achieving this level
of accuracy seems unlikely given the huge amount of variety across web
sites. If the system did not function perfectly, the user would need to
go through the entire site checking for errors, a process that would be
nearly as time-consuming as the original task of converting the web site
over to Strudel. For this reason, we propose a computer-aided system.
This approach has the added benefit of allowing
the user to develop a mental model of the web site as he or she interacts
with the Strudelizer. After this process the user will be much better prepared
to write queries and templates to create the new web site than a user who
was totally excluded from the Strudelizing process.
Take advantage of structure in HTML
The HTML structure of a web page can reveal
a great deal about the structure of the data it contains, a fact that is
very useful for extracting data from a page. We designed a pattern
schema, closely reflecting the HTML parse tree of a page, that can express
recurring patterns in the HTML code.
A pattern is a hierarchical collection of pattern-nodes,
where a pattern-node represents a single node of the parse tree (like a
UL node for an unordered list). If the pattern-node matches a node of the
parse tree, the pattern matcher will then try to match the pattern-node's
subtree with the children of the node in the parse tree. Each pattern-node
also contains information about whether it represents a DDL node or a DDL
field (or neither). Even if a pattern-node does not represent data, its
children may well contain data (e.g., a TR node may not represent data,
but its children usually do).
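The pattern schema described above can be sketched in code. This is a minimal illustration, not the Strudelizer's actual implementation; the names (ParseNode, PatternNode, match) and the simplified one-to-one child matching are assumptions made for clarity.

```python
class ParseNode:
    """One node of an HTML parse tree: a tag, its children, and any text."""
    def __init__(self, tag, children=None, text=""):
        self.tag = tag
        self.children = children or []
        self.text = text

class PatternNode:
    """One node of a pattern. `label` names the data field this node
    represents; None means the node carries no data itself, though its
    children may (e.g., a TR node whose TD children hold the data)."""
    def __init__(self, tag, label=None, children=None):
        self.tag = tag
        self.label = label
        self.children = children or []

def match(pattern, node, out):
    """Try to match `pattern` against `node`, collecting labeled data
    into the dict `out`. Returns True if the match succeeds."""
    if pattern.tag != node.tag:
        return False
    if pattern.label is not None:
        out.setdefault(pattern.label, []).append(node.text)
    if len(pattern.children) > len(node.children):
        return False
    # Recurse: match the pattern's subtree against the node's children.
    return all(match(p, c, out)
               for p, c in zip(pattern.children, node.children))
```

For example, a UL pattern whose LI children are labeled "faculty" would, when matched against a faculty listing, collect each list item's text under that label.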
These patterns are flexible in that they can
match any HTML structure. Moreover, they can be constructed easily by asking
the user at each level of the parse tree whether that object is a data
block, whether it should be expanded further, and if it should be a data
block what label that data should get. If the current object is a
data block, it represents Strudel data and its pattern-node will use its
name to classify the data. If it should be expanded further, pattern-nodes
will be built recursively for its children.
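The recursive construction just described might look like the following sketch, where `ask` stands in for the interactive prompt (a hypothetical callback, introduced here only for illustration) and patterns are built as nested dictionaries.

```python
def build_pattern(node, ask):
    """Build a pattern by walking the parse tree and asking the user at
    each level. `node` is a dict with "tag" and "children" keys; `ask`
    is a callback returning one of:
      ("data", label)  - this node is a data block with that label
      ("expand", None) - recurse into the node's children
      ("skip", None)   - ignore this subtree entirely
    """
    answer, label = ask(node)
    if answer == "data":
        return {"tag": node["tag"], "label": label, "children": []}
    if answer == "expand":
        kids = [build_pattern(c, ask) for c in node["children"]]
        return {"tag": node["tag"], "label": None,
                "children": [k for k in kids if k is not None]}
    return None  # skipped subtree contributes nothing to the pattern
```

In practice `ask` would be wired to the Strudelizer's user interface; here it could just as well be a scripted function for testing.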
An important feature of this format is
that it allows the Strudelizer to reuse patterns. The Strudelizer can attempt
to apply a previously learned or pre-programmed pattern before bothering
the user. If the pattern matches, the Strudelizer can ask the user to verify
the extracted data, or if the user has enough confidence in the Strudelizer,
it can enter the data without verification.
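The reuse strategy amounts to a simple control loop: try stored patterns first, and fall back to the interactive mode only when none apply. A sketch, with `known_patterns`, `verify`, and `ask_user` as hypothetical interfaces assumed for illustration:

```python
def strudelize_node(node, known_patterns, verify, ask_user):
    """Try previously learned or pre-programmed patterns before
    bothering the user. Each pattern is a callable returning extracted
    data (a dict) or None if it does not match; `verify` lets the user
    confirm extracted data; `ask_user` runs the interactive fallback."""
    for pattern in known_patterns:
        data = pattern(node)
        if data is not None and verify(node, data):
            return data
    # No stored pattern matched: fall back to asking the user.
    return ask_user(node)
```

A trusting user could supply a `verify` that always returns True, entering matched data without confirmation as the text suggests.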
Take advantage of structure in text blocks
Blocks of text may contain data that is not delimited
by HTML tags. For example, a professor's research interests may be presented
as a comma-delimited list, and phone numbers appear only in certain patterns.
The Strudelizer should be able to accurately extract this information while
prompting the user for help as infrequently as possible. We have a couple
of ideas about how this would work. One approach would be to allow the
user to write regular expressions indicating how this information should
be parsed. This would require the user to know some programming concepts,
but is simple and would give the user complete control over the data extraction
process. A more intelligent system might ask the user to demonstrate how
to extract the data for one particular example, and would then infer how
to handle the other examples. If this approach worked, it would make the
conversion much easier on the user.
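The regular-expression approach can be made concrete with a small sketch. The particular patterns below (a comma-delimited interest list, two common U.S. phone formats) are illustrative assumptions, not part of the Strudelizer design; in the proposed system the user would supply such expressions.

```python
import re

# Hypothetical user-supplied pattern for U.S.-style phone numbers,
# e.g. "(206) 555-1234" or "206-555-1234".
PHONE_RE = re.compile(r"\(\d{3}\)\s*\d{3}-\d{4}|\d{3}-\d{3}-\d{4}")

def extract_interests(text):
    """Split a comma-delimited list of research interests."""
    return [item.strip() for item in text.split(",") if item.strip()]

def extract_phones(text):
    """Pull out every phone number matching the patterns above."""
    return PHONE_RE.findall(text)
```

The demonstration-based alternative would replace these hand-written expressions with ones inferred from a single user-labeled example.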
Future Work
The Strudelizer is still in its infancy, and is
more a design proposal than an implemented system. However, even beyond
implementing the system as described, there is ample room for future
work:
Interface Issues
Since the Strudelizer is intended to make the
conversion of a web site to Strudel as easy as possible, the Strudelizer
itself should be as easy to use as possible. (Especially since the Strudelizer
could conceivably be an interface to Strudel.) It is not clear what the
granularity of interaction should be between the Strudelizer and the user.
For example, asking the user for help at every node in the parse tree would
give the user control of the conversion process, but would be time-consuming
and would probably bog down the user more than necessary. On the other
hand, one could envision a Strudelizer language in which the user could
express patterns that he or she wanted the Strudelizer to interpret, an
approach which would be more time-consuming for the user initially but
greatly speed up the conversion process.
It is also not clear how to display the Strudelizer's
navigation through a document. Possibilities include outputting the first
few lines of the current section of HTML code that is being examined, and
highlighting in a browser window the section of the page being examined.
The latter choice is more intuitive, but may be difficult to implement.
Functionality
The system proposed so far covers only some of
the tasks that a full-fledged Strudelizer would need to do. For example,
there are still many issues involved in determining how free text that
does not fit any regular expression should be entered into Strudel. Should
a paragraph be considered a data item? Should each sentence be an individual
data item? How should the links in text and the labels for those links
be treated? These issues will clearly need to be resolved before
the Strudelizer is ready for mass consumption.
So far we have only discussed how to populate a Strudel
database with information from a web site. The full-fledged Strudelizer
should also be able to automatically create queries and templates based
on the web pages it has seen, so that the user can recreate the original
web site with little effort. While creating these queries and templates
is an imposing task, it may be feasible, since throughout this process
the user has expressed patterns describing the relationship between the
text and the data.
Evaluation
Finally, we need some way of gauging the effectiveness
of the Strudelizer. Since this is going to be a computer-aided system and
not fully automatic, the measure will need to be based more on subjective
factors such as user satisfaction than on accuracy or efficiency.
Conclusions
We have presented a design for a system to ease
the transition from a traditional to a Strudel-based Web site. We have
described a pattern-matching system to remove much of the redundancy from
this task, and have listed several issues that will need to be resolved
as well as room for future work. The next step will be to implement
these design ideas and gauge their effectiveness, eventually trying them
out by converting an actual Web site.