CSE 574/Win08
Project Proposal
1. Problem -- Universal 2-D information extraction from the Web.
Ultimate goals to achieve:
1) Capability of extracting information from 2-D organized Web pages rather than from a sequence of free text;
2) Web site/template-independent extraction;
3) A machine learning approach that can handle the complexity of 2-D layout and arbitrarily nested models, requires very little training data, and can transfer across domains.
This technique would serve as an essential building block for various applications:
1) Web entity-relation discovery and search (a vertical instance: DBLife; a more general instance: Google’s attempts toward universal search);
2) Dominant block / focused area identification in Web pages (e.g., dominant image extraction for products);
3) Adaptive viewing on mobile devices;
4) Finer-grained advertisement targeting;
5) Advertisement filtering;
6) Web page change detection and tracking along the timeline.
The concrete problem in this project:
Domain: Computer science researchers’ academic home pages (UW CSE faculty & graduate students).
Information to be extracted: contact info; research interests; bio; publications; classes; students, etc.
Challenges and opportunities: Each individual’s home page acts as its own Web site, so traditional wrapper induction techniques break down. However, general template-independent features exist (e.g., contact info generally appears in one of the top blocks of a Web page; consecutive blocks with a repeating pattern tend to be instances of the same type) and can be leveraged to build universal extractors.
This problem can also be viewed as a generalization of “Web table” extraction, noting that each Web page is effectively a table with each block as a consistent table entry.
2. Approach & System Architecture
The first step is to reconstruct the 2-D structure of the Web page for later computation. A DOM (Document Object Model) based tag tree is a natural representation. However, it suffers from several serious weaknesses: first, similar content may be represented by different tags across pages; second, DOM trees generated by different templates can have entirely different topological structures, so nodes with the same content can sit in entirely different topological locations (for instance, news content can appear at various levels of the DOM tree); third, Web pages’ HTML structures evolve rapidly over time. These weaknesses are a crucial reason why HTML/DOM-structure-based wrapper induction techniques are hard to generalize and adapt.
Noticing that people can easily extract information from Web pages despite the variety of page structures, and even in unknown languages (e.g., it is easy to identify the news title and news content on an Arabic news page), our conjecture is that much high-level semantic information about the content is embedded in a Web page’s visual structure. Thus, besides the DOM tree, we also use a “Vision Tree” to represent Web pages. Each visual node contains visual information (size, position, etc.) about the corresponding block on the rendered Web page.
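To make this representation concrete, here is a minimal sketch of a vision-tree node (the class and field names are our own illustration, not a committed design); each node pairs a DOM tag with the geometry of its rendered block:

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class VisionNode:
    tag: str                  # underlying DOM tag of the block, e.g. "div"
    left: int                 # x coordinate of the block's top-left corner (pixels)
    top: int                  # y coordinate of the block's top-left corner (pixels)
    width: int                # rendered width of the block (pixels)
    height: int               # rendered height of the block (pixels)
    text: str = ""            # visible text inside the block
    children: List["VisionNode"] = field(default_factory=list)

    def nested_depth(self) -> int:
        """Depth of the deepest block nested under (and including) this one."""
        if not self.children:
            return 1
        return 1 + max(child.nested_depth() for child in self.children)
```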
We will explore features from the tag structure, the visual structure, and the content of blocks/nodes; a small featurization sketch follows the list. For example:
1. Position features: Left, Top (the coordinates of the top-left corner of a block), NestedDepth;
2. Size features: Width, Height;
3. Rich format features: FontSize, IsBoldFont, IsItalicFont;
4. Statistical features: ImageNumber, HyperlinkNumber, TextLength, ParagraphNumber, ItalicParagraphNumber, BoldParagraphNumber, TableNumber;
5. Shape features: the shape of the left boundary of the record region (particularly useful in search engine wrapper generation);
6. Repeated pattern features: IsRepeated, LevelOfRepeating, HeaderOfRepeatedPartition;
7. Color consistency;
8. Lexicon features (e.g., the keyword appears in the AI conference list).
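As a small sketch of the featurization (the function names and the subset of features shown are illustrative, and it assumes the VisionNode sketch above; rich-format, shape, and repeated-pattern features would additionally need renderer support):

```python
def count_tags(node: VisionNode, tag: str) -> int:
    """Count blocks in the subtree (including node itself) with the given DOM tag."""
    return int(node.tag == tag) + sum(count_tags(c, tag) for c in node.children)

def block_features(node: VisionNode, ai_conf_lexicon: set) -> dict:
    """Flatten one block into a feature dictionary for the downstream learner."""
    words = node.text.lower().split()
    return {
        "Left": node.left,
        "Top": node.top,
        "NestedDepth": node.nested_depth(),
        "Width": node.width,
        "Height": node.height,
        "TextLength": len(node.text),
        "ImageNumber": count_tags(node, "img"),
        "HyperlinkNumber": count_tags(node, "a"),
        "TableNumber": count_tags(node, "table"),
        # Lexicon feature: does the block mention a known AI conference name?
        "HasAIConfKeyword": any(w in ai_conf_lexicon for w in words),
    }
```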
Based on the above representation and featurization, the backend machine learning framework will be Markov Logic Networks (MLNs), an expressive and powerful statistical relational learning approach. Hopefully, MLNs’ expressiveness will greatly simplify the engineering of prior knowledge involving complex features, and the inference algorithms for MLNs can handle arbitrarily structured models, providing freedom in modeling.
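For reference, an MLN defines a log-linear distribution over possible worlds: each first-order formula i has a weight w_i, and n_i(x) counts its true groundings in world x,

P(X = x) = \frac{1}{Z} \exp\Big( \sum_i w_i \, n_i(x) \Big),

so a weighted rule such as “consecutive blocks with a repeating pattern tend to share the same label” becomes one such formula whose weight is learned from data.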
The preliminary system architecture would be as follows (the red modules in the architecture diagram are the ones we will focus on in this project).
Page classifier:
Pre-classifying Web pages and mapping them to the corresponding prior knowledge base. (We will not look at this module in the current phase, since our problem is restricted to a specific domain: computer science researchers’ home pages. It would matter, however, if we were extracting information from a general crawl.)
Page analyzer:
Rendering & analyzing Web pages, constructing the basic DOM tree & visual tree information, extracting the basic content info (text, anchors, multimedia elements), and detecting repeating patterns (a rough sketch of one such detection follows).
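One rough way to flag repeated patterns (not necessarily the algorithm we will use, and again assuming the VisionNode sketch above): consecutive sibling blocks whose tag signatures recur are likely instances of the same record type, e.g. entries of a publication list rendered by the same sub-template.

```python
def tag_signature(node: VisionNode) -> tuple:
    """A block's own tag plus the tags of its immediate children."""
    return (node.tag,) + tuple(child.tag for child in node.children)

def mark_repeated(parent: VisionNode) -> list:
    """For each child block, True if its signature repeats among its siblings."""
    sigs = [tag_signature(c) for c in parent.children]
    return [sigs.count(s) > 1 for s in sigs]
```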
Feature generator:
Computing the various statistical content features (e.g., TF-IDF), lexicon features, and visual features mentioned above.
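As an illustration of the statistical content features, a minimal TF-IDF computation over block texts might look like the following (the actual feature generator could just as well use an off-the-shelf implementation):

```python
import math
from collections import Counter

def tfidf(block_texts: list) -> list:
    """Return one {term: tf-idf score} dict per block, treating each block as a document."""
    n = len(block_texts)
    tokenized = [t.lower().split() for t in block_texts]
    df = Counter(term for tokens in tokenized for term in set(tokens))
    scores = []
    for tokens in tokenized:
        tf = Counter(tokens)
        scores.append({term: (count / len(tokens)) * math.log(n / df[term])
                       for term, count in tf.items()})
    return scores
```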
Extractor learning:
Learning the extractor models for extracting records/entity properties (which could be other entities) and entity attributes. We claim that with the highly informative visual semantic features, training will need only very few labeled samples. This matches the fact that humans can do the extraction task very well after looking at only a handful of samples. And hopefully, some of the regularities captured by learning will transfer easily to other domains, so that extractors need not be rebuilt from scratch.
Record extractor:
Entities are stored in Web pages in a nested fashion via their properties and attributes. Their properties can themselves be entities (e.g., a researcher’s papers). Extracting an entity property is roughly a matter of extracting records from Web pages, and is different from extracting individual attributes (a toy illustration of the nesting follows).
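A toy illustration of the nesting (all names and values hypothetical): the record extractor finds the publication records, while the attribute extractor fills in individual fields such as title and venue.

```python
researcher = {
    "name": "Jane Doe",
    "contact": {"email": "jdoe@cs.example.edu", "office": "CSE 101"},
    "research_interests": ["machine learning", "information extraction"],
    "publications": [   # records, each an entity in its own right
        {"title": "A Paper Title", "venue": "AAAI", "year": 2007},
        {"title": "Another Paper Title", "venue": "ICML", "year": 2006},
    ],
}
```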
Attribute extractor:
Extracting entity attributes.
The record extractor and attribute extractor models could be learned jointly, and so could the extraction process itself; the mutual reinforcement between the two will benefit both.
3. Development chunks
Page classifier:
Done by limiting the domain.
Page analyzer:
Done.
Feature generator:
Implementing the features in code. Development effort is roughly linear in the dimension of the core feature space. (We are not sure about this dimension yet; the initial estimate is under 100.)
Learner and extractors (learning and inference engine):
We will use the Alchemy software package to perform the learning and inference. (Its code may need to be modified for online computation of complex features.)
4. Milestone schedules
We will start by building basic i.i.d. classifiers (log-linear models, in the standard form given below) for the record extractor and attribute extractor. Then we will connect them in a relational and joint way.
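For reference, the standard log-linear form of such a base classifier, with block features f_k(x, y) as described above and y ranging over labels such as contact info, publication, or bio:

P(y \mid x) = \frac{\exp\big(\sum_k w_k f_k(x, y)\big)}{\sum_{y'} \exp\big(\sum_k w_k f_k(x, y')\big)}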
The estimated schedule would be:
1) Labeling, feature engineering & initial attempts at base classifiers – 1.5 weeks;
2) Completing the base classifiers for the attributes (contact info, papers (no segmentation), bio, students, classes) – 1 week;
3) Building the joint model and getting it to work better – n weeks.
5. Source of training data
Crawled from UW CSE Web sites and manually labeled.
6. Evaluation methods
Precision & recall (standard definitions below), computational efficiency in terms of extraction speed, training time taken, and amount of training data required.
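The standard definitions, scored against the manual labels:

\text{Precision} = \frac{|\text{correct extractions}|}{|\text{all extractions}|}, \qquad \text{Recall} = \frac{|\text{correct extractions}|}{|\text{labeled items}|}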