CSE 574/Win08
Project Proposal
1. Problem -- Universal 2-D information extraction from the Web.
Ultimate goals to achieve:
1) Capability of extracting information from 2-D organized Web pages rather than from a sequence of free text;
2) Web site/template-independent extraction;
3) A machine learning approach that can handle the complexity of 2-D layout and arbitrarily nested models, requires very little training data, and can transfer across domains.
This technique would serve as an essential building block for various applications:
1) Web entity-relation discovery and search (a vertical instance: DBLife; a more general instance: Google’s attempts toward universal search);
2) Dominant block / focused area identification in Web pages (e.g., dominant image extraction for products);
3) Adaptive viewing on mobile devices;
4) Finer-grained advertisement targeting;
5) Advertisement filtering;
6) Web page change detection and tracking along the timeline.
The concrete problem in this project:
Domain: Computer science researchers’ academic home pages (UW CSE faculty & graduate students).
Information to be extracted: contact info; research interests; bio; publications; classes; students, etc.
Challenges and opportunities: Each individual’s home page acts as its own Web site, so traditional wrapper induction techniques break down. However, general template-independent features exist (e.g., contact info generally appears in one of the top blocks of a Web page; consecutive blocks with a repeating pattern tend to be instances of the same type) and can be leveraged to build universal extractors.
This problem can also be viewed as a generalization of “Web table” extraction, noting that each Web page is effectively a table with each block as a consistent table entry.
2. Approach & System Architecture
The first step is to reconstruct the 2-D structure of the Web page for later computation. A DOM (Document Object Model) based tag tree is a natural representation. However, it suffers from several serious weaknesses: first, similar content may be represented by different tags across pages; second, DOM trees generated by different templates can have entirely different topological structures, so nodes with the same content can sit in entirely different topological locations (for instance, news content can appear at various levels of the DOM tree); third, Web pages’ HTML structures evolve rapidly over time. These weaknesses are a crucial reason why HTML/DOM-structure-based wrapper induction techniques are hard to generalize and adapt.
Noticing that people can easily extract information from Web pages despite the variety of page structures, and even in unknown languages (e.g., it is easy to identify the news title and news content on an Arabic news page), our conjecture is that much high-level semantic information about the content is embedded in a Web page’s visual structure. Thus, besides the DOM tree, we also use a “Vision Tree” to represent Web pages. Each visual node contains visual information (size, position, etc.) about the corresponding block on the rendered Web page.
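To make this representation concrete, here is a minimal sketch of a vision-tree node (the class and field names are our own illustration, not a committed design); each node pairs a DOM tag with the geometry of its rendered block:

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class VisionNode:
    tag: str                  # underlying DOM tag of the block, e.g. "div"
    left: int                 # x coordinate of the block's top-left corner (pixels)
    top: int                  # y coordinate of the block's top-left corner (pixels)
    width: int                # rendered width of the block (pixels)
    height: int               # rendered height of the block (pixels)
    text: str = ""            # visible text inside the block
    children: List["VisionNode"] = field(default_factory=list)

    def nested_depth(self) -> int:
        """Depth of the deepest block nested under (and including) this one."""
        if not self.children:
            return 1
        return 1 + max(child.nested_depth() for child in self.children)
```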
We will explore features from the tag structure, the visual structure, and the content of blocks/nodes; a small featurization sketch follows the list. For example:
1. Position features: Left, Top (the coordinates of the top-left corner of a block), NestedDepth;
2. Size features: Width, Height;
3. Rich format features: FontSize, IsBoldFont, IsItalicFont;
4. Statistical features: ImageNumber, HyperlinkNumber, TextLength, ParagraphNumber, ItalicParagraphNumber, BoldParagraphNumber, TableNumber;
5. Shape features: the shape of the left boundary of the record region (particularly useful in search engine wrapper generation);
6. Repeated pattern features: IsRepeated, LevelOfRepeating, HeaderOfRepeatedPartition;
7. Color consistency;
8. Lexicon features (e.g., the keyword appears in the AI conference list).
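As a small sketch of the featurization (the function names and the subset of features shown are illustrative, and it assumes the VisionNode sketch above; rich-format, shape, and repeated-pattern features would additionally need renderer support):

```python
def count_tags(node: VisionNode, tag: str) -> int:
    """Count blocks in the subtree (including node itself) with the given DOM tag."""
    return int(node.tag == tag) + sum(count_tags(c, tag) for c in node.children)

def block_features(node: VisionNode, ai_conf_lexicon: set) -> dict:
    """Flatten one block into a feature dictionary for the downstream learner."""
    words = node.text.lower().split()
    return {
        "Left": node.left,
        "Top": node.top,
        "NestedDepth": node.nested_depth(),
        "Width": node.width,
        "Height": node.height,
        "TextLength": len(node.text),
        "ImageNumber": count_tags(node, "img"),
        "HyperlinkNumber": count_tags(node, "a"),
        "TableNumber": count_tags(node, "table"),
        # Lexicon feature: does the block mention a known AI conference name?
        "HasAIConfKeyword": any(w in ai_conf_lexicon for w in words),
    }
```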
Based on the above representation and featurization, the backend machine learning framework will be Markov Logic Networks (MLNs), an expressive and powerful statistical relational learning approach. Hopefully, MLNs’ expressiveness will greatly simplify the engineering of prior knowledge involving complex features, and the inference algorithms for MLNs can handle arbitrarily structured models, providing freedom in modeling.
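For reference, an MLN defines a log-linear distribution over possible worlds: each first-order formula i has a weight w_i, and n_i(x) counts its true groundings in world x,

P(X = x) = \frac{1}{Z} \exp\Big( \sum_i w_i \, n_i(x) \Big),

so a weighted rule such as “consecutive blocks with a repeating pattern tend to share the same label” becomes one such formula whose weight is learned from data.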
The preliminary system architecture would be as follows (the red modules in the architecture diagram are the ones we will focus on in this project).
Page classifier:
Pre-classifying Web pages and mapping them to the corresponding prior knowledge base. (We will not look at this module in the current phase, since our problem is restricted to a specific domain: computer science researchers’ home pages. It would matter, however, if we were extracting information from a general crawl.)
Page analyzer:
Rendering & analyzing Web pages, constructing the basic DOM tree & visual tree information, extracting the basic content info (text, anchors, multimedia elements), and detecting repeating patterns (a rough sketch of one such detection follows).
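One rough way to flag repeated patterns (not necessarily the algorithm we will use, and again assuming the VisionNode sketch above): consecutive sibling blocks whose tag signatures recur are likely instances of the same record type, e.g. entries of a publication list rendered by the same sub-template.

```python
def tag_signature(node: VisionNode) -> tuple:
    """A block's own tag plus the tags of its immediate children."""
    return (node.tag,) + tuple(child.tag for child in node.children)

def mark_repeated(parent: VisionNode) -> list:
    """For each child block, True if its signature repeats among its siblings."""
    sigs = [tag_signature(c) for c in parent.children]
    return [sigs.count(s) > 1 for s in sigs]
```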
Feature generator:
Computing the various statistical content features (e.g., TF-IDF), lexicon features, and visual features mentioned above.
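As an illustration of the statistical content features, a minimal TF-IDF computation over block texts might look like the following (the actual feature generator could just as well use an off-the-shelf implementation):

```python
import math
from collections import Counter

def tfidf(block_texts: list) -> list:
    """Return one {term: tf-idf score} dict per block, treating each block as a document."""
    n = len(block_texts)
    tokenized = [t.lower().split() for t in block_texts]
    df = Counter(term for tokens in tokenized for term in set(tokens))
    scores = []
    for tokens in tokenized:
        tf = Counter(tokens)
        scores.append({term: (count / len(tokens)) * math.log(n / df[term])
                       for term, count in tf.items()})
    return scores
```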
Extractor learning:
Learning the extractor models for extracting records/entity properties (which could be other entities) and entity attributes. We claim that with the highly informative visual semantic features, training will need only very few labeled samples. This matches the fact that humans can do the extraction task very well after looking at only a handful of samples. And hopefully, some of the regularities captured by learning will transfer easily to other domains, so that extractors need not be rebuilt from scratch.
Record extractor:
Entities are stored in Web pages in a nested fashion via their properties and attributes. Their properties can themselves be entities (e.g., a researcher’s papers). Extracting an entity property is roughly a matter of extracting records from Web pages, and is different from extracting individual attributes (a toy illustration of the nesting follows).
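A toy illustration of the nesting (all names and values hypothetical): the record extractor finds the publication records, while the attribute extractor fills in individual fields such as title and venue.

```python
researcher = {
    "name": "Jane Doe",
    "contact": {"email": "jdoe@cs.example.edu", "office": "CSE 101"},
    "research_interests": ["machine learning", "information extraction"],
    "publications": [   # records, each an entity in its own right
        {"title": "A Paper Title", "venue": "AAAI", "year": 2007},
        {"title": "Another Paper Title", "venue": "ICML", "year": 2006},
    ],
}
```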
Attribute extractor:
Extracting entity attributes.
The record extractor and attribute extractor models could be learned jointly, and so could the extraction process itself; the mutual reinforcement between the two will benefit both.
3. Development chunks
Page classifier:
Done by limiting the domain.
Page analyzer:
Done.
Feature generator:
Implementing the features in code. Development effort is roughly linear in the dimension of the core feature space. (We are not sure about this dimension yet; the initial estimate is under 100.)
Learner and extractors (learning and inference engine):
We will use the Alchemy software package to perform the learning and inference. (Its code may need to be modified for online computation of complex features.)
4. Milestone schedules
We will start by building basic i.i.d. classifiers (log-linear models, in the standard form given below) for the record extractor and attribute extractor. Then we will connect them in a relational and joint way.
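For reference, the standard log-linear form of such a base classifier, with block features f_k(x, y) as described above and y ranging over labels such as contact info, publication, or bio:

P(y \mid x) = \frac{\exp\big(\sum_k w_k f_k(x, y)\big)}{\sum_{y'} \exp\big(\sum_k w_k f_k(x, y')\big)}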
The estimated schedule would be:
1) Labeling, feature engineering & initial attempts at base classifiers – 1.5 weeks;
2) Completing the base classifiers for the attributes (contact info, papers (no segmentation), bio, students, classes) – 1 week;
3) Building the joint model and getting it to work better – n weeks.
5. Source of training data
Crawled from UW CSE Web sites and manually labeled.
6. Evaluation methods
Precision & recall (standard definitions below), computational efficiency in terms of extraction speed, training time taken, and amount of training data required.
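The standard definitions, scored against the manual labels:

\text{Precision} = \frac{|\text{correct extractions}|}{|\text{all extractions}|}, \qquad \text{Recall} = \frac{|\text{correct extractions}|}{|\text{labeled items}|}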