AdBot Project Report

Tessa Lau and Corin Anderson
CSE 574, 3/20/98

AdBot is a system that learns to classify advertising images on the web. It uses the RIPPER rule learning system to generalize rules that describe advertising images, given a number of training examples of both positive (ad) and negative (non-ad) image instances.

1. Introduction

The economic model of web content companies has been evolving over the last several years. The piece of this model salient to our work is how a web content company brings in revenue: in recent years, the primary revenue source for such companies has been other companies that pay to place banner ads on the content company's pages.

While this model seems to have taken hold, many web surfers would rather not see banner ads on the pages they visit. The reasons vary: for some users, banner ads take time to download, especially over slow modem connections for which users pay a premium; for others, the banners are simply an annoyance they would like removed from the page.

In our work, we have addressed the needs of web users who would like banner ads automatically removed from the pages that they view. We describe a system, AdBot, that uses machine learning to classify web images (as banner ads or not). This classifier could then be used in an HTTP proxy to automatically filter out banner ads while browsing the web. This paper describes the AdBot architecture, presents our results, and concludes with ideas for future work.

2. AdBot Architecture

We used the RIPPER rule learning system as the learning component of AdBot. RIPPER uses a sequential covering algorithm to learn ordered sets of rules that describe possibly noisy training data. To facilitate classifying text, RIPPER supports set-valued attributes.
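The covering strategy RIPPER employs can be illustrated with a short sketch: repeatedly learn one rule covering some positive examples, remove the positives it covers, and continue until none remain. The helper names (learn_one_rule, covers) are hypothetical stand-ins for this sketch, not RIPPER's actual interface, and the sketch omits RIPPER's pruning and optimization phases.

```python
def sequential_covering(positives, negatives, learn_one_rule, covers):
    """Learn an ordered list of rules via sequential covering.

    `learn_one_rule(pos, neg)` greedily grows a single rule that covers
    some of `pos` while excluding `neg` (or returns None if it cannot);
    `covers(rule, example)` tests whether a rule matches an example.
    """
    rules = []
    remaining = list(positives)
    while remaining:
        rule = learn_one_rule(remaining, negatives)
        if rule is None:  # no further rule improves on the default class
            break
        rules.append(rule)
        # Drop the positives this rule covers; negatives are kept so they
        # continue to constrain later rules.
        remaining = [ex for ex in remaining if not covers(rule, ex)]
    return rules
```

Because each learned rule removes the examples it explains, the resulting rule list is ordered: earlier rules take precedence when classifying.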


Figure 1: AdBot's interface for collecting training examples. Given a URL (in the box at the top), all the images on that web page are fetched and displayed in the window. The checkboxes on the left are used to mark positive (ad) examples.

We model the ad classification problem as follows. Each image viewed on the web is either a positive (ad) or negative (not ad) example. A graphical interface (Figure 1) aided in collecting training examples.

Figure 2: HTML context surrounding the image. We include as an attribute the information in the tags near the image (red) in the HTML parse tree. We chose the three parents of the image node (green) and their immediate left and right siblings (blue).
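The context collection of Figure 2 can be sketched as a walk up a parse tree: visit the image's three nearest ancestors and, for each, its immediate left and right siblings. The Node class below is our own minimal illustration, not the parser AdBot actually used.

```python
class Node:
    """A minimal HTML parse-tree node (illustrative only)."""
    def __init__(self, tag, parent=None):
        self.tag = tag
        self.parent = parent
        self.children = []
        if parent is not None:
            parent.children.append(self)

def context_tags(image, depth=3):
    """Collect the tags of `image`'s nearest `depth` ancestors and each
    ancestor's immediate left and right siblings, as in Figure 2."""
    tags = []
    node = image
    for _ in range(depth):
        parent = node.parent
        if parent is None:
            break
        tags.append(parent.tag)
        if parent.parent is not None:
            siblings = parent.parent.children
            i = siblings.index(parent)
            if i > 0:                      # immediate left sibling
                tags.append(siblings[i - 1].tag)
            if i + 1 < len(siblings):      # immediate right sibling
                tags.append(siblings[i + 1].tag)
        node = parent
    return tags
```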

Each training example is modelled using a number of attributes. The first three attributes (HTML context) are encoded as binary attributes, and the last three (image size and aspect ratio) are encoded as real-valued attributes. All other attributes are encoded using RIPPER's set-valued attributes.
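A hedged sketch of how one image might be encoded as a training example, based on the attribute names that appear in the learned rules below (BASEURL, IMAGEURL, ANCHORURL, TAGS, HEIGHT). The tokenizer and field names are our illustration of the encoding scheme, not AdBot's exact code.

```python
import re

def url_words(url):
    """Split a URL into its alphanumeric words for a set-valued attribute."""
    return set(w.lower() for w in re.split(r"[^A-Za-z0-9]+", url) if w)

def encode_example(base_url, image_url, anchor_url, tags, width, height, is_ad):
    """Encode one image as an attribute dict (illustrative field names)."""
    return {
        "BASEURL": url_words(base_url),        # set-valued: page's URL words
        "IMAGEURL": url_words(image_url),      # set-valued: image URL words
        "ANCHORURL": url_words(anchor_url),    # set-valued: link target words
        "TAGS": set(t.lower() for t in tags),  # set-valued: nearby tag words
        "WIDTH": float(width),                 # real-valued
        "HEIGHT": float(height),               # real-valued
        "ASPECT": float(width) / float(height),
        "CLASS": "AD" if is_ad else "NON_AD",
    }
```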

3. Results

To conduct preliminary tests, we collected 501 example images from 53 different web pages. Of these 501 images, 87 were positive instances of banner ads. We trained RIPPER on 447 randomly selected examples of the 501 (with RIPPER's optimization flag set to 5) and produced the following rules:

AD :- ANCHORURL ~ com, HEIGHT>=55, BASEURL ~ www (26/3).
AD :- ANCHORURL ~ com, IMAGEURL ~ ads (21/4).
AD :- ANCHORURL ~ http, TAGS ~ height (4/0).
AD :- ANCHORURL ~ com, BASEURL ~ html (7/3).
AD :- TAGS ~ relocate (2/0).
AD :- ANCHORURL ~ netscape (3/0).
default NON_AD (366/8).
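An ordered rule set like the one above is applied top-down: the first rule whose conditions all hold determines the class, and the default fires when no rule matches. A minimal sketch of this application step, with our own rule encoding (RIPPER outputs rules as text, not code):

```python
def classify(example, rules, default="NON_AD"):
    """Apply an ordered rule list. Each rule is (label, conditions),
    where each condition is a predicate over the example; the first
    rule whose conditions all hold determines the class."""
    for label, conditions in rules:
        if all(cond(example) for cond in conditions):
            return label
    return default

# The third learned rule, AD :- ANCHORURL ~ http, TAGS ~ height,
# in this encoding (example attributes are sets of words):
rule3 = ("AD", [lambda ex: "http" in ex["ANCHORURL"],
                lambda ex: "height" in ex["TAGS"]])
```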

On the remaining 54 held-out examples, we found the error rate of the hypothesis to be 9.26% +/- 3.98%.
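The quoted interval is consistent with one sample standard error of a binomial proportion over the 54 test examples, assuming 5 of them were misclassified (9.26% is exactly 5/54; the n-1 denominator is our assumption about the computation):

```python
import math

errors, n = 5, 54                       # assumed: 5 misclassified of 54 held out
p = errors / n                          # test-set error rate
se = math.sqrt(p * (1 - p) / (n - 1))   # sample standard error of a proportion
print(f"{100 * p:.2f}% +/- {100 * se:.2f}%")  # prints "9.26% +/- 3.98%"
```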

4. Conclusions and Future Work

In this work, we have put forth a simple system for classifying images on the web based on the referring page and its HTML tag content. When applied to the realm of web advertisements, we found that the system did a reasonable job.

In classifying examples as ad or not-ad images, we found that the distinction between ads and not-ads is not always clear. For one, the context surrounding an image determines whether or not it's an ad. For instance, a Netscape logo on Netscape's corporate web site is probably not an ad, while the same logo on another web page might be an advertisement for that company's web browser product. For another, targeted advertising begins to blur the line between unwanted advertising and information of interest to the consumer. For example, an ad for a low price on a Compaq laptop may be of interest to the consumer shopping for a new computer.

A future project related to this work would be to explore further what regularities exist between web pages. In particular, it would be interesting to see whether other elements on a web page could be correctly identified automatically; for example, it might be nice to automatically detect links to pages with content similar to a given page's. Also, removal need not be the only available filtering action: highlighting a piece of text on a page, moving it to the top of the page, or automatically filing it into a database are also possibilities for a filter.


Corin Anderson |corin@cs.washington.edu
Tessa Lau |tlau@cs.washington.edu