Printing the Web

Jason Staczek

CSE 574

20 March 1998

Introduction

While information continues to multiply on the Web, it remains difficult to take interesting chunks away from a site, and more difficult still to produce attractive, portable printed forms of web information. Users have few choices beyond consuming information online, directly through the browser. Some tools do offer offline browsing, but they typically require taking an entire site tree or specifying retrieval options at the page level. From a user standpoint, neither method is satisfactory. Taking the entire tree offers little flexibility, retrieves too much information and captures many elements that are useful only in a browser and have no function on the printed page. Alternatively, page-by-page specification is tedious at best, and offers no guarantee of obtaining the full set of desired information. To mitigate this, some content creators provide packaged data options such as PDF documents. Unfortunately, users have no control over the content of these packages, and producing these portable, printable formats is a potentially serious maintenance burden for content creators.

One would like a method that allows a user (or a user's agent) to approach a site, extract some subset of the information at that site, then repackage it for offline consumption, preferably printing. We assume that once the desired hypertext has been retrieved, offline browsing comes essentially for free. Repackaging for print presents additional problems: the aesthetic principles driving printed page layout are quite different from those governing the screen.

Recent work on the declarative specification of web sites has opened some interesting opportunities for realizing this functionality. Systems such as Araneus [AGM97], WebOQL [AM98] and Strudel [FFK+97] implement navigation-based web site querying and data restructuring. In contrast to the tree or page-level data extraction options currently available, these systems operate at much finer granularities. While this granularity may provide the control necessary to implement a usable data selection and retrieval system, these systems offer too much flexibility for the typical user, and their raw interfaces are impenetrable. Further, the restructuring capabilities in these systems are targeted toward hypertext or graph forms; it's not clear that these languages include ordering constructs that allow a graph of data objects to be flattened into a natural, linear format.

This paper attempts to define the web printing problem, decompose it into sub-problems, define desiderata for a solution and, finally, outline a Strudel-based approach in progress. We begin with a brief example to illustrate the problem.

An Example

Your bus leaves in five minutes for a half hour ride, but you'd like to catch up on the Clinton story during the trip. Browsing to CNN.com reveals a top story page that contains, in addition to the text of the article:

the "All Politics" logo, with links to Home, News, Analysis, Community and CNN.com
a sidebar with teasers and four links to related stories
a sidebar with teasers and two links to press conference transcripts
a sidebar with a link to a chronology of the crisis
a sidebar with a link to the cast of characters
a sidebar with three links to poll results
a sidebar with four links to related stories in this week's Time magazine
a sidebar with a link to a "Lewinsky Crisis" message board
a sidebar naming the editorial staff on these articles
a sidebar with five ads
fifteen links to "other" news at the bottom of the page
links back to Archives, CQ News, Time on Politics, and Search
a "Feedback and Help" link
a copyright notice
a link to a "Terms and Conditions" page
a link to job openings at CNN
a link to more staff information

Embedded in the third paragraph of the top story's text is a callout titled "Also in this story" with six links to subheaded paragraphs further down the article, each separated by roughly four to ten sentences. The article also contains six inline photographs.

You have five minutes to select what you'd like to read, format it and print it. Given a print speed on the order of four pages per minute and allowing for download time, you have perhaps two minutes to locate the data and make your selection. While this time constraint may seem somewhat artificial, the process is naturally constrained by the reading time for the amount of data selected. In no case should it take longer to select the information than it would take to read it in the first place.

Assume you've made the following selection: top story, related stories, transcripts, chronology, cast of characters, poll results, the Time magazine articles and all photographs. These should be printed attractively, in a readable order, on as few pages as possible, with closely related content arranged together. Ads, web-only navigation structure and other CNN.com site-related links should be removed. Finally, a small summary index should be included to indicate what the printed package contains.

Today, there are only two ways to accomplish this, and neither is satisfactory. You can try printing the current page from the browser, selecting the "print all linked pages" option. This may cover the material you're interested in, but it will certainly include much that you're not. Moreover, the final output will simply be an image of each web page, with no concessions made to the paper medium. Otherwise, you can surf to each page of interest and print to an application like Clickbook that assembles pages into a finished book. This process will likely take longer than reading the material itself, and offers no guarantee that you have located all the desired material.

Problem decomposition

Using the previous example as a guide, we believe that the printing problem can be viewed as four distinct sub-problems: data selection, or expression of the information goal; data retrieval; data ordering and grouping; and formatting and indexing. This section describes each sub-problem in more detail.
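
To make the decomposition concrete, the four sub-problems can be imagined as stages of a pipeline over typed content "chunks". This is a minimal sketch under assumed data shapes; all names, fields and the toy data are hypothetical and not part of any system described here.

```python
# Hypothetical four-stage pipeline mirroring the sub-problems:
# selection -> retrieval -> ordering/grouping -> formatting.

def select(site, goal):
    """Data selection: keep chunks whose type matches the goal."""
    return [c for c in site if c["type"] in goal["types"]]

def retrieve(chunks):
    """Data retrieval: fetch content for each selected chunk (stubbed)."""
    return [dict(c, content=f"<{c['id']}>") for c in chunks]

def order_and_group(chunks):
    """Ordering and grouping: rank by importance, most important first."""
    return sorted(chunks, key=lambda c: c["importance"], reverse=True)

def format_pages(chunks, per_page=2):
    """Formatting: pack the ordered chunks onto fixed-size pages."""
    return [chunks[i:i + per_page] for i in range(0, len(chunks), per_page)]

site = [
    {"id": "story", "type": "article", "importance": 9},
    {"id": "ad1", "type": "ad", "importance": 0},
    {"id": "photo", "type": "photograph", "importance": 5},
    {"id": "poll", "type": "article", "importance": 3},
]
goal = {"types": {"article", "photograph"}}

pages = format_pages(order_and_group(retrieve(select(site, goal))))
```

Note that ads are rejected by type at the selection stage and never reach the printed pages, in the spirit of the CNN example.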

Data Selection and Information Goals

The current discussion takes a site-centric view of the selection problem. The selection mechanism could be extended to span multiple sites, if one is willing to confront the information integration problems this raises. Here we confine the problem to a single site, on the assumption that the site's creators have produced a coherent and largely non-overlapping body of related information linked in some logical structure.

We assume that a user or user's agent approaches a single web site with the intent to remove some interesting subset of information available at that site.

We consider the user case first. The user arrives at the site with an information goal in mind, and assumes that it can be met at the site. The user may have no knowledge of the actual breadth, depth or structure of the information that could satisfy the goal at the site. To allow the user to express the information goal, the site must expose a selection interface with the following attributes:

Compact: The selection interface should not require the transmittal of more information than would be required to satisfy the information goal itself.

Expressed semantically: The information goal is expressed semantically. The interface should not require the user to translate semantic intent into the site's particular syntactic realization.

Support hierarchical exposure of detail: The user may approach with a high-level goal, in which case they need not see all detail available at the site. Alternatively, they may approach with a very specific goal, in which case the hierarchy should support quick navigation to that end.

Support navigation at smaller than page granularities: The CNN example illustrates that pages contain much information that serves no purpose in satisfying information goals. The user must be able to select only what is required and not be forced to carry additional page baggage.

Expose data typing information: The CNN example illustrates that data may be requested or rejected on the basis of type, such as excluding advertisements or including photographs.

Support marking while browsing: In the interest of speed, we suggest that the user's process of determining how to satisfy the information goal be sufficient to specify the query.

Given an adequately typed semantic representation of site information, realizing such an interface is largely an HCI problem.
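
Several of these attributes, hierarchical exposure, sub-page granularity and marking while browsing, can be illustrated with a typed content tree in which browsing and marking together constitute the query. This is a sketch; the class, the type vocabulary and the toy page are all assumptions made for illustration.

```python
# Hypothetical typed content tree for a single page: nodes carry a
# content type, and marking nodes while browsing specifies the query.

class Node:
    def __init__(self, label, ctype, children=()):
        self.label, self.ctype = label, ctype
        self.children = list(children)
        self.marked = False

def mark(node, predicate):
    """Mark every node satisfying the predicate (e.g. a type filter)."""
    if predicate(node):
        node.marked = True
    for child in node.children:
        mark(child, predicate)

def selection(node):
    """Collect marked nodes in document order; this is the query result."""
    out = [node.label] if node.marked else []
    for child in node.children:
        out.extend(selection(child))
    return out

page = Node("top-story", "article", [
    Node("photo-1", "photograph"),
    Node("ad-1", "ad"),
    Node("sidebar-polls", "poll"),
])
mark(page, lambda n: n.ctype in {"article", "photograph"})
```

Because the tree exposes type information below page granularity, a single type filter selects the article and its photograph while leaving the ad and the poll sidebar behind.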

We consider the case of a user's agent attempting to satisfy an information goal on behalf of the user. We assume that the user arms an agent with a goal specifying the desired information, some upper and lower bounds on the amount of data to retrieve, and some bounds on the cost the user is willing to assume. The information portion of the goal could be expressed in general terms, or in language specific to the site, perhaps having been modified from a previous successful excursion to the site. To support this interaction, the site should present an agent interface with the following characteristics:

Support both semantic and syntactic views: The interface should allow the agent freedom in determining how to satisfy the goal. If a semantic approach fails to realize the data, the agent should be free to fall back on a syntactic approach, perhaps enumerating available data and evaluating content itself.

Support negotiation and reuse of goal satisfaction results: In order to take advantage of goal-seeking knowledge accumulated at a site, a site should be able to instruct an agent that it has recognized and fulfilled the presented goal and that it requires no further action on the agent's part. Likewise, if an agent resorts to its own techniques to satisfy the goal, it should be able to communicate these results to the site for reuse when a similar request is presented in the future.

Support cost-based negotiation: Ideally, the agent should be able to express, "Here's what I want. Give me all you can for $x."

Data Retrieval

Assuming that a selection has been made, the site must export an interface to allow retrieval of the selected information. Such an interface should include the following attributes:

Support transfer in chunks addressable in terms of the information goal, including semantic and type information: To facilitate packaging, the interface must separate and tag data according to the terms of the information goal.

Support optional inclusion of formatting data: The user may wish to retrieve raw data for simple repackaging on a low-capability device.

Optionally support metering capabilities: The user may wish to limit the cost associated with retrieving the selected data.

This work appears to be largely a matter of standards and software engineering.
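
The three attributes above can be sketched as a single retrieval call: chunks arrive tagged with goal terms and types, formatting data is optional, and transfer is metered against a budget. Every name, field and cost figure here is a hypothetical illustration, not a proposed wire format.

```python
# Hypothetical retrieval interface: tagged chunks, optional formatting,
# and a simple cost meter that stops transfer at the user's budget.

def retrieve_chunks(chunks, include_formatting=True, budget_cents=None):
    spent, out = 0, []
    for c in chunks:
        cost = c.get("cost_cents", 0)
        if budget_cents is not None and spent + cost > budget_cents:
            break                       # metering: respect the budget
        spent += cost
        item = {"goal_term": c["goal_term"], "type": c["type"],
                "content": c["content"]}
        if include_formatting and "style" in c:
            item["style"] = c["style"]  # formatting data is optional
        out.append(item)
    return out

chunks = [
    {"goal_term": "top story", "type": "article",
     "content": "text", "style": "site-css", "cost_cents": 10},
    {"goal_term": "photos", "type": "photograph",
     "content": "img", "cost_cents": 30},
    {"goal_term": "polls", "type": "poll",
     "content": "table", "cost_cents": 20},
]
got = retrieve_chunks(chunks, include_formatting=False, budget_cents=40)
```

With formatting suppressed and a 40-cent budget, only the first two chunks are transferred and no style data comes along, matching the low-capability-device case.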

Data ordering and grouping

Given a retrieved body of data at the client that satisfies the information goal, it remains to order and group the constituent pieces into a readable structure. For our purposes, we consider the problem of restructuring the data into a linear format that can be realized using common paper-based grouping features typical of newspapers or magazines, such as sidebars and continuations. Note that this ordering and grouping step is distinct from the physical page layout process. The goal here is to assemble a schematic ordering and grouping based on importance, with closely related information clustered together. For example, one possible result of the ordering could be an importance-ranked list with weighted links connecting related list entries. Any such ordering should have the following attributes:

Well-ordered: Assumes that the reader will process the information linearly. The information goal should be satisfiable most quickly by consuming the information in the resulting order.

Clustered: The ordering should include clustering information to allow the layout step to place closely related information together as space permits (in a sidebar, for example). Clustering directives may be constraint-based.

Concise: While we assume that approaching a single site mitigates information integration problems, the ordering step should still attempt to elide or otherwise mark duplicate or near-duplicate information.

It's not clear how such an ordering might be achieved. The design of an algorithm will depend on the meta-data supplied by the site with each chunk. There may also be an opportunity to exploit content-based linguistic approaches.
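
One simple realization of the importance-ranked list with weighted links mentioned above would rank chunks by importance and then greedily pull strongly linked chunks next to their anchors as sidebar candidates. This is only a sketch under assumed meta-data: the importance scores, link weights and threshold are all hypothetical.

```python
# Greedy ordering sketch: rank chunks by importance, then cluster
# strongly related chunks (link weight above a threshold) behind the
# more important member of the pair.

def order_and_cluster(chunks, links, threshold=0.5):
    """chunks: {id: importance}; links: {(a, b): relatedness weight}."""
    ranked = sorted(chunks, key=lambda c: chunks[c], reverse=True)
    placed, order = set(), []
    for c in ranked:
        if c in placed:
            continue
        order.append(c)
        placed.add(c)
        # pull strongly linked chunks in as a cluster (sidebar candidates)
        for d in ranked:
            if d not in placed and \
               links.get((c, d), links.get((d, c), 0)) >= threshold:
                order.append(d)
                placed.add(d)
    return order

chunks = {"story": 9, "chronology": 6, "transcript": 4, "poll": 2}
links = {("story", "transcript"): 0.8, ("chronology", "poll"): 0.1}
order = order_and_cluster(chunks, links)
```

Here the transcript jumps ahead of the chronology despite its lower importance, because its strong link to the top story marks it for clustering; the weak chronology-poll link is ignored.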

Formatting and indexing

Given an element ordering and grouping, it remains to produce a physical page layout on the desired medium. In addition to placing text on the page, this step will also produce any paper navigation elements such as indexes that require knowledge of final physical text location. Final output should have the following attributes:

Attractive: The output conforms to conventional print aesthetics with respect to line widths, type sizes, use of white space, application of emphasis, margins, etc.

Compact: The layout is formatted to take full advantage of the available media area, leaving as little waste as possible.

Paper-navigable: The layout takes advantage of typical paper-navigation elements such as running headers and footers, continuations, sidebars, tables of contents and indexes.

It may be possible to employ a constraint-based approach as in [BLM97] to realize this step.
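
The "compact" attribute alone has the flavor of a packing problem, which the following first-fit sketch illustrates. This deliberately ignores the ordering, clustering and aesthetic constraints a real layout engine (for example, a constraint-based one as in [BLM97]) would have to respect; the element sizes and page capacity are arbitrary illustrative numbers.

```python
# First-fit packing sketch of the "compact" attribute: place each
# sized element on the first page with enough free room, opening a
# new page only when none fits.

def paginate(elements, page_capacity):
    """elements: list of (name, size); returns pages as lists of names."""
    pages, free = [], []
    for name, size in elements:
        for i, room in enumerate(free):
            if size <= room:            # first page with enough room
                pages[i].append(name)
                free[i] -= size
                break
        else:                           # no page fits: open a new one
            pages.append([name])
            free.append(page_capacity - size)
    return pages

pages = paginate(
    [("story", 6), ("photos", 5), ("transcript", 4), ("poll", 3)],
    page_capacity=10)
```

Four elements totaling 18 units fit exactly into two 10-unit pages with 2 units of waste, though note that first-fit reorders content to fill holes, which is precisely the tension with the well-ordered attribute above.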

A Strudel-based Approach

Work-in-progress

Our current work attempts to use Strudel [FFK+97] to explore the solution space for this problem. Specifically, we believe that the Strudel model can provide an excellent platform for addressing the data selection and retrieval sub-problems. Strudel's semi-structured data model can naturally support navigation at small granularities, hierarchical exposure of detail and typing information. Further, the model may provide the ability to capture meta-data useful for solving the ordering and grouping sub-problem. The formatting and indexing problem is independent of the platform chosen for web site management.

Immediate work focuses on a browser for the data structures underlying a Strudel-based site. These are: the site schema [FFLS98b], a graphical abstraction of site structure as specified in a StruQL query; the site graph, a directed graph representation of the result of applying the query to a Strudel database; and the DataGuide [GW97], a "collapsed" version of the site graph. As of this writing, we have built a prototype schema browser that allows limited experimentation. We believe that the DataGuide is the most promising structure for presentation to the user. Unlike the site schema, it accurately reflects the exact contents of a site in a more compact form than the site graph. Since Strudel does not currently produce a DataGuide for the site graphs that it builds, this capability is being added to the browser.
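
The "collapsing" of a site graph into a DataGuide can be sketched as a subset construction over edge labels: every distinct set of nodes reachable by the same label sequence becomes a single DataGuide node. This follows the strong-DataGuide idea of [GW97], but the graph encoding and the toy site below are assumptions for illustration, not Strudel's actual representation.

```python
from collections import deque

# Sketch of DataGuide construction: determinize a rooted, edge-labeled
# site graph, encoded as {node: [(edge_label, target), ...]}.

def dataguide(graph, root):
    start = frozenset([root])
    guide, queue = {}, deque([start])
    while queue:
        src = queue.popleft()
        if src in guide:
            continue
        # group the outgoing edges of every node in the set by label
        by_label = {}
        for n in src:
            for label, target in graph.get(n, ()):
                by_label.setdefault(label, set()).add(target)
        guide[src] = {lbl: frozenset(ts) for lbl, ts in by_label.items()}
        queue.extend(guide[src].values())
    return guide

site_graph = {
    "root": [("story", "s1"), ("story", "s2")],
    "s1": [("photo", "p")],
    "s2": [("photo", "p")],
}
guide = dataguide(site_graph, "root")
```

The two story pages collapse into one DataGuide node, so a five-node site graph is summarized by three guide nodes, which is exactly the compactness that makes the DataGuide attractive for presentation to the user.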

We begin with the browser to assist in overcoming one of the two main barriers to experimentation, the lack of sites implemented using Strudel. First, we believe that the browser will enable us to debug and construct test sites suitable for experimentation. As a side effect, we believe that the browser will assist in implementing integrity constraint checking [FFLS98a], a feature critical for building printable sites. Second, we note that the lack of available wrappers makes site construction tedious at best. This will be addressed either through wrapper construction, or through a modification of Strudel to operate on SQL-accessible databases.

Given the ability to construct and browse test sites, the browser will be used for HCI experimentation to determine the suitability of the DataGuide for supporting the attributes listed in the Data Selection section. At this point, it's not clear how Strudel's model will need to be adapted to fully support the goals set forth for the human interface. For example, it may be necessary to allow additional annotations in the data graph and StruQL query to provide enough semantic information to guide a human user.

Future Work

The current work focuses exclusively on the human interface for the data selection sub-problem. Much of the problem remains open, including agent-based selection and the remaining data retrieval, ordering and grouping, and formatting and indexing sub-problems.

Related Work

At this time, we are not aware of other work specifically related to this topic. However, several commercial applications announced at Spring Seybold 1998 appear to be relevant to portions of the problem.

Zuno Limited's (www.zuno.com) Coyote and Digital Publisher products support digital libraries with agent-based searching and transactional and time-based charging models.

Pindar, Inc.'s (www.pindar.com) Active Print enables merchandise dealers to create custom catalogs by selecting product information, prices and pictures, and applying customized page layouts. Information selected from a database is sent to a page builder, which paginates it according to instructions supplied by the dealer. From the page builder, pages are sent to a local print site, where files are printed and finished to the dealer's specifications. The system is designed for bulk printing of highly structured catalog information.

Adobe Systems, Inc. (www.adobe.com) has announced the Acrobat-based Blue Delivery product designed for publishers to deliver highly formatted information directly to readers. Blue Delivery takes electronic print-ready pages and reformats them into landscape format for on-screen viewing.

Ncompass Labs, Inc. (www.ncompasslabs.com) offers an integrated Web publishing system, called Resolution. The system is database driven and relies on a template architecture to deliver customized Web pages on the fly. Page templates define the style, layout and behavior of Web pages and each template can vary to handle different browsers, operating systems, devices, etc. When a page is requested or delivered, it is dynamically assembled by the server by merging the content, shared elements and templates stored in the database. The server customizes pages based on the nature of the targeted recipient.

Conclusion

We defined the problem of printing from the web, describing it in terms of data selection, data retrieval, ordering and grouping, and formatting and indexing subproblems. For each subproblem we proposed a set of attributes describing a desirable solution. We proposed that Strudel provides an excellent platform for exploration of the solution space, and described preliminary work toward that end.

References 

[AGM97] Paolo Atzeni, Giansalvatore Mecca and Paolo Merialdo, To Weave the Web, Proceedings of the 1997 International Conference on Very Large Data Bases (VLDB '97).

[AM98] Gustavo O. Arocena, Alberto O. Mendelzon, WebOQL: Restructuring Documents, Databases and Webs, 1998 International Conference on Data Engineering (ICDE-98).

[BLM97] Alan Borning, Richard Lin, and Kim Marriott, Constraints and the Web, Proceedings of the 1997 ACM Multimedia Conference, pages 173-182.

[FFK+97] Mary Fernandez, Daniela Florescu, Jaewoo Kang, Alon Levy and Dan Suciu, Catching the Boat with Strudel: Experiences with a Web-Site Management System, Draft, 1997.

[FFLS98a] Mary Fernandez, Daniela Florescu, Alon Levy and Dan Suciu, Reasoning About Web-Site Structure, Draft, 1998.

[FFLS98b] Mary Fernandez, Daniela Florescu, Alon Levy and Dan Suciu, Warehousing and Incremental Evaluation for Web Site Management, Draft, 1998.

[GW97] Roy Goldman and Jennifer Widom, DataGuides: Enabling Query Formulation and Optimization in Semistructured Databases, Draft, 1997.