Text-Image Maps for Document Delivery on the Web

Jeremy Hylton
Library 2000
MIT Laboratory for Computer Science

This is a position paper written for the workshop on HTML+ at the First International Conference on the World-Wide Web.


Text-image maps are a valuable tool for coordinating the full text and page images of documents in electronic libraries. In different situations, either the text or the image is a more convenient representation -- and text-image maps make it easy to change between representations as the situation dictates.

The text-image map is intended to serve two purposes: a program can determine which page image and coordinate region of that image contains any word listed in the full text, and a program can identify from a coordinate region in a page image what text words lie in that region.

Though the Web is arguably the most widely used new information-distribution system in the Internet community, it does not provide enough functionality to exploit text-image maps fully. I propose several possibilities for extending the Web to better support these maps.

What are text-image maps?

The World-Wide Web is in fact becoming the protocol of choice for distribution of information on the Internet, but it lacks important features that are needed if that distribution is to include digital libraries. Creators of digital libraries are faced with the dilemma of having two different and important representations for the documents they provide: text and image. The Web provides support for either representation, but little support for any coordination between the two.

Coordination between text and image is a necessary feature of any document delivery system for a digital library. Imagine a user looking at a technical report on-line: he sees a reference that sounds interesting and marks it on the image. Now what? Perhaps he types the reference into his text editor and later looks it up in the library by hand, but the library system should be able to do that work for him -- it should take the reference he marked, find it in the library, and bring the document to his screen.

What is needed is a way to understand the mark on the image as a reference to the text representation of the page. Where the underlying library system can do little with a section of bitmap, it can treat a list of words as a bibliographic reference and look for the work referred to in its holdings, or in other libraries. In the Library 2000 system being developed, we use text-image maps to move quickly and easily between the two different representations. The map is nothing more than a list of words, and the coordinates of their location on the page. Given a pair of x,y coordinates, it requires a simple lookup to determine what word (if any) is at that location; given a word, the coordinates of its location on the page can be looked up.
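As an illustration, the lookup just described can be sketched in a few lines of Python. This is a hypothetical sketch; the class and method names are mine, not those of the Library 2000 implementation.

```python
# Illustrative sketch of a text-image map: a list of words paired with the
# coordinates of their locations on the page, supporting lookup in both
# directions. All names here are hypothetical.

class TextImageMap:
    def __init__(self):
        # Each entry: (word, (x0, y0, x1, y1)) bounding box on the page image.
        self.entries = []

    def add(self, word, bbox):
        self.entries.append((word, bbox))

    def word_at(self, x, y):
        """Given a pair of x,y coordinates, return the word (if any) there."""
        for word, (x0, y0, x1, y1) in self.entries:
            if x0 <= x <= x1 and y0 <= y <= y1:
                return word
        return None

    def locate(self, word):
        """Given a word, return the bounding boxes of its occurrences."""
        return [bbox for w, bbox in self.entries if w == word]
```

Given a map built this way, `word_at` answers the coordinates-to-word question and `locate` answers the word-to-coordinates question, the two purposes described above.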

Because the use of bitmap images as the user-level representation of a document has been relatively rare, existing tools lack some of the features necessary to use images and to coordinate them with text-image maps. First, optical character recognition programs, which are the primary means of generating text-image maps, tend to output only the words they recognize and not the locations of those words, even though they use the location information internally. Second, and more importantly, the Web doesn't provide the right user interface for exploiting the map. There is support for passing information about a single user click, but more of the functionality of a traditional point-and-click interface is needed: information about the kind of click -- single click, double click, left button, or right button, for example -- and the ability to send a path -- a bounding box, a circle, etc. -- instead of just a click.

The remainder of this paper discusses three separate areas: the creation of text-image maps, some applications of the maps and suggested interfaces, and details of WWW support for images.

Text and Image: Why and How

The use of images as one of two important storage formats gives rise to a common question: Why use images at all? The question seems natural coming from a community used to 100 megabyte hard drives and networks that realistically provide 100 kilobyte per second connections. The answer to the question comes from some important design considerations, discussed by Saltzer [4].

First, the limitations that suggest large image files are a problem change quickly with time. Disk space is becoming cheaper and more abundant, and high data communication speeds are being realized. Trends in disk drive technology suggest that 100 gigabytes of magnetic storage will cost $2,500 by the turn of the century.

Second, images provide a good archival format for long-term storage. Page images capture important details that text-only representations lack; these details range from typographic conventions to the placement of charts, figures, and graphics. Assuming that this non-textual information is important, bitmaps provide a simple representation that is likely to be understood in 75 years. Other formats, like SGML or PostScript, are more complex and less likely to be understood far in the future.

The use of images is also a practical consideration. Many of the world's documents do not exist in electronic form. The Library 2000 project plans to put all of MIT's computer science technical reports on-line, and for older reports the most efficient scheme is to scan them. We can produce text from the images using OCR software, but OCR tends to introduce many errors. Thus, the image remains the canonical form of the document and can compensate for errors in the OCR-produced text.

Given that bitmapped images are an important representation, coordinating the image with the text becomes essential. Two methods immediately suggest themselves for building text-image maps. One could either take the PostScript representation of the document and extract the text and position information from it, or use OCR software to convert the image representation to text. In practice, the first method is limited to documents that exist in PostScript form, and even then success is highly dependent on the program that generated the original PostScript. As a result, the use of OCR software is often necessary, and in any event more effective than using PostScript.

Translating from PostScript to text tagged with position information is a hit-or-miss affair. All of the tools that exist for this purpose depend on the program used to produce the PostScript. The most consistently successful tool is a utility distributed with Ghostscript, ps2ascii.ps. It works relatively well with PostScript produced by Microsoft Word, but does not understand PostScript produced by LaTeX.

The text-image maps currently used by Library 2000 were generated from documents stored in Xerox's XDOC format [2], a generalized page mark-up language. The XDOC data is provided by Cornell University's CS-TR project, and was generated from 600 dpi monochrome scans of Cornell's computer science technical reports.

The XDOC format is geared towards describing pages of text: it describes a page with information about the size of the fonts used and position information about each individual line of text, as the OCR package recognizes it. Within a line of text, it describes the words as plain ASCII text and gives the location and extent of whitespace. It is a straightforward task to determine the location of each word from this information.

A sample of the XDOC format is below.


The current implementation of text-image maps uses a Perl script to parse the XDOC source. The final lines of each XDOC file contain information about the size of the type used, and each individual line contains information about the extent of the text on that line. From this information, and the position of whitespace, a bounding box is generated for each word. The size of the bounding box is expanded to include a portion of the whitespace surrounding the word.
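The derivation of a word's box from line extents and whitespace can be sketched as follows. This is an illustrative Python sketch of the idea, not the actual Perl parser; the function name and argument layout are assumptions.

```python
# Derive a bounding box for each word on a line, given the line's horizontal
# and vertical extent and the x-extents of the whitespace runs between words.
# Each box is widened to the midpoint of the neighboring whitespace, mirroring
# the expansion into surrounding whitespace described in the text.

def line_word_boxes(x_start, x_end, y_top, y_bottom, words, gaps):
    """words: the words on the line, left to right.
    gaps: (start, end) x-extents of the whitespace runs between consecutive
    words, so len(gaps) == len(words) - 1."""
    boxes = []
    left = x_start
    for i, word in enumerate(words):
        if i < len(gaps):
            gap_start, gap_end = gaps[i]
            right = (gap_start + gap_end) / 2.0  # claim half the gap
        else:
            right = x_end  # last word runs to the end of the line
        boxes.append((word, (left, y_top, right, y_bottom)))
        left = right  # next word starts where this box ends
    return boxes
```

Boxes built this way tile the whole line, so every click that lands on the line falls inside exactly one word's box.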

Currently, the text-image maps are precomputed for the available XDOC pages. They are stored as lists of words and their bounding boxes, and tend to be 1 or 2 kilobytes smaller than the 5-10 KB XDOC files. The parser could easily be extended to do on-the-fly generation of maps from XDOC files.

Using Text-Image Maps

There are many applications of a text-image map. A prototype system has been implemented using 50 Cornell technical reports. The system uses text-image maps to link references in the reports to a card catalog lookup program and page numbers in the table of contents to the actual pages. The demonstration is here.

The prototype system both suggests the utility of text-image coordination and highlights the weaknesses of the Web in handling that coordination. The system uses GIF images to represent the pages; each image is inlined with an <IMG> tag carrying the ISMAP attribute and contained within an <A HREF> tag. The ISMAP feature allows users to click on a point in the image and send an HTTP request that includes the position of the click. The ISMAP feature is not part of the official HTML specification [1], but is widely used; it is described in the NCSA httpd server documentation [IM] and is included in the HTML+ specification [3].
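In the markup of the era, the arrangement just described might look like this (the file name and script path are illustrative, not those of the prototype):

```html
<!-- A page image inlined as an active ISMAP image inside an anchor.
     Clicking on the image sends the click coordinates, appended to the
     anchor's URL, to the server-side script. -->
<A HREF="/cgi-bin/pagemap"><IMG SRC="page-04.gif" ISMAP></A>
```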

When a user clicks on the image, a Perl script uses the x, y coordinates of the click and the text-image map to determine the exact word the user clicked on and all the words on the same line. It returns an HTML document which offers two options. Each option is useful, but the interface is clunky.

First, the program can pass the entire line of text to a program that treats the text as a bibliographic citation and looks it up in the card catalog. This option works so long as the reference is a single line of text. If a reference extends across several lines, then only part of the reference can be retrieved. If the reference uses only part of a line, then the extraneous words retrieved may affect the quality of the search. A better interface for this feature would be to draw a box around the reference with a click-and-drag, or to highlight from any arbitrary character to any other as if it were text in a word processor; current work supports neither of these interfaces.

Second, if the word is a number, it is treated as a page number and the user can jump to that page. The single click interface seems appropriate for this feature.

A further problem presents itself when we consider using these two features in tandem: When I click on an image, what do I intend that click to mean? Is it a click to look up a reference in the card catalog? Is it a click to jump to a different page? It seems that the ability to specify different kinds of clicks (single, double; left, right; normal, with the option-hyper-meta key down) is necessary.

Though the prototype system only implements two features, several other features seem useful and feasible.

Support Needed for Text-Image Maps

Support is needed on two fronts to make text-image maps a credible tool for distributing documents that are stored as both text and image. First, makers of OCR software need to provide information about the position of text on a page; currently, ScanWorX is the only major commercial package to provide such a feature.

The WWW protocol suite also needs to deal with images and imagemaps better. The prototype text-image map system suggests several kinds of support would be useful: the ability to specify a rectangular region of an image, rather than a single point; the ability to distinguish between different kinds of clicks -- single or double, left button or right button; and an extension to forms that lets the user specify the type of link to follow.

Implementation of the suggestion for specifying rectangular regions seems relatively straightforward. The current ISMAP feature provides for calls like
GET /image?0.5+0.7
In this message, the x coordinate is 0.5 and the y coordinate is 0.7. The format could be extended to include two pairs of x and y coordinates, one for the upper left corner of the box and another for the lower right. For example:
GET /image?0.3+0.4+0.7+0.5
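A server-side script could tell the two request forms apart simply by counting coordinates. Below is a minimal Python sketch of parsing such a query string; the function name is an assumption, and the four-coordinate form is of course only the proposal made here, not an existing feature.

```python
# Parse an ISMAP-style query string: "0.5+0.7" is a single click, and the
# proposed extension "0.3+0.4+0.7+0.5" is a click-and-drag box given as the
# upper-left pair followed by the lower-right pair.

def parse_ismap_query(query):
    coords = [float(c) for c in query.split("+")]
    if len(coords) == 2:
        return ("click", tuple(coords))
    if len(coords) == 4:
        return ("box", tuple(coords[:2]), tuple(coords[2:]))
    raise ValueError("expected 2 or 4 coordinates, got %d" % len(coords))
```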

Using click-and-drag boxes is common in many applications in use today, so extending browsers should be simple. There should be no need for changes to servers, which simply pass the coordinate arguments on to user-defined maps or programs; those programs would need to differentiate between single clicks and click-and-drag boxes.

The second and third suggestions are wider in scope, and require greater scrutiny before implementation. In effect, the second suggestion implements a kind of typing of links -- so that a left click specifies a different kind of link than a right click. It introduces many complications, such as whether typed links are allowed for both text and images, and whether multiple links of different types can be specified for the same location. The design also runs into strong and divided user preferences for different kinds of mice and the number of buttons they have. In general, deciding how many kinds of clicks should be allowed, and how to specify them when creating links, is hard.

The third suggestion, an extension to forms, is easier to implement but less functional. By providing a radio button to specify the type of link to follow, the discussion of how many buttons and what kinds of clicks is eliminated. But the cost of eliminating that discussion is two clicks for every action: one to specify the type and a second to specify the location.


I have proposed three areas of consideration for extending the Web to support text-image maps. The maps provide a valuable tool for using the Web to deliver images and text in an electronic library.

The suggestions vary in complexity, ranging from a simple extension of the existing ISMAP feature to an expansion of the concept of a link to include different types depending on the kind of mouse click issued by the user. Further discussion is needed in the WWW community, but the utility of some kind of extension seems to merit serious consideration.

Works Cited

[1] Berners-Lee, Tim. HyperText Markup Language Specification. IETF Internet Draft, version 1.2.

[2] Connelley, Daniel S. and Paddock, Beth. XDOC Data Format: Technical Specification. Xerox Imaging Systems part no. 00-07571-00.

[3] Raggett, Dave. HTML+ Hypertext markup format. Discussion document. 8 Nov. 1993.

[4] Saltzer, J. H., "Technology, Networks, and the Library of the Year 2000," in Future Tendencies in Computer Science, Control, and Applied Mathematics, Lecture Notes in Computer Science 653, edited by A. Bensoussan and J.-P. Verjus, Springer-Verlag, New York, 1992, pp. 51-67. (Proceedings of the International Conference on the Occasion of the 25th Anniversary of Institut National de Recherche en Informatique et Automatique (INRIA), Paris, France, December 1992.)


The author would like to thank Jerome Saltzer and Mitchell Charity for their support, input, and advice during the course of this work.

This work was supported in part by the IBM Corporation, in part by the Digital Equipment Corporation, and in part by the Corporation for National Research Initiatives, using funds from the Advanced Research Projects Agency of the United States Department of Defense under grant MDA972-92-J1029.

Jeremy Hylton, jeremy@the-tech.mit.edu
Last Update: 7 Sept 1994
