This is a position paper written for the workshop on HTML+ at the First International Conference on the World-Wide Web.
The text-image map is intended to serve two purposes: a program can determine which page image and coordinate region of that image contains any word listed in the full text, and a program can identify from a coordinate region in a page image what text words lie in that region.
Though the Web is arguably the most ubiquitous new information distribution system in use by the Internet community, it does not provide enough functionality to exploit text-image maps fully. I propose several possibilities for extending the Web to better support these maps.
Coordination between text and image is a necessary feature of any document delivery system for a digital library. Imagine a user looking at a technical report on-line: he sees a reference that sounds interesting and marks it on the image. Now what? Perhaps he types the reference into his text editor and later looks it up in the library by hand, but the library system should be able to do that work for him -- it should take the reference he marked, find it in the library, and bring the document to his screen.
What is needed is a way to understand the mark on the image as a reference to the text representation of the page. Where the underlying library system can do little with a section of bitmap, it can treat a list of words as a bibliographic reference and look for the work referred to in its holdings, or in other libraries. In the Library 2000 system being developed, we use text-image maps to move quickly and easily between the two different representations. The map is nothing more than a list of words, and the coordinates of their location on the page. Given a pair of x,y coordinates, it requires a simple lookup to determine what word (if any) is at that location; given a word, the coordinates of its location on the page can be looked up.
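For concreteness, here is a minimal sketch in Perl of both lookups. The entry format is an assumption made for the sake of the example -- one word and its bounding box per line -- and may differ from the format actually used by Library 2000:

#!/usr/bin/perl
# Sketch only: assumes a map file with one entry per line of the form
#   word x0 y0 x1 y1
# where (x0,y0) is the upper-left and (x1,y1) the lower-right corner of
# the word's bounding box.  The real map format may differ.

sub load_map {
    my ($file) = @_;
    my @map;
    open(MAP, $file) || die "cannot open $file: $!";
    while (<MAP>) {
        push(@map, [ split ]);
    }
    close(MAP);
    return @map;
}

# Given a click position, return the word (if any) whose box contains it.
sub word_at {
    my ($x, $y, @map) = @_;
    foreach my $entry (@map) {
        my ($word, $x0, $y0, $x1, $y1) = @$entry;
        return $word if $x >= $x0 && $x <= $x1 && $y >= $y0 && $y <= $y1;
    }
    return undef;
}

# Given a word, return the map entries (word plus box) for every occurrence.
sub occurrences_of {
    my ($target, @map) = @_;
    return grep { $$_[0] eq $target } @map;
}

A linear scan is adequate for a single page of a few hundred words; larger maps could be sorted by position or indexed by word.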
Because the use of bitmap images as the user-level representation of a document has been relatively rare, existing tools lack some of the features necessary to use images and to coordinate them with text-image maps. First, optical character recognition programs, which are the primary means of generating text-image maps, tend to output only the words they recognize and not the locations of those words, even though they use the location information internally. Second, and more importantly, the Web doesn't provide the right user interface for exploiting the map. There is support for passing information about a single user click, but more of the functionality of a traditional point-and-click interface is needed: information about the kind of click -- single click, double click, left button, or right button, for example -- as well as the ability to send a path -- a bounding box, a circle, etc. -- instead of just a point.
The remainder of this paper discusses three separate areas: the creation of text-image maps, some applications of the maps and suggested interfaces, and details of WWW support for images.
First, the limitations that make large image files seem a problem are changing quickly with time. Disk space is becoming cheaper and more abundant, and high data communication speeds are being realized. Trends in disk drive technology suggest that 100 gigabytes of magnetic storage will cost $2,500 by the turn of the century.
Second, images provide a good archival format for long-term storage. Page images capture important details that text-only representations lack; these details range from typographic conventions to the placement of charts, figures, and graphics. Assuming that this non-textual information is important, bitmaps provide a simple representation that is likely to be understood in 75 years. Other formats, like SGML or PostScript, are more complex and less likely to be understood far in the future.
The use of images is also a practical consideration. Many of the world's documents do not exist in electronic form. The Library 2000 project plans to put all of MIT's computer science technical reports on-line, and for older reports the most efficient scheme is to scan them. We can produce text from the images using OCR software, but OCR tends to introduce many errors. Thus, the image remains the canonical form of the document and can compensate for errors in the OCR-produced text.
Given that bitmapped images are an important representation, a means of coordinating the image with the text is needed. Two methods immediately suggest themselves for building text-image maps: one could either take the Postscript representation of the document and extract the text and position information from it, or use OCR software to convert the image representation to text. In practice, the first method is limited to documents that exist in Postscript form, and even then success is highly dependent on the program that generated the original Postscript. As a result, the use of OCR software is often necessary and, in any event, more effective than using Postscript.
Translating from Postscript to text tagged with position information is a hit-or-miss affair. All of the tools that exist for this purpose depend heavily on the program used to produce the Postscript. The most consistently successful tool is a utility distributed with Ghostscript, ps2ascii.ps. This tool works relatively well with Postscript produced by Microsoft Word, but does not understand LaTeX-produced Postscript.
The text-image maps currently used by Library 2000 were generated from documents stored in Xerox's Xdoc format [2], a generalized page mark-up language. The Xdoc data is provided by Cornell University's CS-TR project, and was generated from 600 dpi monochrome scans of Cornell's computer science technical reports.
The Xdoc format is geared towards describing pages of text: it describes a page with information about the size of the fonts used and position information about each individual line of text, as the OCR package recognizes it. Within a line of text, it describes the words as plain ASCII text and describes the location and extent of whitespace. It is a straightforward task to determine the location of each word from this information.
A sample of the Xdoc format is below.
[s;3;372;63;903;p;1;2]Notice[h;549;19]that[h;641;22]in[h;696;20]this[h;780;22]model[h;910;19]the[h;985;20]cost[h;1072;22]of[h;1128;19]sending[h;1280;21]a[h;1320;19]message[h;1483;19]from[h;1584;21]one[h;1664;21]vertex[h;1795;20]to[y;1855;5;903;2;S]
[s;3;372;1;976;p;1]another[h;509;18]is[h;551;19]proportional[h;791;20]to[h;845;18]the[h;918;18]shortest[h;1074;19]path[h;1173;19]between[h;1336;17]the[h;1408;18]two[h;1487;18]vertices,[h;1651;19]that[h;1741;19]is,[h;1797;19]to[y;1855;4;976;2;S]
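Purely as an illustration of the shape of that computation -- the field interpretations below are assumptions made for the sake of the example, not taken from the XDOC specification [2] -- a parser might walk such a line record as follows:

#!/usr/bin/perl
# Illustrative sketch only.  The field meanings assumed here are NOT from
# the XDOC specification [2]:
#   [s;..;X;..;Y;..]  assumed to begin a line of text whose first word
#                     starts at x coordinate X on baseline Y
#   word[h;X;W]       assumed to mean the word ends at x coordinate X and
#                     is followed by whitespace W units wide
#   [y;..]            assumed to end the line
while (<>) {
    while (/\[s;\d+;(\d+);\d+;(\d+)[^\]]*\](.*?)\[y;[^\]]*\]/g) {
        my ($x, $y, $body) = ($1, $2, $3);
        while ($body =~ /([^\[\]]+)\[h;(\d+);(\d+)\]/g) {
            my ($word, $end, $gap) = ($1, $2, $3);
            $word =~ s/\s+//g;                 # drop stray whitespace
            print "$word: x=$x..$end, y=$y\n"; # word and its x extent
            $x = $end + $gap;                  # next word starts after the gap
        }
        # (The final word before the [y;..] record is omitted in this sketch.)
    }
}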
Currently, the text-image maps are precomputed for available XDOC pages. They are stored as lists of words and their bounding boxes. The maps tend to be about 1 or 2 kilobytes smaller than the 5-10KB XDOC files. The parser could easily be extended to do on-the-fly generation of maps from XDOC files.
The prototype system both suggests the utility of text-image coordination and highlights the weakness of the Web in handling that coordination. The system uses GIF images to represent the pages; each page image is inlined with an <IMG> tag carrying the ISMAP attribute and enclosed in an <A HREF> anchor. The ISMAP feature allows users to click on a point in the image and send an HTTP request that includes the position of the click. The ISMAP feature is not a part of the official HTML specification [1], but is widely used; it is described in the NCSA httpd server documentation [IM] and is included in the HTML+ specification [3].
When a user clicks on the image, a Perl script uses the x, y coordinates of the click and the text-image map to determine the exact word the user clicked on and all the words on the same line. It returns an HTML document which offers two options. Each feature is useful, but the interface is clunky.
First, the program can pass the entire line of text to a program that treats the text as a bibliographic citation and looks it up in the card catalog. This option works so long as the reference is a single line of text. If a reference extends across several lines, then only part of the reference can be retrieved. If the reference uses only part of a line, then the extraneous words retrieved may degrade the quality of the search. A better interface for this feature would be to draw a box around the reference with a click-and-drag, or to highlight from any arbitrary character to any other as if it were text in a word processor; current work supports neither of these interfaces.
Second, if the word is a number, it is treated as a page number and the user can jump to that page. The single click interface seems appropriate for this feature.
A further problem presents itself when we consider using these two features in tandem: When I click on an image, what do I intend that click to mean? Is it a click to look up a reference in the card catalog? Is it a click to jump to a different page? It seems that the ability to specify different kinds of clicks (single, double; left, right; normal, with the option-hyper-meta key down) is necessary.
Though the prototype system only implements two features, several other features seem useful and feasible:
The WWW protocol suite also needs to deal with images and imagemaps better. The prototype text-image map system suggests several kinds of support would be useful:
A click on an inlined ISMAP image currently generates a request of the form:
GET /image?0.5+0.7
In this message, the x coordinate is 0.5 and the y coordinate is 0.7. The format could be extended to include two pairs of x and y coordinates, one for the upper left corner of the box and another for the lower right. For example:
GET /image?0.3+0.4+0.7+0.5
Using drag-and-click boxes is common in many applications in use today, so extending browsers should be simple. Servers should need no changes, since they simply pass the coordinate arguments on to user-defined maps or programs; those programs would need to differentiate between single clicks and drag-and-click boxes.
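As a rough sketch, and assuming the coordinate string reaches a server-side script through the QUERY_STRING environment variable (as it does for CGI programs), the differentiation could be as simple as counting the coordinates:

#!/usr/bin/perl
# Sketch only: assumes the browser appends either "x+y" (a single click)
# or "x0+y0+x1+y1" (a dragged box) to the URL, and that the string after
# the "?" is delivered to this script in QUERY_STRING.
@coords = split(/\+/, $ENV{'QUERY_STRING'});

print "Content-type: text/html\n\n";

if (@coords == 2) {
    ($x, $y) = @coords;
    # Single click: look up the one word at (x, y) in the text-image map.
    print "<P>Single click at ($x, $y)\n";
} elsif (@coords == 4) {
    ($x0, $y0, $x1, $y1) = @coords;
    # Dragged box: collect every word whose bounding box lies inside it,
    # e.g. a bibliographic reference spanning several lines.
    print "<P>Region from ($x0, $y0) to ($x1, $y1)\n";
} else {
    print "<P>Malformed coordinate list\n";
}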
The second and third suggestions are wider in scope, and require greater scrutiny before implementation. In effect, the second suggestion implements a kind of typing of links -- so that a left click specifies a different kind of link than a right click. It introduces many complications, such as whether typed links are allowed for both text and images and whether multiple links of different types can be specified for the same location. The design also runs into strong and divided user preferences for different kinds of mice and the number of buttons they have. In general, deciding how many kinds of clicks should be allowed and how to specify them when creating links is hard.
The third suggestion, an extension to forms, is an easier to implement but less functional solution. Providing a radiobutton to specify the type of link to follow eliminates the discussion of how many buttons and what kinds of clicks to support. But the cost of eliminating that discussion is two clicks for every action: one to specify the type and a second to specify the location.
The suggestions vary in complexity, ranging from a simple extension of the existing ISMAP feature to an expansion of the concept of a link to include different types depending on the kind of mouse click issued by the user. Further discussion is needed in the WWW community, but the utility of some kind of extension seems to merit serious consideration.
[2] Connelley, Daniel S. and Paddock, Beth. XDOC Data Format: Technical Specification. Xerox Imaging Systems part no. 00-07571-00.
[3] Raggett, Dave. HTML+ Hypertext markup format. Discussion document. 8 Nov. 1993.
[4] Saltzer, J. H., "Technology, Networks, and the Library of the Year 2000," in Future Tendencies in Computer Science, Control, and Applied Mathematics, Lecture Notes in Computer Science 653, edited by A. Bensoussan and J.-P. Verjus, Springer-Verlag, New York, 1992, pp. 51-67. (Proceedings of the International Conference on the Occasion of the 25th Anniversary of Institut National de Recherche en Informatique et Automatique (INRIA), Paris, France, December, 1992.)
This work was supported in part by the IBM Corporation, in part by the Digital Equipment Corporation, and in part by the Corporation for National Research Initiatives, using funds from the Advanced Research Projects Agency of the United States Department of Defense under grant MDA972-92-J1029.