An Interchange Standard and System for Browsing Digital Documents
M.Eng. Thesis Proposal
Andrew Kass
Thesis Supervisor: Prof. Jerome Saltzer
December 11, 1994
1.0 Introduction
As libraries become increasingly computerized, not just card catalogs but entire books will be stored on-line. With the advent of fast global digital communication networks, information will also increasingly be delivered in electronic form. For immediate transfer of data, a plethora of methods, formats and protocols may be used, both for the transfer and for the content itself. However, for long-term archival of documents and for transfer between different systems, a standard way of representing a document is needed. Documents also carry information that is not necessarily reflected in their content, such as copyright, publisher, and ISBN. Even when this information is present, it needs to be available in an architected way for display. This "meta-information" needs to travel with the content of the document, regardless of the format the content is in.
By defining a digital document and creating a standard for its representation, it is possible to solve all of these problems. By allowing the content to be in any format, but having the digital document describe the format, it is possible to use any format, present or future, for the content itself. By creating a standard way to describe the document's meta-information and content format, it is possible to archive documents for long periods of time without having to worry about which format they are stored in. Finally, documents can be stored and delivered in multiple formats, making it possible to view both the scanned images and the ASCII text of a document.
2.0 Digital Documents
In order to create on-line documents, we need to define what a digital document is. In my model, a digital document is data in arbitrary format accompanied by meta-information which is part of a document, but not necessarily contained in the document itself. In order to easily transfer digital documents, I am creating a set of MIME types, which describe the format of a document and contain the meta-information, while still allowing arbitrary data formats.
2.1 Digiments
What exactly do we mean by a document? A document can be a book. It can be a magazine or periodical. It can be a bibliography of a technical field. A technical report is a document. Instruction manuals, TV listings, newspapers, rulebooks, and catalogs can all be considered documents. What do we mean by a digital document, or a document stored on a computer? A digital document can be composed of a collection of images in GIF, JPEG, TIFF or other format. It can be a Postscript, LaTeX or ASCII text file or set of files. It can be a set of images, the ASCII text, and a text to image map which relates the two. It can include sound. In short, there is no easy definition for a digital document. Nor do digital documents correspond well to what we call documents in the physical world.
For these reasons, we need to create a new type of object, called a digiment, for digital document. Defining a digiment will allow us to achieve some of the goals expressed above. It will provide a standard way to pass around digital documents, regardless of the format in which the content itself is stored. It will provide a standard for archiving and retrieving digital documents. It will provide for the use of multiple representations of a single "document" that can be linked together, as with a text to image map. It is intended to accommodate the use of future data formats transparently. And it will provide meta-information about a document, such as author, publisher, and copyrights. For comparison, in a book, the text on the pages is the content, while the meta-information, such as ISBN, copyright and publisher, is found on the inside of the cover page.
The digiment standard is not a specific data format for storing or transmitting digital documents. Rather, it is a container for transmitting or storing documents. A digiment consists of the data itself, which can be in any form, and a Page Of Record, or POR. The POR contains the information which differentiates a digiment from a simple set of images or data files. The POR specifies the format which the data itself is stored in, whether it is image, text, Postscript, etc. The POR also contains meta-information about the digiment, such as the author, publisher, copyrights, distribution rights, and page maps. By specifying the digiment in this way, it is possible to use any type of format for storing the actual data, including formats which may be created in the future.
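The container described above can be pictured as a small data structure. The following is only a minimal sketch: the class names, field names, and sample values are hypothetical, chosen to mirror the items listed for the POR, and are not part of any digiment specification.

```python
from dataclasses import dataclass, field


@dataclass
class PageOfRecord:
    """Hypothetical POR: names the data format and carries meta-information."""
    data_format: str                      # e.g. "image/tiff" or "text/plain"
    author: str = ""
    publisher: str = ""
    copyright: str = ""
    distribution_rights: str = ""
    # Maps a stored image number to the page label printed in the original.
    page_map: dict[int, str] = field(default_factory=dict)


@dataclass
class Digiment:
    """A container: the POR plus content in whatever format the POR names."""
    por: PageOfRecord
    data: list[bytes]                     # one entry per stored page or file


# Placeholder values for illustration only.
report = Digiment(
    por=PageOfRecord(
        data_format="image/tiff",
        author="A. Author",
        publisher="Example Press",
        copyright="Copyright 1994 Example Press",
        page_map={1: "i", 2: "ii", 3: "1"},
    ),
    data=[b"<page 1>", b"<page 2>", b"<page 3>"],
)

# A reader learns the format from the POR rather than by inspecting the bytes.
print(report.por.data_format, len(report.data), report.por.page_map[3])
```

Because the content is held as opaque bytes and the format is named only in the POR, a new data format would require no change to the container itself.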
2.2 MIME types
In order to transmit digiments, a new MIME type needs to be created. MIME stands for Multipurpose Internet Mail Extensions, and was originally developed to allow Internet email messages to contain images, sounds, applications and other data types besides plain text. However, MIME types can be used to describe the content of any type of data transfer, and they are increasingly used when transferring data over the Internet, as with HTTP. This allows a program, such as Mosaic, to receive arbitrary data and then determine how to interpret the data based on the MIME type.
By creating a digiment MIME type, we can separate the concept of a digiment from the actual representation which the data uses. The digiment MIME type contains a description of the format which the actual data uses, other formats which are available, how the formats are linked, and all the other meta-information which is contained in the POR.
MIME types consist of a type and subtype pair, in the form "type/subtype". By defining a type "digiment", we are able to create multiple subtypes which describe the content of the digiment. This allows a digiment browser to specify which data formats it can display, as well as which format it prefers. For example, these are possible digiment MIME types:
MIME type/subtype    Description
digiment/image       Digiment stored with pages as bitmapped images
digiment/pdf         Digiment stored in Portable Document Format
digiment/text        Digiment stored in textual representation
digiment/map         Digiment stored as a map between different forms

A given MIME type/subtype specifies the format of the data which is being transferred. By specifying a data transfer as "digiment/subtype", a client would know to launch a digiment-aware browser which could then interpret the digiment as a true document, instead of simply a set of pages of text.
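To illustrate how a browser might state which formats it can display and which it prefers, here is a minimal sketch of format selection over "digiment/subtype" names. The function names and the preference-list convention are assumptions made for this example; the proposal does not define a negotiation procedure.

```python
def parse_mime_type(mime_type):
    """Split "type/subtype" into a lowercase (type, subtype) pair."""
    main, sep, sub = mime_type.strip().lower().partition("/")
    if not sep or not main or not sub:
        raise ValueError(f"not a type/subtype pair: {mime_type!r}")
    return main, sub


def choose_format(available, preferred):
    """Return the first preferred digiment subtype that is on offer.

    available: MIME types offered for one digiment.
    preferred: subtypes the browser can display, most preferred first
               (a hypothetical convention).
    """
    offered = set()
    for mime_type in available:
        main, sub = parse_mime_type(mime_type)
        if main == "digiment":
            offered.add(sub)
    for sub in preferred:
        if sub in offered:
            return f"digiment/{sub}"
    return None


offers = ["digiment/image", "digiment/text", "digiment/map"]
print(choose_format(offers, ["pdf", "text", "image"]))   # digiment/text
```

Here the browser would rather have PDF, but since only image, text, and map forms are offered, it falls back to its second choice.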
3.0 Digiment Browser
In order to read digiments, one must have a digiment browser. The digiment browser accepts the digiment MIME types and displays the digiments on the user's workstation. A digiment browser must be able to accept digiments in any format and either display the digiment itself, or use another application to display the digiment. In addition to displaying the content of the digiment, the browser uses the meta-information contained in the digiment. This may include showing copyright information with the content, mapping image numbers to physical pages for scanned documents, providing references to other digiments, or mapping the text of a digiment to page images.
A typical scenario for the use of a digiment browser is as a helper application for Mosaic. When Mosaic (or any other WWW browser) receives data with a digiment MIME type, it passes the data stream to the digiment browser, which can then display and manipulate the digiment. A good digiment browser should be optimized for display speed and ease of use, and should support the full digiment MIME specification.
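The helper-application hand-off can be sketched as a dispatch on the major MIME type. This is an illustration only: both handler functions are hypothetical stand-ins, and a real WWW browser would launch an external program rather than call a function.

```python
def show_in_digiment_browser(data):
    # Stand-in for handing the stream to the digiment browser (hypothetical).
    return f"digiment browser received {len(data)} bytes"


def show_inline(data):
    # Stand-in for the WWW browser's own display path (hypothetical).
    return f"displayed inline ({len(data)} bytes)"


def dispatch(content_type, data):
    """Route a data stream by the major type of its MIME type."""
    major = content_type.split("/", 1)[0].strip().lower()
    if major == "digiment":
        return show_in_digiment_browser(data)
    return show_inline(data)


print(dispatch("digiment/image", b"\x00" * 1024))
print(dispatch("text/html", b"<p>hello</p>"))
```

Routing on the major type alone means that a digiment subtype added later would still reach the digiment browser without any change to the WWW browser's configuration.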
4.0 Research Goals
My research has two main goals: defining the minimum specification for a digital document, and creating a system for transferring and displaying such documents. This involves defining what a digiment is and how to transfer one, and creating a browser capable of using digiments. This research is designed to produce both a digiment MIME type and a usable digiment browser. Throughout this research, I will follow two design criteria.
1. The system is designed to be widely and cheaply distributed in a short time frame. The browser will be made available at the conclusion of the research. This system is intended to be a standard for digital document use.
2. The system will be extensible to future data types and meta-information. My thesis will describe version 1.0 of the digiment MIME type, which will be the current definition of a digital document. However, the standard will be written in such a way that it will be possible to create newer versions of the MIME type which incorporate additional data and types, transparently to existing servers and clients.
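One way the second criterion could be met is for a version 1.0 reader to keep the fields it understands and skip any it does not. The sketch below assumes a POR expressed as named fields with a "major.minor" version string; the field names and the rule of rejecting only a higher major version are assumptions for illustration, not part of the proposed standard.

```python
# Fields a hypothetical version-1 reader understands.
KNOWN_FIELDS = {"version", "format", "author", "publisher", "copyright"}


def read_por(fields, supported_major=1):
    """Keep the fields a version-1 reader understands; skip the rest.

    fields: mapping of POR field names to string values (hypothetical layout).
    Returns (known_fields, sorted_names_of_ignored_fields).
    """
    major = int(fields.get("version", "1.0").split(".")[0])
    if major > supported_major:
        raise ValueError(f"unsupported digiment version {fields['version']}")
    known = {k: v for k, v in fields.items() if k in KNOWN_FIELDS}
    ignored = sorted(k for k in fields if k not in KNOWN_FIELDS)
    return known, ignored


# A later minor version adds a field that a 1.0 reader has never seen.
newer = {"version": "1.1", "format": "image/tiff", "author": "A. Author",
         "audio-track": "narration.au"}
known, ignored = read_por(newer)
print(known["format"], ignored)
```

Under this assumption an older client can still display a newer digiment, losing only the additions it does not recognize.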
I am also looking at two other projects which may be of interest. The RightPages project at AT&T Bell Labs is an image-based electronic library. As such, it stores books as electronic documents, and has its own browser as part of the system. The Virtual Paper project at DEC Systems Research Center is an attempt to create a paperless office. Its goal is to make browsing electronic documents preferable to reading them on paper. It also has its own browser, Lectern. No papers have been published on this project yet. Both of these projects use proprietary data storage formats for their documents.
By choosing a good definition of the digiment MIME type and implementing a simple browser which uses it, I hope to accomplish my goals in accordance with both design criteria.
5.0 References
[1] Saltzer, J. H., "Technology, Networks and the Library of the Year 2000," Proceedings of the International Conference on the Occasion of the 25th Anniversary of Institut National de Recherche en Informatique et Automatique, Paris, France, December 1992.
[2] Borenstein, N., and Freed, N., "Multipurpose Internet Mail Extensions Part One", RFC 1521, Bellcore Network Working Group, September 1993.
[3] Story, G., O'Gorman, L., et al., "The RightPages Image Based Electronic Library for Alerting and Browsing", Computer, September 1992, pp. 17-25.
[4] Hoffman, M., O'Gorman, L., et al., "The RightPages Service: An Image-Based Electronic Library", Journal of the American Society for Information Science, September 1993, pp. 446-452.