Chapter 2

Related Work

This chapter describes prior and ongoing work that is related to the transmission and viewing of digital documents.

2.1 Existing Research on Digital Documents

In recent years, the use of digital documents has started to become practical. With high-bandwidth networks and a proliferation of workstations that are able to view digital documents, research into both delivery and viewing of digital documents has increased dramatically. This research can be classified into two broad categories: electronic document delivery systems and digital document systems.

However, virtually all of this work is focused on directly mapping current definitions of documents into the digital domain. Most systems are only concerned with replacing traditional paper objects with an electronic version. Most document delivery systems are concerned with transmitting a single version of a document to a remote location, to either replace or supplement a paper copy. Most digital document systems are focused on optimizing different parts of the browsing process in order to simulate reading a traditional paper document. With the exception of System 33, one of several systems described below, none of these systems has really attempted to use the new paradigms that digital documents can now embrace, such as separating the semantic content of a document from its representation.

2.2 Document Delivery Systems

Electronic document delivery systems are designed to transfer electronic versions of documents from one computer to another. However, most of them simply attempt to replace the delivery of paper documents with electronic versions, instead of using any expanded notion of what documents are.

2.2.1 GEDI

The Group on Electronic Document Interchange is a consortium of seven libraries and other parties who have defined a standard for electronic document interchange, called the GEDI Recommendations [2]. The goal of this group was to create a standard way to distribute electronic documents between participants. Most of their work went towards defining standard delivery protocols and mechanisms, as well as choosing a standard representation for the document. A GEDI document is a series of 300 DPI TIFF images, using CCITT G4 compression, along with some header information describing the document.

However, the GEDI system was only designed to deliver a single representation of a document at a time. Moreover, it was not designed to create a definition for a digital document, but simply to designate common protocols to transmit some version of a document from one party to another.

2.2.2 TULIP

TULIP (The University LIcensing Project) is a joint project between Elsevier Science Publishing Group and several universities [8]. TULIP supplements paper subscriptions to journals with electronic versions of the journals. Like GEDI, the TULIP project is not attempting to create a new digital document system, but to create an electronic version of existing publications, specifically research journals. Because of this goal, the TULIP data structures are geared towards representing journals, and not documents in general. In fact, the TULIP project does not even have the notion of a document, but of a dataset, which can contain one or more journals. Although each journal and article has bibliographic content associated with it, the structural content described is limited to journals, articles and pages. Because of these design restrictions, TULIP cannot be easily generalized to a generic digital document system.

2.3 Digital Document Systems

The other broad category of research covers entire systems for browsing digital documents. However, most of these systems are primarily concerned with optimizing parts of the browsing process. This leads to the development of proprietary document formats that allow only specific representations, optimized for properties such as display speed or resolution independence. In addition, virtually all the systems are designed with the goal of replacing traditional paper documents, and only support scanned images and text. In contrast, a digiment is meant to be a generic form of a digital document that can contain any type of data, including proprietary formats that have been optimized for some property, as well as other types such as video or audio.

2.3.1 System 33

System 33 is a system developed at Xerox PARC to provide electronic document services [10]. Of all the systems surveyed, System 33 seemed to incorporate more of the concepts that were used in the digiment format than any other system. System 33 was designed to differentiate the abstract notion of a document from the actual representations of the document, which is one of the key concepts that drove the digiment format. In System 33, users can request different "presentations" of a document, each of which is a specific representation of the document. The server will then convert the document to the correct format on the fly.

However, System 33 does not create an actual object to represent a document. Documents are accessed by means of a handle; by presenting the handle to the server, a client can retrieve various information about a document, such as the available presentations or a description of the document. By presenting the handle along with presentation-specific information, a client can retrieve the location of the particular presentation. The difference between this and a digiment is that a digiment is an object that contains all of this information, and can be returned to a client as a single object. In addition, the digiment contains structural information about the document, which System 33 provides only at a rudimentary level.

2.3.2 RightPages

The RightPages project is an electronic library prototype developed by AT&T Bell Laboratories [6,12]. The main focus of the project is to allow users to browse through an electronic database of journals and view the articles on-line. This system allows users to view scanned images of the journals' contents, as well as text that has been derived from Optical Character Recognition software.

However, this system is limited to only two representations of the documents: a bitmap image and the text content. In addition, the system keeps a mapping of "logical fields" in the document, such as the location of the Table of Contents entries. Because this system was designed only to view scanned images, it was not designed to be extensible to arbitrary data formats.

2.3.3 Lectern

Lectern is a document browser that is being developed as part of the Virtual Paper project at DEC Systems Research Center [7]. The primary goal of this project is to make a document browser good enough that people prefer it to paper documents. Lectern documents are stored in two representations: a proprietary compressed image format that has been optimized for speed, and the text of the document, with word positioning information with respect to the images. The system also includes a simple "property list" ability; documents are allowed to have arrays of (name, value) pairs of strings.

Like the RightPages system, this system focuses on replacing traditional paper documents. As such, the Lectern viewer only works with images in its proprietary format and its associated text version. This system was not designed to be a generic digital document transmission and browsing system, but to develop a specific format that could be used for certain types of documents.

2.4 MIME

MIME, or Multipurpose Internet Mail Extensions, was developed to allow the sending of data objects in arbitrary formats through standard Internet email. The current standard for Internet email, which is defined by RFC822 [5], does not make provisions for 8-bit data, lines longer than 79 characters, or multiple objects within a single message. The MIME format, which is defined in RFC1521 [1] and RFC1522, was designed to circumvent these limitations.

One of the main purposes of MIME, defining an encoding for binary data into 7-bit lines of 79 or fewer characters, is not very useful for most non-email network connections, which place no restrictions on the type of data that is being transferred. However, the MIME standard does contain two features that make it useful for many non-email based applications.
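The binary-to-text encoding mentioned above can be illustrated with a minimal sketch. It assumes base64, one of the encodings MIME defines, and uses Python's standard `base64` module, which wraps its output at 76 characters per line and so stays within the line limit cited above. The function name `encode_7bit` is hypothetical and chosen only for this illustration.

```python
import base64


def encode_7bit(data: bytes) -> str:
    """Encode arbitrary 8-bit data as short lines of 7-bit ASCII text.

    Hypothetical helper for illustration: base64.encodebytes inserts a
    newline after every 76 output characters.
    """
    return base64.encodebytes(data).decode("ascii")


def decode_7bit(text: str) -> bytes:
    """Recover the original bytes from the encoded text."""
    return base64.decodebytes(text.encode("ascii"))


if __name__ == "__main__":
    payload = bytes(range(256))  # every possible 8-bit value
    encoded = encode_7bit(payload)
    print(max(len(line) for line in encoded.splitlines()))
    print(decode_7bit(encoded) == payload)
```

The round trip shows why the encoding matters only for transports such as email: the data survives unchanged, but at the cost of roughly a third more bytes on the wire.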

The first feature is the ability to label data objects with types and IDs. The type label allows a program to manipulate an object without having to actually understand the format itself; the program keeps a list of "helper applications" that can understand different types. When the program receives a data object that it does not know how to handle itself, it can look at the object's type and determine which helper application understands that type, so it can pass the object to that application. MIME type labels are currently the method of choice for determining the type of the data in a WWW transaction, for example. The ID label allows data objects to refer to each other by a unique name, so objects can build relationships among themselves.
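The helper-application lookup described above can be sketched in a few lines. This is only an illustration under stated assumptions: the helper names in `HELPERS` and the function `choose_helper` are hypothetical, and the sketch uses Python's standard `email` package simply to read the type label from the object's headers.

```python
from email import message_from_string

# Hypothetical table mapping MIME types to helper applications.
HELPERS = {
    "image/tiff": "tiff-viewer",
    "application/postscript": "postscript-viewer",
}


def choose_helper(raw_object, helpers=HELPERS):
    """Return the helper registered for the object's MIME type, or None.

    The program never inspects the body; it reads only the type label.
    """
    msg = message_from_string(raw_object)
    return helpers.get(msg.get_content_type())


if __name__ == "__main__":
    raw = (
        "Content-Type: image/tiff\n"
        "Content-ID: <page1@example>\n"
        "\n"
        "(image data would follow here)\n"
    )
    print(choose_helper(raw))
```

Supporting a new data type then requires only a new entry in the table, not a change to the program that receives the object.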

The second feature is the ability to embed multiple objects within one "multipart" object. This allows multiple, independent objects to travel together as a single larger object.
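As a rough sketch of the multipart mechanism, the following example bundles a text part and an image part into one object and then lists the types of the parts after parsing it back. It relies on Python's standard `email` package; the function names `bundle` and `part_types` and the sample data are assumptions made for this illustration.

```python
from email import message_from_bytes
from email.mime.image import MIMEImage
from email.mime.multipart import MIMEMultipart
from email.mime.text import MIMEText


def bundle(text, image_bytes):
    """Pack two independent objects into a single multipart object."""
    outer = MIMEMultipart("mixed")
    outer.attach(MIMEText(text, "plain"))
    # The subtype is given explicitly, so the image data is not inspected.
    outer.attach(MIMEImage(image_bytes, _subtype="tiff"))
    return outer.as_bytes()


def part_types(raw):
    """Parse a multipart object and return the type label of each part."""
    msg = message_from_bytes(raw)
    return [part.get_content_type() for part in msg.get_payload()]


if __name__ == "__main__":
    raw = bundle("Text of page 1", b"II*\x00 placeholder image bytes")
    print(part_types(raw))
```

Each embedded part keeps its own type label, so a receiver can still dispatch the parts individually after unpacking the outer object.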

The digiment format uses both MIME labels and the MIME multipart mechanism. By defining a new MIME type for a digiment, existing MIME-compliant software will be able to treat a digiment as simply another MIME object. Such software can then hand a digiment over to a digiment browser without any modifications; all that is needed is to add a mapping from the digiment MIME type to the digiment browser. The digiment format also uses the MIME multipart mechanism to embed many different types of data within a single digiment object. Finally, by using the MIME ID label, the individual data parts within the digiment can refer to each other.
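The combination of a dedicated type, multipart embedding, and ID labels can be sketched as follows. This is not the actual digiment format: the subtype `x-digiment`, the `X-Derived-From` header, the Content-ID values, and the function names are all hypothetical placeholders, used only to show how parts inside one object might refer to each other by ID.

```python
from email import message_from_bytes
from email.mime.image import MIMEImage
from email.mime.multipart import MIMEMultipart
from email.mime.text import MIMEText

DIGIMENT_SUBTYPE = "x-digiment"  # hypothetical subtype, not the real one


def build_digiment(page_image, page_text):
    """Build one multipart object holding two representations of a page."""
    doc = MIMEMultipart(DIGIMENT_SUBTYPE)
    image = MIMEImage(page_image, _subtype="tiff")
    image["Content-ID"] = "<page1-image@example>"
    text = MIMEText(page_text, "plain")
    text["Content-ID"] = "<page1-text@example>"
    # Hypothetical header: the text part names the image it came from.
    text["X-Derived-From"] = "<page1-image@example>"
    doc.attach(image)
    doc.attach(text)
    return doc.as_bytes()


def find_part(raw, content_id):
    """Return the part carrying the given Content-ID, or None."""
    for part in message_from_bytes(raw).walk():
        if part["Content-ID"] == content_id:
            return part
    return None


if __name__ == "__main__":
    raw = build_digiment(b"II*\x00 placeholder", "Text of page 1")
    text = find_part(raw, "<page1-text@example>")
    source = find_part(raw, text["X-Derived-From"])
    print(source.get_content_type())
```

Because the outer object carries its own type label, software that does not understand the contents can still route the whole object to a program that does.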

2.5 OpenDoc and OLE

OpenDoc [9] and OLE [3] are both relatively recent systems that are aimed at defining compound documents at the system level. These systems allow the creation of documents that can contain multiple data parts of any type. Data parts can even be programs themselves. Such systems build upon object-oriented programming techniques to define all data as objects, which can have attributes and other objects embedded within them. For example, a word processing document can have a spreadsheet embedded within it; clicking on the spreadsheet will automatically invoke the user's spreadsheet program to edit it. This allows documents to actually contain other data types, instead of just taking a "snapshot" of another format at a specific time and converting it to the document's format.

However, these systems are simply frameworks. They do not specify any structure for the data itself, simply a way for other programs to do this. A digiment could be created as a special form of an OpenDoc or OLE document.