Chapter 3

What is a Digiment?

This chapter describes the driving motivations behind creating a digital document, and particularly the digiment format. It also explores the differences between the digiment concept and the traditional concept of a digital document. Finally, it discusses some of the requirements a digiment must meet in order to accomplish these goals.

3.1 The Library 2000 Model

To understand how a digiment is used and what it is for, it is helpful to understand the Library 2000 document model. In the Library 2000 system, documents are stored on servers called repositories. Clients who wish to access a document obtain a handle, which points to a specific document stored on some repository. A handle is an identifier that will uniquely specify a document when presented to a repository. Handles can be obtained from a variety of sources, including other clients, cross references from other digiments, or a discovery service, the equivalent of a library card catalog.

A typical scenario for viewing a document may start with a user contacting a discovery service to search for an article on distributed systems. The discovery service returns a list of handles to documents, along with the repositories which store the documents. The user chooses a handle and then presents it to the repository with the intent of looking at the document. At this point, however, what does the repository return? It could return an arbitrary representation of the document. It could return a list of representations available. Perhaps it requires the client to request a specific action when presenting a handle to the repository. However, the most useful thing would be to return some object that completely described the document which is stored on the repository. This is where the digiment becomes useful.

3.2 The Digiment Concept

The idea of a digiment grew mainly out of attempting to figure out how to request digital documents from on-line repositories, how to incorporate the information obtained while scanning and processing a document, and, more to the point, what digital documents were in the first place. In designing a distributed digital library system, we were going to be storing digital documents in order to deliver this information to clients. However, this requires some definition of what a digital document really is, and how to deliver this information in the best way.

3.2.1 Original motivations

When developing the idea of a digiment, there were three main motivations. These ideas drove the concept of creating the digiment format and are essential to its design.

The first motivation came from attempting to answer the question of what a digital repository should return when a client asks it for a document. A repository may have the document in multiple formats; should it require that the client specify a particular format to use? If the document is available as scanned images or in another page-based form, should it return the first page, require the client to specify a page, or return something else? Should the repository return bibliographic information or a list of formats? What if accessing the document will result in a charge for the client? What if this repository only contained the Japanese version of a document; should the user have to attempt to read the document before realizing that he needs to look elsewhere? Even more to the point, the creation of Uniform Resource Names (URNs) is supposed to create a universally unique namespace for electronically addressable data which can be resolved into a pointer to the data. What part of the document would a URN for a digital document point to?

Clearly, there is a need to define some sort of "default" return type when requesting information about a document from a repository. This type should contain information about the different formats in which the document is available, provide bibliographic information about the document, explain how to access the different versions of the document that are available and what the access charges are, and furnish any other information that would allow the client and/or user to make intelligent decisions about whether and how to view this document.
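As a rough illustration of such a "default" return type, the sketch below models a repository that answers a handle with a description of the document rather than with the document data. All class, field and handle names here are hypothetical and chosen only for this example; they are not the digiment format itself.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Representation:
    mime_type: str        # e.g. "image/tiff" or "application/postscript"
    language: str         # language of this version, e.g. "en" or "ja"
    access_charge: float  # charge for retrieving it; 0.0 means free


@dataclass
class Description:
    bibliographic: Dict[str, str]
    representations: List[Representation] = field(default_factory=list)


class Repository:
    """Toy in-memory repository mapping handles to descriptions."""

    def __init__(self) -> None:
        self._store: Dict[str, Description] = {}

    def add(self, handle: str, description: Description) -> None:
        self._store[handle] = description

    def resolve(self, handle: str) -> Description:
        # Returns the description, never the document data itself.
        return self._store[handle]


def free_versions(description: Description, language: str) -> List[Representation]:
    """Representations in the wanted language that cost nothing to access."""
    return [r for r in description.representations
            if r.language == language and r.access_charge == 0.0]


# Placeholder data, invented for the example.
repo = Repository()
repo.add("handle-0001", Description(
    bibliographic={"title": "Placeholder Title"},
    representations=[
        Representation("image/tiff", "ja", 0.0),
        Representation("application/postscript", "en", 0.0),
        Representation("application/pdf", "en", 1.5),
    ],
))
choices = free_versions(repo.resolve("handle-0001"), "en")
```

With a description like this in hand, a client can see that only a Japanese scan is available, or that a version carries a charge, before it transfers anything.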

The second motivation comes from looking at the same question in the specific case of scanned images. What if a client wants to transfer an entire document from a repository in image form? How can this be accomplished when each page is stored as a separate file? One possibility is for the client to discover the number and names of all the files that make up a copy of the document and retrieve them individually. Existing packaging solutions, such as multipart TIFF images, work only for that specific image format. Defining a compound document that contains each of the individual images would be a much better solution.

Finally, when scanning documents into a computer, we captured a lot of information that could be useful, most notably page maps. Each scanned image was associated with a page number assigned by a human operator, attempting to capture the original page numbering scheme as completely as possible. Furthermore, when processing the original scanned images into different formats for viewing and printing, we know which images in different formats were derived from the same original page, and therefore are alternate versions of the same page. Both these types of information could be very useful to a browser attempting to display a digital document.

3.2.2 What are documents?

When trying to think about digital documents, it helps to define what a traditional document is and how a digital document differs from that definition. This is not such an easy task, however. If you ask five different people what a document is, you will get five different answers.

What are some examples of documents? Certainly a letter, a memo, a report or even a thesis are all types of documents. The definition starts to get blurry when talking about such things as books, magazines, and brochures. But what about things like post cards, instruction manuals, catalogs, flyers and other printed material? By virtue of containing printed information, they can all be considered documents. But what about a book on tape? If the book can be considered a document, is the tape of someone reading that book also a document?

An even worse ambiguity of the term is seen when trying to determine if two documents are the same. Would a Japanese language version of this paper be considered the same document? Most people would say no, but the actual information is identical; only the representation is different. What about a fancily formatted version of this paper versus a printout of just the plain text? Would it still be the same document? Now even the representation is the same, but the layout is different. Is the version of this paper as I originally typed it the same as the version I saved after I ran a spelling checker on it? They are simply different revisions of the same paper.

The problem that we run into is that the definition of a document is closely linked to its physical representation. We are all familiar with the printed page as conveying information, so we can use the term document for most forms of printed material. So when looking at two versions of the same paper that are in different languages, you can clearly see that they are not the same and, therefore, must be different documents! However, the real value of a document is the information that it conveys. Thus, it is better to think about a document in the abstract form of its contents, instead of the traditional physical object that we look at and call a document.

In order to try to avoid the ambiguities and differing meanings associated with the term document, I have adopted the word digiment. This allows me to define a new concept that is not loaded with preconceptions, such as the idea of a document.

3.2.3 What might a digiment be?

The idea of a digiment differs significantly from that of a traditional printed document in that it defines a document as the abstract information that it contains rather than the actual physical object. Thus, multiple representations of the same data could be considered to be part of the same digiment. This allows an extraordinary amount of flexibility in creating digiments with different concepts of "sameness". One repository may consider different representations of the same document to be separate digiments. It would store one digiment for the scanned images of a document, and another digiment for a Postscript version of the same document. Another repository might consider all representations of the same document to be part of the same digiment. It would have one digiment that contains both scanned images and Postscript. This idea is not just limited to representations. For example, one version of a digiment may contain Japanese, English and Korean versions of a text. However, this could also be broken down into three separate digiments. Depending on whether you consider an object in different languages to be the "same", you can either get a digiment with multiple languages, or a digiment with just the language you want.

Digiments also do not limit the type of data they contain to just texts. A digiment can contain video clips, sound clips, images, text, or anything else. These objects can either be differing representations of the same thing, such as high and low resolution images of the same page, or they can be complementary, such as a book about animals with sound clips of each animal.

3.3 The Digiment Concept Versus the All-in-one Concept

There has been a lot of work recently on trying to create a method for transferring digital documents. Adobe has created the Acrobat Portable Document Format, or Acrobat PDF, which attempts to capture enough information about a document such that it can be viewed or printed even without the application that created it. Apple is attempting to integrate a similar system into its MacOS operating system with QuickDraw GX. No Hands Software created Common Ground, yet another portable document format. However, all of these formats have one thing in common: they define a new, proprietary format and require that all the data that they contain be converted to that format. These formats are what I describe as the all-in-one format: they attempt to define their own formats for every type of data they contain and store all of their contents in these formats.

However, this approach is not really any different from any application defining its own storage format. The only difference between a PDF and the format in which WordPerfect stores its documents, for example, is that the PDF format is designed to hold more types of data, since it must be a storage format for more types of programs than just a word processor. In addition, it may contain system resources that may not be available on other systems, such as fonts. But the basic concept is still essentially the same: define a monolithic storage format that can be used by a single, large application that is capable of understanding the entire format.

This approach has two intrinsic drawbacks, however. The first problem is that it requires any program that can read the format to understand each of the data types and be able to work with them. As the format expands to include more types of data, understanding the data types becomes increasingly complex. The program must know how to work with or display an ever increasing number of different data formats.

The second problem is that the entire format must be modified when adding new data types. This leads to a release of a new version of the Acrobat format whenever a new feature or type needs to be added. This new format in turn requires that new Acrobat browsers and creators be released. Everyone knows the frustration of trying to read a Word 5.1 document, when all you have is the program Word 4.0. The only way around this problem is to try to shoehorn a new data type into an existing data type definition, which usually results in loss of some important information.

The digiment format takes a different approach. Instead of attempting to define a single format for every type of data, it is more like an envelope into which you can place arbitrary data types while preserving relationships among them. This approach is more like the one taken by component object systems, such as OpenDoc and OLE 2.0. These schemes also use a "container" based approach; instead of defining a format that must be able to represent each of the data types within it, they define a container object that has data of arbitrary types within. Thus, an application that is very specialized can still open the container and work with only the data objects with types that it understands.

The digiment format solves the two problems that occur with all-in-one formats. First, a digiment browser does not need to understand the individual formats contained within the digiment. Since all the data types are standard, if it doesn't know how to handle a particular data type, it can pass that data object to another program that does. This only requires that each data object has an easily identifiable type associated with it, which can be accomplished by using MIME types (see "What is MIME?" on page 33). In fact, one of the advantages of having multiple representations of a document is for people who may not have the software to view one representation, but can recognize a different one.

Secondly, new types can be easily added to a digiment without requiring any changes to the standard or to the browsing software. All that is required is for the browser to add a new entry to a table containing a list of data types and actions to perform on each one. For example, if you wanted to browse a document that contained an MPEG movie, instead of having to get a version of a digiment browser that knows about MPEG movies, you could simply use any software that supports MPEG and tell the browser to use that player whenever it sees an MPEG object.
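The table of data types and actions can be pictured as a simple lookup keyed by MIME type. The following is a minimal sketch, assuming each action is a function that receives the raw data; the handler names and the "mpeg_player" program name are invented for illustration, and a real browser would launch the external program instead of returning a string.

```python
from typing import Callable, Dict, Optional

# An action takes the raw bytes of a data object and reports what it did.
Handler = Callable[[bytes], str]


def show_text(data: bytes) -> str:
    """Built-in action for plain text."""
    return "text viewer: " + data.decode("ascii", errors="replace")


def launch_external(program: str) -> Handler:
    """Build an action that hands the data to an external program.

    This sketch only reports what it would do; it does not start a process.
    """
    def handler(data: bytes) -> str:
        return f"{program}: {len(data)} bytes"
    return handler


# The browser's table: MIME type -> action to perform.
handlers: Dict[str, Handler] = {"text/plain": show_text}


def dispatch(mime_type: str, data: bytes) -> Optional[str]:
    """Run the action registered for mime_type, or return None if unknown."""
    handler = handlers.get(mime_type)
    if handler is None:
        return None  # unknown type: the browser can skip this object
    return handler(data)


# Supporting MPEG movies is one new table entry, not a new browser release.
handlers["video/mpeg"] = launch_external("mpeg_player")
```

An object whose type has no entry is simply skipped, so the browser can still present the parts of the digiment that it does understand.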

To illustrate the difference between the two, a digiment can contain a PDF object within it. This can be viewed by any client that can read the PDF format. However, the same digiment can also contain the original word processing, spreadsheet and database files that created the PDF, as well as voice annotations and a Postscript version of the PDF.

3.4 Requirements for a Digiment

There are many reasons why having the concept of a digiment as outlined above is useful. The following is a list of some of the more important reasons that were key criteria in the development of an actual digiment type.

3.4.1 Documents come in multiple formats

Digital documents tend to come in a variety of different formats. Older documents that have been scanned may be available in a high resolution greyscale or color format suitable for archiving and preserving the maximum amount of information about the original. These original scans may also be processed into high resolution monochrome images suitable for printing on laser printers. They may also be made available in lower resolution images suitable for viewing on a workstation screen. Thumbnail images of multiple pages may be available to quickly select a desired page. OCR software may be used on the images to produce a plaintext version of the document.

Even newer documents, which may already be available in electronic form, can have different formats. A Postscript document may be available in two different flavors; one that is a complete file ready to print, and another version with the fonts not included in order to reduce transmission time for those users who already have the correct fonts. Additionally, a PDF file may be available. Finally, individual page images may be generated from the Postscript original for people to view who do not want to transfer the entire Postscript file at once or who do not have a Postscript viewer.

A digiment format should be able to handle documents that have parts available in multiple representations.

3.4.2 Documents have different parts in different formats

A similar problem occurs when some parts are available in certain formats while others are not. For example, a scanned document may be available in greyscale, but with an additional color image for page 12 since that page has a color picture on it. Additionally, some parts of the document may be in formats that cannot translate into alternate formats. For example, a digital document may contain a sound clip. The sound would not be available in any other format, and is not meant to be an alternative to another format, but an addition. A digiment format should allow documents to have multiple parts that may be displayed simultaneously.

3.4.3 There are relationships among different parts

Some parts in a digital document may be related to other parts. For example, a text-to-image map may relate plaintext to a specific sequence of bitmapped images. The text-to-image map needs to be able to point at both the plaintext and the specific images. A page-map may contain a mapping from a specific image to the page number that the author intended. When different formats are available, the relationship information is very important, in that it allows a client to know which parts are substitutable, or even desired, in place of a part in another format. A digiment should be able to reference each part uniquely and keep track of the relationships among them.
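To make these relationships concrete, the sketch below keeps a page-map and a record of which processed images were derived from which original scan, and uses them to find substitutable parts. The part identifiers and both table names are hypothetical; how a digiment actually encodes this information is not specified here.

```python
from typing import Dict, List, Optional

# Page-map: original scanned image -> page number as printed in the document.
page_map: Dict[str, str] = {"img-001": "i", "img-002": "ii", "img-003": "1"}

# Derivation record: processed image -> the original scan it came from.
derived_from: Dict[str, str] = {
    "gif-001": "img-001",
    "gif-003": "img-003",
    "g4-003": "img-003",
}


def original_of(part_id: str) -> str:
    """Return the original scan for a part (an original maps to itself)."""
    return derived_from.get(part_id, part_id)


def alternates(part_id: str) -> List[str]:
    """Other parts showing the same page, and therefore substitutable."""
    root = original_of(part_id)
    group = {root} | {p for p, src in derived_from.items() if src == root}
    return sorted(group - {part_id})


def page_label(part_id: str) -> Optional[str]:
    """Printed page number for any part, looked up through its original."""
    return page_map.get(original_of(part_id))
```

Because every part has a unique identifier, a browser holding one version of a page can ask for another version of the same page, or show the printed page number, without inspecting the image data.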

3.4.4 Bibliographic information needs to travel with the document

Not all of the information about a document is contained within the body of the document itself. For example, the name of the author, the copyright holder, the distribution rights, and even the title of the document are important pieces of information that may not be contained within the body itself. The information found inside the cover page of a book is an example of bibliographic information about the book. Even if this information is included in the body of the document, it is usually extremely difficult to extract from Postscript, scanned images or even ASCII text, as the location is not standard, nor is the format of the information.

By defining a specific format for bibliographic information and including this as part of the document, a digiment can provide a way to extract information about a document in an architected manner.

3.4.5 We want to refer to the content of a digiment by reference only

Not everyone who requests a digital document will necessarily want to transfer the entire document in every format. For example, someone without a Postscript interpreter would have no need for a Postscript version of the document if all they wished to do was view it somehow. If a specific page is available in both greyscale and color versions, the user has no need for the greyscale version if he wishes to view the color version, and has no need for the color version if he does not have a color monitor! Therefore, instead of requiring that the digiment contain the actual data of the document, which may come in many formats and thus take an extremely large amount of bandwidth to transfer, the digiment should contain the data by reference. This allows a client to obtain everything it needs to know about a document without the overhead of transferring the data itself, and to choose exactly what it wants to transfer and display, and when. This can allow a client to be much more efficient in terms of caching and bandwidth utilization.
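The client-side decision described above might look like the following sketch, in which the digiment carries only small reference records and the client picks the one part worth fetching. The `PartRef` fields, the selection policy (prefer color when it can be displayed, then the smaller transfer) and the example sizes are all assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import List, Optional, Set


@dataclass(frozen=True)
class PartRef:
    page: int
    mime_type: str
    color: bool
    size_bytes: int
    location: str  # opaque reference to the data, not the data itself


def choose_part(refs: List[PartRef], page: int, supported_types: Set[str],
                has_color_monitor: bool) -> Optional[PartRef]:
    """Pick the single part to fetch for a page, or None if none is usable."""
    candidates = [
        r for r in refs
        if r.page == page
        and r.mime_type in supported_types
        and (has_color_monitor or not r.color)
    ]
    if not candidates:
        return None
    # Prefer color when it can be displayed, then the smaller transfer.
    return min(candidates, key=lambda r: (not r.color, r.size_bytes))


# Invented example: page 12 exists in greyscale, color and Postscript.
refs = [
    PartRef(12, "image/tiff", False, 400_000, "ref-grey-12"),
    PartRef(12, "image/tiff", True, 1_200_000, "ref-color-12"),
    PartRef(12, "application/postscript", False, 90_000, "ref-ps-12"),
]
```

Only the chosen reference is then resolved into a transfer, so a client without a Postscript interpreter or a color monitor never pays for data it cannot use.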

3.5 How to Create a Digiment

Once we have laid out the basic requirements for a digiment, we need to define a new type of digital object, which is the actual digiment. This object should be structured in such a way that all of the relevant information about a document discussed above can be easily stored and accessed. The object should be modular, permitting easy addition of new formats and structural information. It should be easy to parse and easily recognizable as a digiment. Furthermore, it should not restrict the types of data which it can contain.

I have chosen a format for digiments which relies heavily on MIME, or Multipurpose Internet Mail Extensions. MIME provides two very useful concepts: the ability to embed sub-objects within objects, and the ability to associate a data type with each object. Because a digiment is built out of multiple sub-objects, one can be created simply by selecting a legal set of these sub-objects. New formats can be added by incorporating more sub-objects. In addition, since the digiment format uses MIME types, it is easy to add arbitrary data types, as long as they are tagged with a valid MIME type. The digiment itself is simply a new MIME type, which can be passed around like any other digital object.
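The two MIME concepts can be seen in a few lines using the `email` package from Python's standard library. This is only a sketch of MIME nesting and typing: the subtype "x-digiment-sketch" is made up for the example and is not the digiment MIME type, and the part contents are placeholders.

```python
from email import message_from_string
from email.mime.application import MIMEApplication
from email.mime.multipart import MIMEMultipart
from email.mime.text import MIMEText

# Outer container object. The subtype is a placeholder, not the real type.
container = MIMEMultipart("x-digiment-sketch")

# Each sub-object carries its own MIME type.
container.attach(MIMEText("Title: Placeholder Title\n", "plain"))
container.attach(MIMEApplication(b"%!PS-Adobe-3.0\n", "postscript"))

# The container serializes to text and can be passed around as one object.
wire_form = container.as_string()

# A receiver parses it back and reads the type of every sub-object.
parsed = message_from_string(wire_form)
part_types = [part.get_content_type() for part in parsed.get_payload()]
```

A receiver that does not recognize one of the listed types can still open the container and work with the sub-objects it does understand.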

The actual objects which make up a digiment are described in Chapter 4.