A typical scenario for viewing a document may start with a user contacting a discovery service to search for an article on distributed systems. The discovery service returns a list of handles to documents, along with the repositories which store the documents. The user chooses a handle and then presents it to the repository with the intent of looking at the document. At this point, however, what does the repository return? It could return an arbitrary representation of the document. It could return a list of representations available. Perhaps it requires the client to request a specific action when presenting a handle to the repository. However, the most useful thing would be to return some object that completely described the document which is stored on the repository. This is where the digiment becomes useful.
The first motivation came from attempting to answer the question of what a digital repository should return when a client asks it for a document. A repository may have the document in multiple formats; should it require that the client specify a particular format to use? If the document is available as scanned images or in another page-based form, should it return the first page, require the client to specify a page, or return something else? Should the repository return bibliographic information or a list of formats? What if accessing the document will result in a charge for the client? What if this repository contains only the Japanese version of a document; should the user have to attempt to read the document before realizing that he needs to look elsewhere? Even more to the point, Uniform Resource Names (URNs) are supposed to create a universally unique namespace for electronically addressable data, in which each name can be resolved into a pointer to the data. What part of the document would a URN for a digital document point to?
Clearly, there is a need to define some sort of "default" return type when requesting information about a document from a repository. This type should contain information about the different formats in which the document is available, provide bibliographic information about the document, explain how to access the different versions of the document that are available and what the access charges are, and furnish any other information that would allow the client and/or user to make intelligent decisions about whether and how to view this document.
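As a rough illustration of what such a "default" return type might carry, the following Python sketch models a description object with bibliographic information, a list of available representations, and access charges. All class names, field names, and sample values here are assumptions made for illustration; they are not the digiment format itself.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Representation:
    """One available form of the document (field names are illustrative)."""
    mime_type: str                   # e.g. "application/postscript"
    language: str                    # e.g. "en" or "ja"
    access_hint: str                 # how a client would ask for this form
    charge: Optional[float] = None   # access charge; None means free


@dataclass
class DocumentDescription:
    """Hypothetical 'default' object a repository returns for a handle."""
    handle: str
    bibliographic: dict
    representations: list = field(default_factory=list)

    def choices_for(self, language, max_charge=0.0):
        """Return the representations in a language that fit the budget."""
        return [
            r for r in self.representations
            if r.language == language and (r.charge or 0.0) <= max_charge
        ]


# Sample data, invented for this sketch.
desc = DocumentDescription(
    handle="example-handle-1",
    bibliographic={"title": "An Example Report", "author": "A. Author"},
    representations=[
        Representation("image/tiff", "ja", "page images"),
        Representation("application/postscript", "en", "whole file", charge=2.0),
        Representation("text/plain", "en", "whole file"),
    ],
)

# The client can decide what to fetch without touching the document data.
free_english = desc.choices_for("en")
print([r.mime_type for r in free_english])
```

With a description like this in hand, a client can tell before transferring anything that only the plain text version is both in English and free of charge.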
The second motivation comes from looking at the same question in the specific case of scanned images. What if a client wants to transfer an entire document from a repository in image form? How can this be accomplished when each page is stored as a separate file? One possibility is that the client must discover the number and names of all the files that make up a copy of the document and retrieve them individually. Existing ways of bundling pages into a single file, such as multipart TIFF images, work only for that specific image format. Defining a compound document that contains each of the individual images would be a much better solution.
Finally, when scanning documents into a computer, we captured a lot of information that could be useful, most notably page maps. Each scanned image was associated with a page number assigned by a human operator, attempting to capture the original page numbering scheme as completely as possible. Furthermore, when processing the original scanned images into different formats for viewing and printing, we know which images in different formats were derived from the same original page, and therefore are alternate versions of the same page. Both these types of information could be very useful to a browser attempting to display a digital document.
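To make these two kinds of information concrete, here is a minimal Python sketch of a page map and a table of alternate versions. The file names, page labels, and record layout are invented for the example and are not taken from the actual scanning system.

```python
from collections import defaultdict

# Each record: (file name, MIME type, scan sequence number, printed page label).
# The label is what the human operator assigned, e.g. roman numerals for
# front matter. All values below are made up for illustration.
scans = [
    ("p0001.tif", "image/tiff", 1, "i"),
    ("p0001.gif", "image/gif", 1, "i"),
    ("p0002.tif", "image/tiff", 2, "ii"),
    ("p0002.gif", "image/gif", 2, "ii"),
    ("p0003.tif", "image/tiff", 3, "1"),
    ("p0003.gif", "image/gif", 3, "1"),
]

page_labels = {}                 # page map: printed label -> scan sequence number
alternates = defaultdict(dict)   # sequence number -> {MIME type: file name}

for name, mime, seq, label in scans:
    page_labels[label] = seq
    alternates[seq][mime] = name


def find_page(label, preferred_mime):
    """Look up the file holding a printed page in the preferred format."""
    seq = page_labels[label]
    return alternates[seq][preferred_mime]


print(find_page("1", "image/gif"))
```

A browser holding both tables can jump to the page the reader knows as "1" even though it is the third scanned image, and can switch formats without losing its place.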
What are some examples of documents? Certainly a letter, a memo, a report or even a thesis are all types of documents. The definition starts to get blurry when talking about such things as books, magazines, and brochures. But what about things like postcards, instruction manuals, catalogs, flyers and other printed material? By virtue of containing printed information, they can all be considered documents. But what about a book on tape? If the book can be considered a document, is the tape of someone reading that book also a document?
An even worse ambiguity of the term is seen when trying to determine if two documents are the same. Would a Japanese language version of this paper be considered the same document? Most people would say no, but the actual information is identical; only the representation is different. What about a fancy formatted version of this paper versus a printout of just the plain text? Would it still be the same document? Now even the representation is the same, but the layout is different. Is the version of this paper as I originally typed it the same as the version I saved after I ran a spelling checker on it? They are simply different revisions of the same paper.
The problem that we run into is that the definition of a document is closely linked to its physical representation. We are all familiar with the printed page as conveying information, so we can use the term document for most forms of printed material. So when looking at two versions of the same paper that are in different languages, you can clearly see that they are not the same and, therefore, must be different documents! However, the real value of a document is the information that it conveys. Thus, it is better to think about a document in the abstract form of its contents, instead of the traditional physical object that we look at and call a document.
To avoid the ambiguities and differing meanings associated with the term document, I have adopted the word digiment. This allows me to define a new concept that is not loaded with the preconceptions attached to the idea of a document.
Digiments also do not limit the type of data they contain to just text. A digiment can contain video clips, sound clips, images, text, or anything else. These objects can either be differing representations of the same thing, such as high and low resolution images of the same page, or they can be complementary, such as a book about animals with sound clips of each animal.
However, this approach is not really any different from any application defining its own storage format. The only difference between a PDF and the format in which WordPerfect stores its documents, for example, is that the PDF format is designed to hold more types of data, since it must be a storage format for more types of programs than just a word processor. In addition, it may contain system resources that may not be available on other systems, such as fonts. But the basic concept is still essentially the same: define a monolithic storage format that can be used by a single, large application that is capable of understanding the entire format.
This approach has two intrinsic drawbacks, however. The first problem is that it requires any program that can read the format to understand each of the data types and be able to work with them. As the format expands to include more types of data, understanding the data types becomes increasingly complex. The program must know how to work with or display an ever increasing number of different data formats.
The second problem is that the entire format must be modified when adding new data types. This leads to a release of a new version of the Acrobat format whenever a new feature or type needs to be added. This new format in turn requires that new Acrobat browsers and creators be released. Everyone knows the frustration of trying to read a Word 5.1 document when all you have is Word 4.0. The only way around this problem is to try to shoehorn a new data type into an existing data type definition, which usually results in the loss of some important information.
The digiment format takes a different approach. Instead of attempting to define a single format for every type of data, it is more like an envelope into which you can place arbitrary data types while preserving relationships among them. This approach is more like the one taken by component object systems, such as OpenDoc and OLE 2.0. These schemes also use a "container" based approach; instead of defining a format that must be able to represent each of the data types within it, they define a container object that has data of arbitrary types within. Thus, an application that is very specialized can still open the container and work with only the data objects with types that it understands.
The digiment format solves the two problems that occur with all-in-one formats. First, a digiment browser does not need to understand the individual formats contained within the digiment. Since all the data types are standard, if it doesn't know how to handle a particular data type, it can pass that data object to another program that does. This only requires that each data object has an easily identifiable type associated with it, which can be accomplished by using MIME types (see "What is MIME?" on page 33). In fact, one of the advantages of having multiple representations of a document is for people who may not have the software to view one representation, but can recognize a different one.
Secondly, new types can be easily added to a digiment without requiring any changes to the standard or to the browsing software. All that is required is for the browser to add a new entry to a table containing a list of data types and actions to perform on each one. For example, if you wanted to browse a document that contained an MPEG movie, instead of having to get a version of a digiment browser that knows about MPEG movies, you could simply use any software that supports MPEG and tell the browser to use that player whenever it sees an MPEG object.
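The table of data types and actions can be sketched in a few lines of Python. This is only an illustration of the idea: the handler functions and program names are hypothetical, and the handlers return a description of what they would do rather than launching a real viewer.

```python
def show_text(data):
    """Handle a type the browser understands itself."""
    return "shown as text: " + data.decode("ascii")


def external(program):
    """Build a handler that hands the object to another program.

    A real browser would launch the program; this stand-in only reports it.
    """
    def handler(data):
        return f"passed {len(data)} bytes to {program}"
    return handler


# The table: MIME type -> action to perform on objects of that type.
handlers = {
    "text/plain": show_text,
    "application/postscript": external("ps-viewer"),  # hypothetical program name
}


def open_object(mime_type, data):
    """Dispatch on the MIME type without inspecting the data itself."""
    handler = handlers.get(mime_type)
    if handler is None:
        return f"no handler registered for {mime_type}"
    return handler(data)


movie = b"\x00\x01\x02"  # placeholder bytes, not a real MPEG stream
before = open_object("video/mpeg", movie)

# Supporting a new type is one new table entry; open_object is unchanged.
handlers["video/mpeg"] = external("mpeg-player")  # hypothetical program name
after = open_object("video/mpeg", movie)

print(before)
print(after)
```

The point of the design is that `open_object` never changes: adding MPEG support touches only the table, which mirrors the claim that neither the standard nor the browser needs to be revised.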
To illustrate the difference between the two, a digiment can contain a PDF object within it. This can be viewed by any client that can read the PDF format. However, the same digiment can also contain the original word processing, spreadsheet and database files that created the PDF, as well as voice annotations and a PostScript version of the PDF.
Even newer documents, which may already be available in electronic form, can have different formats. A PostScript document may be available in two different flavors: one that is a complete file ready to print, and another with the fonts omitted in order to reduce transmission time for those users who already have the correct fonts. Additionally, a PDF file may be available. Finally, individual page images may be generated from the PostScript original for people who do not want to transfer the entire PostScript file at once or who do not have a PostScript viewer.
A digiment format should be able to handle documents that have parts available in multiple representations.
By defining a specific format for bibliographic information and including this as part of the document, a digiment can provide a way to extract information about a document in an architected manner.
I have chosen a format for digiments which relies heavily on MIME, or Multipurpose Internet Mail Extensions. MIME provides two very useful concepts: the ability to embed sub-objects within objects, and the ability to associate a data type with each object. Because a digiment is built out of multiple sub-objects, one can be created simply by selecting a legal set of these sub-objects. New formats can be added by incorporating more sub-objects. In addition, since the digiment format uses MIME types, it is easy to add arbitrary data types, as long as they are tagged with a valid MIME type. The digiment itself is simply a new MIME type, which can be passed around like any other digital object.
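The two MIME concepts can be seen in a small Python sketch that uses the standard library's email package. It is not the digiment format: the name of the digiment's own MIME type is not given here, so the sketch assumes the standard multipart/mixed type as a stand-in container, and the sub-object contents are placeholders.

```python
import email
from email.mime.application import MIMEApplication
from email.mime.image import MIMEImage
from email.mime.multipart import MIMEMultipart
from email.mime.text import MIMEText

# Stand-in container: multipart/mixed is assumed, not the real digiment type.
container = MIMEMultipart("mixed")

# Each sub-object is tagged with its own MIME type. Contents are placeholders.
container.attach(MIMEText("Title: An Example Report\nAuthor: A. Author\n", "plain"))
container.attach(MIMEApplication(b"%!PS-Adobe-3.0\n", "postscript"))
container.attach(MIMEImage(b"II*\x00placeholder", "tiff"))

# Serialize, then parse again, as a client receiving the object would.
raw = container.as_bytes()
parsed = email.message_from_bytes(raw)

# A specialized client walks the container and picks out what it understands.
understood = {"text/plain"}
types = []
readable = []
for part in parsed.walk():
    if part.is_multipart():
        continue  # skip the container itself
    ctype = part.get_content_type()
    types.append(ctype)
    if ctype in understood:
        readable.append(part.get_payload(decode=True))

print(types)
print(readable)
```

A client that understands only plain text can still open the container, list every type it holds, and read the one part it knows, leaving the PostScript and image parts for other programs.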
The actual objects which make up a digiment are described in Chapter 4.