
Chapter 5

A Prototype Digiment Generator

This chapter describes a system that is able to generate digiments on the fly for any of the MIT LCS/AI Technical Reports that are on-line as part of the Library 2000 project.

5.1 General Overview

As part of designing the digiment format, I have created a system that generates multipart/digiments for the Technical Reports (TRs) which have been placed on-line as part of the Library 2000 project. These TRs generally consist of either Postscript, which is collected from the author as part of the submission process, or scanned images, which are produced as part of our effort to scan all LCS and AI Lab technical reports. In some cases, both a Postscript version and images, usually generated from the Postscript, exist. In addition, all the TRs on-line have a bibliographic entry in the standard bibliographic record format defined by RFC1357.

5.1.1 Creating a digiment on the fly

The system that I created generates digiments "on the fly" for MIT TRs. This is accomplished by having the server run a special program, create-digiment, whenever a TR is requested in digiment form. This program determines the available formats of the TR, and generates the appropriate digiment. Create-digiment also uses a helper program, called pif2digiment, to create page-maps and page-lists. Both of these programs are written in Perl. Both programs also assume that they have access to the filesystem where the actual document images and other files are stored locally. This is currently assured by having the programs run on the server workstation.

5.1.2 How documents are stored on the server

The create-digiment program has knowledge of how the actual TRs are stored locally in order to determine what parts of the TR are available and in which formats. Each TR is stored in a specific directory on the server; the name of this directory can be uniquely determined by the ID of the TR. A special routine can take the ID of a TR and return the path name of the directory where it is stored.

Within each document directory, there are several items. Almost every TR available has an RFC1357 bibliographic record stored there. If a Postscript version is available, it too will be stored in this directory. If the document was scanned, a copy of the scanned document record will be available here. Finally, a subdirectory exists for each different format for which images are available. By looking at the contents of this directory, it is possible to determine what formats a TR is available in and to access each of these parts.
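The directory inspection described above can be sketched as follows. This is only an illustration in Python (the actual programs are written in Perl), and the file names and Postscript suffixes are hypothetical, since the chapter does not say what the files are called.

```python
from pathlib import Path

# Hypothetical names: the chapter does not specify the real file names.
BIB_NAME = "bib"
SCAN_RECORD_NAME = "scan-record"
POSTSCRIPT_SUFFIXES = (".ps", ".ps.Z", ".ps.gz")


def available_parts(doc_dir):
    """Report which parts of a TR exist in its document directory."""
    root = Path(doc_dir)
    if not root.is_dir():
        raise FileNotFoundError(f"not a document directory: {doc_dir}")
    entries = list(root.iterdir())
    return {
        "bibliography": (root / BIB_NAME).is_file(),
        "scan_record": (root / SCAN_RECORD_NAME).is_file(),
        "postscript": any(
            e.is_file() and e.name.endswith(POSTSCRIPT_SUFFIXES) for e in entries
        ),
        # One subdirectory per image format, as described above.
        "image_formats": sorted(e.name for e in entries if e.is_dir()),
    }
```

A caller would pass the path returned by the ID-to-directory routine and use the result to decide which digiment parts to generate.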

5.1.3 Parts of the digiment generator system

The process of generating a digiment can be broken down into several stages. The first step is acquiring the documents. This was already being done by the Library 2000 project. Documents are either scanned in from originals, or Postscript is acquired as part of the author's submission process. If the document has been scanned, the scanning station operator creates a scanned document record for the document.

Next, the documents are processed and placed on-line. Postscript versions are compressed and put into the proper directory. However, scanned documents are first processed into different resolutions and then stored in the proper directory. This process also generates a PIF, described below, for each resolution. Although the image processing was already part of the Library 2000 system, the PIFs were created specifically to support generation of digiments.

Finally, when a digiment is requested, the server runs the create-digiment program, which generates a proper digiment and returns it. The rest of this chapter describes the functioning of this program.

5.2 PIF Files

Part of the image-processing step for scanned documents is generating a Page Information File, or PIF. The PIF stores information about each image of a document, such as its format, the page number of the image as the author intended, and a brief description of the contents of the page. These files are generated only for documents that have been scanned as part of the Library 2000 project.

5.2.1 Scanned document records

When each document is scanned, a human operator creates a scanned document record for the document. This record contains information about the document being scanned, the hardware and software used, and the name of the operator. Additionally, it contains a map of each page that has been scanned. For each file in a set of scanned images for a document, the scan record lists the name of the file, the size and checksum of the file, the contents of the file as determined by the human operator and any comments on that page that may be appropriate. This page mapping is used in the creation of PIFs. The scanned document record is more completely described in Appendix C.

5.2.2 Processing the scanned images

The original documents are scanned at 400 DPI, 8 bit greyscale, and stored as TIFF images. However, as each of these images takes approximately 14.5 MB of storage space, we process these images into versions more suitable for distribution. Currently, each image is converted into a 600 DPI monochrome TIFF image, suitable for printing, and a 100 DPI, 5 bit greyscale GIF image, which is designed for on-line browsing. This processing takes place automatically via a series of programs that read the scanned document record and convert each page to the proper formats.

5.2.3 Creating a PIF file

During this processing step a PIF is created for each of the different image formats. A PIF contains a listing of each of the scanned images as they were processed, which is generally the order in which they were scanned. For each image, the PIF lists several pieces of information. First is the image number, which is the number used to access the image from the server. Next is the format of the image. Third is the page number; this is either a number if the image is a numbered page, or the value "unnumbered" if the page has no page number or is an image that is not directly part of the document, such as a calibration sheet. This field is extracted from the content field of the scanned document record. Finally, the last field contains a description of the image; this is basically the content and the comment fields of the scanned document record concatenated together. Each field is separated by a tab; this allows spaces to be used within the fields. The first few lines of a typical PIF look like this:

Version:    1.0
Processing-comments:          Tue Apr 11 09:57:37 1995
Image:  1  image/gif;dpi=100;bits=5                  unnumbered        doccontrol
Image:  2  image/gif;dpi=100;bits=5                  unnumbered        doccontrol
Image:  3  image/gif;dpi=100;bits=5                  unnumbered        unnumbered;title page
Image:  4  image/gif;dpi=100;bits=5                  2        numbered 2
Image:  5  image/gif;dpi=100;bits=5                  3        numbered 3
A similar PIF would be created for the 600 DPI TIFF images.
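Reading a PIF back in is straightforward because the fields are tab-separated. The following is a minimal sketch in Python rather than the Perl of the actual programs; the names PifImage and parse_pif are hypothetical, and the sketch assumes that fields are separated by one or more tab characters and that header lines use the same "Name:" convention as Image: lines.

```python
import re
from typing import NamedTuple


class PifImage(NamedTuple):
    number: int        # image number used to request the image from the server
    format: str        # e.g. "image/gif;dpi=100;bits=5"
    page: str          # a page number, or the literal "unnumbered"
    description: str   # content and comment fields from the scan record


def parse_pif(text):
    """Split PIF text into a header dict and a list of PifImage entries."""
    headers, images = {}, []
    for line in text.splitlines():
        if not line.strip():
            continue
        # Assumption: one or more tabs separate the fields.
        fields = re.split(r"\t+", line.rstrip())
        key = fields[0].rstrip(":")
        if key == "Image":
            number, fmt, page = fields[1:4]
            description = fields[4] if len(fields) > 4 else ""
            images.append(PifImage(int(number), fmt, page, description))
        else:
            headers[key] = fields[1] if len(fields) > 1 else ""
    return headers, images
```

Because only tabs are treated as separators, a description such as "unnumbered;title page" survives intact with its embedded space.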

5.3 The Create-digiment Program

The actual digiment is created by the create-digiment program. This program looks at the contents of a document's directory and any available PIF files to generate a proper digiment. This program is run whenever a client requests a document in digiment form from the server. Thus, any newly-added formats for a document will show up immediately when the next client requests the digiment.

The create-digiment program generates page-lists, page-maps, and body-lists, as well as a part-list and a bibliography part.

5.4 Generating Page-lists and Page-maps

The create-digiment program creates page-list and page-map digiment-types for each digiment for which PIF files exist. The create-digiment program calls a separate program, pif2digiment, which generates a page-map and page-list for each image format that exists for a particular TR. The pif2digiment program takes a PIF file, a MIME multipart boundary string, and the ID of the document as arguments, and returns a page-list and a page-map application/digiment, separated by the multipart boundary. This program is run for each PIF associated with a document, creating a page-list and page-map for each image format.

The pif2digiment program scans through the entire PIF, generating a page-list and a page-map. Although these two parts are described separately, the actual processing takes place simultaneously so that the program does not need to scan through the PIF twice. When the program has scanned the entire PIF file, it prints out the page-map part, and then the page-list part, separated by an encapsulation boundary.

5.4.1 Generating page-lists

The page-list is created by first generating the standard MIME and digiment headers. Then the program generates a URL-stem that can be used to access the document. Since all of the documents are being served using the Dienst protocol, the URL-stem created uses this format. The following general URL can be used to access individual pages of a particular document that is stored on a Dienst server:

http://server.name.here/Server/TR/technical-report-ID/Page/#?type=data/type
The placeholder values (server.name.here, technical-report-ID, #, and data/type) are replaced as explained in Table 5.1.

Table 5.1:  Values for Dienst URL to access a page
-----------------------------------------------------------------------
Value name           Meaning
-----------------------------------------------------------------------
server.name.here     The name of the Dienst server that stores the
                     document
technical-report-ID  The CSTR ID for a technical report that is stored
                     on the server
#                    The image number requested
data/type            The format in which the image is requested
-----------------------------------------------------------------------
The portion of the URL that is identical for all images, and can therefore be placed in the URL-stem header, is everything up to the image number. So a typical URL-stem for an MIT TR would look like this:

URL-stem:    http://cstr-http.lcs.mit.edu/Server/TR/AI-TM-1066/Page/
Next, the content-type header is created. Since all the images are assumed to be of the same type for this program, the content-type from the first entry in the PIF is extracted and used for this field. The last header, page-map, contains the Content-ID of the page-map that is simultaneously generated from the same PIF file.

The rest of the page-list contains an entry for each image as specified in section 4.4.5. The entry is constructed from the information in the PIF. The image number from the PIF is used as the VSN for the page. The remainder of the URL is constructed by concatenating the image number, the string "?type=" and the format type to complete a Dienst URL as described above. The page description field is chosen by looking at the image format: for a 600 DPI image the field is set to "Printable image", and for a 100 DPI image it is set to "Viewable image".
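The URL construction and the choice of page description can be sketched as below. This is a Python illustration, not the Perl code of pif2digiment. The function names are hypothetical, and two points are assumptions: that the full format string from the PIF (including its dpi and bits parameters) is used as the type value, and that a resolution other than 600 or 100 DPI yields an empty description.

```python
def url_stem(server, tr_id):
    """Build the part of a Dienst page URL shared by all images of a TR."""
    return f"http://{server}/Server/TR/{tr_id}/Page/"


def page_url_remainder(image_number, image_format):
    """Build the per-image part of the URL: number, "?type=", format."""
    return f"{image_number}?type={image_format}"


def format_params(image_format):
    """Split "type/subtype;key=value;..." into the media type and a dict."""
    media_type, *params = image_format.split(";")
    return media_type, dict(p.split("=", 1) for p in params if "=" in p)


def page_description(image_format):
    """Pick the page description from the image resolution."""
    _, params = format_params(image_format)
    dpi = params.get("dpi")
    if dpi == "600":
        return "Printable image"
    if dpi == "100":
        return "Viewable image"
    return ""  # Assumption: the text does not cover other resolutions.
```

Concatenating the stem and the remainder gives the complete Dienst URL for one image, which is why only the remainder needs to appear in each page-list entry.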

5.4.2 Generating page-maps

Generating page-maps is very similar to generating page-lists. As the pif2digiment program is generating a line of the page-list, it generates the corresponding line of the page-map. Both lines are based on the same line of the PIF.

The first field in a Map: entry is the same VSN that is used in the page-list, and also comes from the PIF image number. The next two fields are the page content and page description fields. These fields are created by looking at the page number field of the PIF. If the page is a numbered page, the page content field is filled with the page number, as is the page description field. If the page number field is "unnumbered", then the entry depends on the description field of the PIF. If the description field contains "title page", then the page content field becomes "title page", and the page description field becomes "Title page". If the description field contains "blank", then the page content field becomes "unnumbered", as does the page description field. Any other value in the description field causes the page content field to become "supporting", while the page description field gets the value of the PIF description field. A summary of this is presented in Table 5.2.

Table 5.2:  Mapping from PIF to page-map fields
------------------------------------------------------------------------
Page number      Description    Page content           Page description   
------------------------------------------------------------------------
any number       anything       Page number field      Description field  
unnumbered       title page     title page             Title page         
unnumbered       blank          unnumbered             unnumbered         
unnumbered       anything else  supporting             Description field  
                                                                          
------------------------------------------------------------------------
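The mapping in Table 5.2 can be written as a small function. This is a Python sketch rather than the Perl of pif2digiment, and the function name is hypothetical. Where the prose and the table differ for numbered pages, the sketch follows the table and passes the PIF description through. Checking for "title page" before "blank" is also an assumption, since the text does not say which test comes first.

```python
def page_map_fields(page, description):
    """Map a PIF page number and description to (page content, page description)."""
    if page != "unnumbered":
        # Numbered page: content is the page number (table row 1).
        return page, description
    if "title page" in description:
        return "title page", "Title page"
    if "blank" in description:
        return "unnumbered", "unnumbered"
    # Anything else, such as a calibration or document control sheet.
    return "supporting", description
```

Applied to the sample PIF in section 5.2.3, images 1 and 2 ("doccontrol") become supporting pages, image 3 becomes the title page, and images 4 and 5 keep their page numbers.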

5.5 Generating the Other Digiment Parts

The create-digiment program also generates a bibliography part, a part-list part and a body-list part (if Postscript is available).

5.5.1 Generating the bibliography digiment-type

The bibliography application/digiment part is very simple to generate. All of the technical reports on-line have RFC1357 style bibliographic records stored in the TR's directory. The create-digiment program simply creates the standard MIME and application/digiment headers, and then appends the contents of the RFC1357 bibliographic file.

5.5.2 Generating the body-list digiment-type

Currently, a body-list part is only generated for those TRs that have Postscript versions available. If a Postscript version exists in the TR's directory, the create-digiment program first generates the standard MIME and application/digiment headers. Then the program generates a URL-stem that can be used to access the document. Since all of the documents are being served using the Dienst protocol, the URL-stem created uses this format. The following general URL can be used to access the entire body of a particular document that is stored on a Dienst server:

http://server.name.here/Server/TR/technical-report-ID/Body?type=data/type
The placeholder values (server.name.here, technical-report-ID, and data/type) are replaced as explained in Table 5.3.

Table 5.3:  Values for Dienst URL to access a document body
-----------------------------------------------------------------------
Value name           Meaning
-----------------------------------------------------------------------
server.name.here     The name of the Dienst server that stores the
                     document
technical-report-ID  The CSTR ID for a technical report that is stored
                     on the server
data/type            The format in which the body is requested
-----------------------------------------------------------------------
The portion of the URL that is identical for all body parts, and can therefore be placed in the URL-stem header, is everything up to the question mark. So a typical URL-stem for an MIT TR would look like this:

URL-stem:    http://cstr-http.lcs.mit.edu/Server/TR/AI-TM-1066/Body
Next, the content-type header is filled in. The current version of the program only allows a value of "application/postscript". Finally, a single Body: line is created with a VSN of 1, a URL value of "?type=application/postscript", and a Page Description field of "Postscript document". Since the URL-stem line uniquely identifies the TR, this line is the same for all TRs and looks like this:

Body:  1    ?type=application/postscript                      Postscript document
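The relationship between the URL-stem and the single Body: entry can be sketched as follows. This is a Python illustration with hypothetical function names. It returns the three fields of the entry rather than a formatted line, since the exact field separator is defined by the digiment format and not restated here.

```python
def body_url_stem(server, tr_id):
    """Build the part of a Dienst body URL that precedes the question mark."""
    return f"http://{server}/Server/TR/{tr_id}/Body"


def body_entry(content_type="application/postscript"):
    """Return the (VSN, URL remainder, description) fields of the Body: line."""
    # The current program only supports Postscript, so the entry is fixed.
    return 1, f"?type={content_type}", "Postscript document"
```

Appending the URL remainder to the stem yields the full Dienst URL for the Postscript body, which is why the entry itself is identical for every TR.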

5.5.3 Generating the part-list digiment-type

The part-list digiment type is generated by the create-digiment program. After all of the other application/digiments that are part of the digiment have been generated, the program scans each digiment part and extracts the digiment-type and content-ID from each one. In the case of a page-list or body-list part, it also extracts the content-type. The program then creates a part-list by generating the standard MIME and application/digiment headers, then creating a Part: line for each application/digiment, consisting of the digiment type and the content-ID, followed by the content-type if applicable.

5.6 Putting It All Together

The digiment creation process is completely automated, and works for all MIT TRs which have bibliographic records, Postscript, or scanned images available on-line. The following is an outline of the complete process, from request to delivery of a digiment.

Since digiments are created on the fly, the process does not start until a digiment has been requested. This is accomplished by requesting the body of a document from the server with a type of multipart/digiment, the base MIME type for a digiment. Thus, a typical URL requesting a digiment might look like this:

http://cstr-http.lcs.mit.edu/Server/TR/MIT-AILab:AIM-1066/Body?type=multipart/digiment
When the server is processing the request, it notices that the type requested is multipart/digiment. At this point, the server calls the program create-digiment, with the ID of the TR as an argument.

The create-digiment program first resolves the ID into the directory path where the document is stored locally. The program then checks the directory to make sure that it is actually a valid directory which contains a document. If not, it returns an error. If the directory does contain a document, the program continues by generating a bibliography part. This part is then put into a list of application/digiment parts.

The create-digiment program then looks at each of the subdirectories of the document directory to see if they contain a PIF. If a PIF exists for that subdirectory, the program inserts it into a list of PIFs. Then, the program calls pif2digiment for each PIF that was found. Each call returns a page-list and a page-map generated from that PIF. Each of these parts is put into the list of application/digiments.

Next, the program checks to see if a Postscript file exists in the document directory. If it does, the program generates a body-list part for the Postscript.

Finally, the program generates a part-list from all of the other application/digiment parts that have been created. It then creates a MIME header for the multipart/digiment document and returns this followed by each of the application/digiment parts, separated by the appropriate encapsulation boundaries. The server takes this return value, which is a legal multipart/digiment, and returns it to the client. The actual response time varies depending on the number of pages and formats of the document and the load of the server, but is typically on the order of a few seconds.

The entire process is summarized briefly below:

  1. Client requests digiment from server.
  2. Server calls create-digiment with ID of document.
  3. Create-digiment validates ID and local storage of document.
  4. Create-digiment generates a bibliography part.
  5. Create-digiment gets a list of PIFs which are available.
  6. Create-digiment generates a page-list and page-map for each PIF by calling pif2digiment.
  7. Create-digiment generates a body-list if Postscript is available.
  8. Create-digiment generates a part-list.
  9. Create-digiment returns the application/digiments in a multipart/digiment format.
  10. Server returns the multipart/digiment to the client.
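The final assembly step, in which the generated parts are joined by encapsulation boundaries, can be sketched as below. This is a Python illustration of standard MIME multipart framing, not the Perl code of create-digiment. The function name is hypothetical, each part is assumed to be a string that already carries its own MIME and application/digiment headers, and the exact top-level headers the real program emits are an assumption.

```python
def assemble_multipart(parts, boundary):
    """Join already-formatted parts into one multipart/digiment message."""
    delimiter = f"--{boundary}"
    for part in parts:
        # A boundary must never occur inside a part it is meant to separate.
        if delimiter in part:
            raise ValueError("boundary string occurs inside a part")
    lines = [
        "MIME-Version: 1.0",
        f'Content-Type: multipart/digiment; boundary="{boundary}"',
        "",
    ]
    for part in parts:
        lines.append(delimiter)   # opening delimiter for this part
        lines.append(part)
    lines.append(f"{delimiter}--")  # closing delimiter ends the message
    return "\r\n".join(lines) + "\r\n"
```

The same boundary string would be passed to pif2digiment, so that the page-list and page-map it returns can be placed in the list of parts without reformatting.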
