Scanned Document Record

Scanned document record, version CSTR 1.3
January 11, 1995
By Jerry Saltzer, Jack Eisan, and Mike Cook

(Note: if you have read an earlier draft, you may prefer to look at the revision history of this document.)

1. OBJECTIVES

This document specifies both the form and the content of the information that must be captured when a document is scanned. That information is kept as an on-line record that becomes a component of the scanned form of the document.

The objective of the scanned document record is to capture information that is not explicit in the scanned image, yet is needed:

to view, display, or print the image properly.
to understand how to interpret the image.
to meet contractual or legal requirements.

2. GENERAL APPROACH

The form of a scanned document record is a series of named fields, one per line. Each field line begins with the field name followed by a colon, then the field value. Any line of the scanned document record may have a comment at the end, preceded by a semicolon; any program reading the file will ignore all comments. A field line may be of any length, but it may NOT include typed carriage returns. This form is intended to be easy for a human operator to create using a word processor or spreadsheet, but it is to be used by computer programs that browse, display, or print documents, so the contents of some of the fields must conform to specific standards.

The scanned document record does not specify the format of the image files, but it is intended for use with image files that are in the TIFF format.
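As an illustration of the line format just described, here is a minimal sketch in Python of how a reading program might split a record into fields. The function name parse_record and the decision to skip lines that contain no colon are assumptions made for this sketch; they are not part of the specification.

```python
def parse_record(text):
    """Parse a scanned document record into (fields, maps).

    fields: dict mapping field name -> value for fields that appear once.
    maps:   list of Map field values, in the order they appear in the file.
    """
    fields = {}
    maps = []
    for line in text.splitlines():
        # Drop the trailing comment, if any: everything from the first semicolon on.
        content = line.split(";", 1)[0].strip()
        if not content:
            continue  # blank line, or a line holding only a comment
        # The field name runs up to the first colon; the rest is the value.
        name, colon, value = content.partition(":")
        if not colon:
            continue  # assumption: a line without a colon is not a field line
        name, value = name.strip(), value.strip()
        if name == "Map":
            maps.append(value)  # Map is the only field that may repeat
        else:
            fields[name] = value
    return fields, maps
```

Because the comment is removed before the line is split at the colon, a colon inside a comment cannot be mistaken for a field separator.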

The content of a scanned document record is most easily explained by first exhibiting a complete example, and then describing the requirements on field contents:

3. A COMPLETE EXAMPLE

Scanning record version: CSTR 1.3
Publishing department: M. I. T. Lab for Computer Science
Report label:  MIT-LCS-TM-13
Source:  first-generation original
Document series: TM
Image count:  30
Scanning agent:  M. I. T. Document Services
Input form: single-sided
Input size:  8.5 x 11
Suggested print form: double-sided
Suggested print size:  6 x 9
Scanner used: Fujitsu 3096g ADF
Software used:  Optix version 1.01
Operator:  Michael Cook
Date Scanned: 9/28/1994
Resolution(dpi): 400
Greyscale depth(bits): 8
Scanner settings: default
Text quality:  normal
Note:
Comment:   filename           size   cksum  contents      comments

Map:  scanrec.cstr.1.3.txt     45102 67890 format
Map:  MIT-LCS-TR-13-srec.txt  113076 00000 scanrecord
Map:  MIT-LCS-TR-13-001.tif  8417048 12345 cover
Map:  MIT-LCS-TR-13-002.tif  8417048 01234 blank         
Map:  MIT-LCS-TR-13-003.tif  8417048 78912 unnumbered    ;  title page
Map:  MIT-LCS-TR-13-004.tif  8417048 56789 blank         
Map:  MIT-LCS-TR-13-005.tif  8417048 23456 unnumbered    ;  acknowledgement
Map:  MIT-LCS-TR-13-006.tif  8417048 90123 blank         
Map:  MIT-LCS-TR-13-007.tif  8417048 67890 numbered 1 
Map:  MIT-LCS-TR-13-008.tif  8417048 34567 numbered 2    
Map:  MIT-LCS-TR-13-009.tif  8417048 01234 numbered 3    
Map:  MIT-LCS-TR-13-010.tif  8417048 78901 numbered 4    
Map:  MIT-LCS-TR-13-011.tif  8417048 45678 numbered 5    
Map:  MIT-LCS-TR-13-012.tif  8417048 12345 numbered 6    
Map:  MIT-LCS-TR-13-013.tif  8417048 89012 numbered 7    
Map:  MIT-LCS-TR-13-014.tif  8417048 56789 numbered 8    
Map:  MIT-LCS-TR-13-015.tif  8417048 23456 numbered 9    
Map:  MIT-LCS-TR-13-016.tif  8417048 90123 numbered 10    ; original skewed
Map:  MIT-LCS-TR-13-017.tif  8417048 67890 numbered 11   
Map:  MIT-LCS-TR-13-018.tif  8417048 34567 numbered 12   
Map:  MIT-LCS-TR-13-019.tif  8417048 01234 numbered 13   
Map:  MIT-LCS-TR-13-020.tif  8417048 78901 numbered 14    ; original says page 16
Map:  MIT-LCS-TR-13-021.tif  8417048 45678 numbered 15   
Map:  MIT-LCS-TR-13-022.tif  8417048 12345 numbered 16   
Map:  MIT-LCS-TR-13-023.tif  8417048 89012 numbered 17   
Map:  MIT-LCS-TR-13-024.tif  8417048 56789 spine         
Map:  MIT-LCS-TR-13-025.tif  8417048 23456 supporting     ; instructions to printer
Map:  MIT-LCS-TR-13-026.tif  8417048 90123 doccontrol    
Map:  MIT-LCS-TR-13-027.tif  8417048 67890 calibration IEEE-167a-1987 
Map:  MIT-LCS-TR-13-028.tif  8417048 34567 calibration AIIM-#2
Map:  MIT-LCS-TR-13-029.tif  8417048 01234 agent
Map:  MIT-LCS-TR-13-030.tif  8417048 78789 scancontrol

4. REQUIREMENTS ON THE FIELDS

The following requirements on the field contents allow a computer program to interpret the scanning record unambiguously:

Scanning record version: Must contain exactly the string shown in the example.

Publishing department, Report label, Document series, Scanner used, Software used, and Operator: Each can contain any string of characters. Conventionally, the report label will begin with the string "MIT-".

Source: One of the strings "First-generation original", "Later-generation copy", or "PostScript"

Image count: Must contain an integer. (The image count is the number of image files created for this document, including calibration, identification, and blank targets, etc. It matches the number in the image-number component of the highest-numbered Map record.)

Input form and Suggested print form: Must contain either "single-sided" or "double-sided".

Input size and Suggested print size: Must contain two decimal numbers separated by the letter "x". The numbers are assumed to be measurements in inches.

Date scanned: Month/Day/Year, each component being an integer (leading zeros omitted) and the year being four digits.
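The size and date requirements above could be checked along the following lines. This is only a sketch: the function names are hypothetical, and the choice to reject a leading zero in the month or day outright (rather than merely tolerate it) is one reading of "leading zeros omitted".

```python
import re
from datetime import date

def parse_size(value):
    """Return (width, height) in inches from a value such as '8.5 x 11'."""
    match = re.fullmatch(r"(\d+(?:\.\d+)?)\s*x\s*(\d+(?:\.\d+)?)", value.strip())
    if match is None:
        raise ValueError(f"not a valid size: {value!r}")
    return float(match.group(1)), float(match.group(2))

def parse_date_scanned(value):
    """Return a datetime.date from a Month/Day/Year value such as '9/28/1994'."""
    # Month and day: one or two digits with no leading zero; year: four digits.
    match = re.fullmatch(r"([1-9]\d?)/([1-9]\d?)/(\d{4})", value.strip())
    if match is None:
        raise ValueError(f"not a valid date: {value!r}")
    month, day, year = (int(part) for part in match.groups())
    return date(year, month, day)  # raises ValueError for impossible dates
```

Applied to the example record, parse_size("8.5 x 11") yields (8.5, 11.0) and parse_date_scanned("9/28/1994") yields September 28, 1994.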

Resolution(dpi) and Greyscale depth(bits): Must be integers, in dots per inch and bits respectively.

Scanner settings: At this time the only allowed value for this field is "default".

Note: an optional field containing any desired string. If this field is not empty, a properly constructed browser or print program will display its contents to the reader.

Comment: an optional field containing any desired string. It is ignored by browsers.

Map: Each file that comprises a single scanned document is represented by its own Map: field. This is the only field that may appear more than once in the scanning record. All Map fields must be together at the end of the scanning record. A Map field consists of an image file name, the file size, a five-digit checksum computed using the method of the BSD UNIX "sum" command, and an image content identifier. The checksum value that appears in the map for the scanning record file itself must be zero, and for the moment its value is ignored. The content identifier must be one of the following:


     scanrecord         Document scanning record
     format             Scan Record specification (this specification)
     scancontrol        Scanning project control form
     doccontrol         Government control form
     calibration        Test target
     agent              Logo of the scanning agent
     blank              Blank page inserted to maintain duplex printing order
     spine              Image intended for the book spine
     cover              Image intended for the book cover
     unnumbered         Page for which it appears that the publisher did not intend a page number
     numbered           Page of text that the publisher intended to number
     supporting         Material not intended for display or printing

The content identifier "calibration" is followed by a string that identifies a calibration test target. The content identifier "numbered" is followed by a string that represents the publisher's originally assigned page number. Except for numbered pages, the order of the Map fields is taken to be the order in which the images they describe are intended to be printed or displayed. Numbered pages may appear in the map in any order.
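The structure of a Map field described above can be illustrated with a short Python sketch. It takes the value of a Map field (the text after "Map:", with any trailing comment already removed) and splits it into its parts. The names MapEntry and parse_map_value are hypothetical, and treating a missing string after "calibration" or "numbered" as an error is an assumption of this sketch.

```python
from collections import namedtuple

CONTENT_IDS = {
    "scanrecord", "format", "scancontrol", "doccontrol", "calibration", "agent",
    "blank", "spine", "cover", "unnumbered", "numbered", "supporting",
}

MapEntry = namedtuple("MapEntry", "filename size checksum content argument")

def parse_map_value(value):
    """Split a Map field value into file name, size, checksum, and content."""
    # Four whitespace-separated items, then an optional trailing string.
    parts = value.split(None, 4)
    if len(parts) < 4:
        raise ValueError(f"incomplete Map field: {value!r}")
    filename, size, checksum, content = parts[:4]
    argument = parts[4].strip() if len(parts) > 4 else ""
    if not size.isdigit():
        raise ValueError(f"file size is not an integer: {size!r}")
    if not (len(checksum) == 5 and checksum.isdigit()):
        raise ValueError(f"checksum is not five digits: {checksum!r}")
    if content not in CONTENT_IDS:
        raise ValueError(f"unknown content identifier: {content!r}")
    # "calibration" names a test target; "numbered" carries the page number.
    if content in ("calibration", "numbered") and not argument:
        raise ValueError(f"{content!r} must be followed by a string")
    return MapEntry(filename, int(size), int(checksum), content, argument)
```

The page number of a "numbered" entry is kept as a string rather than converted to an integer, since the specification describes it only as a string representing the publisher's page number.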

A browser should treat any field that it does not recognize as a comment, and ignore it unless the user has asked to be told about unrecognized fields.
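Returning to the checksum carried in each Map field: the method of the BSD UNIX "sum" command is a 16-bit rotate-and-add over the bytes of the file. A minimal Python sketch follows; the function names are hypothetical, and the comparison helper is one assumed way a program might use the size and checksum together.

```python
def bsd_sum(data):
    """Return the 16-bit BSD "sum" checksum of a bytes object."""
    checksum = 0
    for byte in data:
        # Rotate the 16-bit running value right by one bit...
        checksum = (checksum >> 1) | ((checksum & 1) << 15)
        # ...then add the next byte, keeping only the low 16 bits.
        checksum = (checksum + byte) & 0xFFFF
    return checksum

def format_checksum(data):
    """Return the checksum as the five-digit string used in a Map field."""
    return f"{bsd_sum(data):05d}"

def matches_map_entry(data, size, checksum):
    """Check file contents against the size and checksum from a Map field."""
    return len(data) == size and bsd_sum(data) == checksum
```

Because the result is a 16-bit value, it never exceeds 65535 and always fits in five decimal digits.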

5. DOCUMENT NAMING CONVENTIONS

The Map field of the document scanning record is flexible enough to allow any desired file name to be used for any component of the document. However, to minimize confusion, the following conventions are used in naming the files.

Image file names consist of five components separated by hyphens and followed by a suffix:

The first component of all file names of documents originating at MIT is the string "MIT".
The second component of the file name is a string that identifies the department or laboratory that originated the document, e.g. "LCS" or "AI".
The third component of the file name is a string that identifies the document series, e.g. "TM", "TR", "AIM", etc.
The fourth component of the file name is the series tag of the document. This tag is usually an integer such as "417", but it may include letters, as in "418a", and in some cases it may be an arbitrary string of letters.
The fifth component of an image file name is the image number, with enough leading zeros that all image file names for the document are of the same length. Image numbers start with 1 and are consecutive in the order the images were scanned.
The file name of an image file ends with the string ".tif".
The file name of the scanning record ends with the string "srec.txt".
The complete file name of the format specification of the scanning record is simply "format.txt".

The various images (in TIFF format), the scanning record (in text format), and the format specification of the scanning record (also in text format) are collected in a directory or folder which is named by the first four components above, separated by hyphens.

Note that upper- and lower-case letters in the file names of "Map" fields must be recorded exactly as they appear in the actual file name, because on some computer systems upper- and lower-case letters are distinct.
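The naming conventions above can be sketched in Python as follows. The function names are hypothetical. The sketch assumes that no component contains a hyphen, and it uses a default width of three digits for the image number only because that is what the example record shows; the convention itself requires just enough leading zeros to make all names the same length.

```python
def split_image_name(filename):
    """Split a name such as 'MIT-LCS-TR-13-001.tif' into its components."""
    suffix = ".tif"
    if not filename.endswith(suffix):
        raise ValueError(f"not an image file name: {filename!r}")
    parts = filename[:-len(suffix)].split("-")
    if len(parts) != 5:
        raise ValueError(f"expected five components: {filename!r}")
    institution, department, series, tag, image = parts
    if not image.isdigit():
        raise ValueError(f"image number is not an integer: {image!r}")
    return {
        "institution": institution,
        "department": department,
        "series": series,
        "tag": tag,
        "image_number": int(image),
        # The directory is named by the first four components.
        "directory": "-".join(parts[:4]),
    }

def image_file_name(directory, image_number, width=3):
    """Build an image file name; width is the zero-padded digit count."""
    return f"{directory}-{image_number:0{width}d}.tif"
```

Letter case is passed through unchanged in both directions, in keeping with the note above.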

6. OPEN QUESTIONS AND THINGS FOR FURTHER THOUGHT OR FUTURE REVISIONS

1. The "Document series" field is probably not needed, because the information is contained in the "Report label" field.

2. Original and input sheet sizes are not separately recorded; perhaps they should be. (Some older TR's were printed in a 6 x 9-inch format from 8.5 x 11-inch originals, and other TR's may be printed from pages produced on cut-sheet printers.)

3. The original document date isn't captured. Should it be?

4. Need a systematic way to insert scanning agent logo on cover page.

5. The content identifiers "scancontrol" and "spine" could be handled as "supporting", with their intended use appearing as comments.

6. There is no support for bringing the reader's attention to the copyright notice. One suspects that some scheme will be needed for a browser to locate that notice and display it. (Since TR's and TM's generally don't have copyright notices, this issue isn't particularly pressing.)

7. It may be appropriate to record observations by the scanning operator more systematically, for example that the original is skewed, off-center, or wrinkled, or that it contains a photograph or other non-textual material.

8. The checksum of the scanning record that is recorded in the scanning record has the value of zero. The specification calls for it to be ignored because simply inserting the current checksum in the record would change its checksum. A procedure for adding a fiddle field at the end of the record whose value is chosen to force the actual checksum of the file to zero or some predictable value should be developed.

7. REVISION HISTORY

ACKNOWLEDGEMENT

This note is expanded from a set of ideas originally developed at a Library 2000 group meeting on March 17, 1994. Discussants: Jack Eisan, Mitchell Charity, Ali Alavi, Sally Richter, Mary Anne Ladd, Jeremy Hylton, Geoff Seyon, Eytan Adar, Greg Anderson, Jerry Saltzer. Since that time additional suggestions and ideas have come from Michael Cook, Gillian Elcock, Yoav Yerushalmi, and Andrew Kass.

For more information please contact Jerry Saltzer, Saltzer@mit.edu
