Scanned document record, version CSTR 1.3
January 26, 1995
By Jerry Saltzer, Jack Eisan, and Mike Cook

(Note: if you have read an earlier draft, you may prefer to look at the revision history of this document.)


The objective of this document is to specify both the form and content of the information that must be captured when a document is scanned, as an on-line record that becomes a component of the scanned form of the document.

The objective of the scanned document record is to capture information that is not explicit in the scanned image, yet is needed:


The form of a scanned document record is a series of named fields, one per line. Each field line begins with the field name followed by a colon, then the field value. Any line of the scanned document record may end with a comment, preceded by a semicolon; any program reading the file will ignore all comments. A field line may be of any length, but it may NOT include typed carriage returns. This form is intended to be easily created by a human operator using a word processor or spreadsheet, but it is to be used by computer programs that browse, display, or print documents, so the contents of some of the fields must conform to specific standards.
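As an illustration only, a reading program might split the record into fields along the following lines. This is a minimal Python sketch, not part of the specification. The function names are hypothetical, and it assumes that the first semicolon on a line always starts a comment (that is, that field values never contain a semicolon) and that the field name ends at the first colon.

```python
def parse_field_line(line):
    """Return (name, value) for one field line, or None for an empty line.

    Assumption: the first semicolon starts the trailing comment, and the
    first colon ends the field name.
    """
    content, _, _comment = line.partition(";")   # drop the comment, if any
    content = content.strip()
    if not content:
        return None                              # blank or comment-only line
    name, colon, value = content.partition(":")
    if not colon:
        raise ValueError("field line has no colon: %r" % line)
    return name.strip(), value.strip()


def parse_record(text):
    """Split a scanning record into ordinary fields and Map values.

    Map is the only field that may repeat, so its values are kept in an
    ordered list; every other field is stored once in a dictionary.
    """
    fields = {}
    maps = []
    for line in text.splitlines():
        parsed = parse_field_line(line)
        if parsed is None:
            continue
        name, value = parsed
        if name == "Map":
            maps.append(value)
        else:
            fields[name] = value
    return fields, maps
```

Keeping the Map values in a list preserves their order, which the record uses to convey the intended display order of the images.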

The scanned document record does not specify the format of the image files, but it is intended for use with image files that are in the TIFF format.

The content of a scanned document record is most easily explained by first exhibiting a complete example, and then describing the requirements on field contents:


Scanning record version: CSTR 1.3
Publishing department: M. I. T. Lab for Computer Science
Report label:  MIT-LCS-TM-13
Source:  first-generation original
Document series: TM
Image count:  30
Scanning agent:  M. I. T. Document Services
Input form: single-sided
Input size:  8.5 x 11
Suggested print form: double-sided
Suggested print size:  6 x 9
Scanner used: Fujitsu 3096g ADF
Software used:  Optix version 1.01
Operator:  Michael Cook
Date Scanned: 9/28/1994
Resolution(dpi): 400
Greyscale depth(bits): 8
Scanner settings: default
Text quality:  normal
Comment:   filename           size   cksum  contents      comments

Map:  scanrec.cstr.1.3.txt     45102 67890 format
Map:  MIT-LCS-TR-13-srec.txt  113076 12345 scanrecord
Map:  MIT-LCS-TR-13-001.tif  8417048 12345 cover
Map:  MIT-LCS-TR-13-002.tif  8417048 01234 blank         
Map:  MIT-LCS-TR-13-003.tif  8417048 78912 unnumbered    ;  title page
Map:  MIT-LCS-TR-13-004.tif  8417048 56789 blank         
Map:  MIT-LCS-TR-13-005.tif  8417048 23456 unnumbered    ;  acknowledgement
Map:  MIT-LCS-TR-13-006.tif  8417048 90123 blank         
Map:  MIT-LCS-TR-13-007.tif  8417048 67890 numbered 1 
Map:  MIT-LCS-TR-13-008.tif  8417048 34567 numbered 2    
Map:  MIT-LCS-TR-13-009.tif  8417048 01234 numbered 3    
Map:  MIT-LCS-TR-13-010.tif  8417048 78901 numbered 4    
Map:  MIT-LCS-TR-13-011.tif  8417048 45678 numbered 5    
Map:  MIT-LCS-TR-13-012.tif  8417048 12345 numbered 6    
Map:  MIT-LCS-TR-13-013.tif  8417048 89012 numbered 7    
Map:  MIT-LCS-TR-13-014.tif  8417048 56789 numbered 8    
Map:  MIT-LCS-TR-13-015.tif  8417048 23456 numbered 9    
Map:  MIT-LCS-TR-13-016.tif  8417048 90123 numbered 10    ; original skewed
Map:  MIT-LCS-TR-13-017.tif  8417048 67890 numbered 11   
Map:  MIT-LCS-TR-13-018.tif  8417048 34567 numbered 12   
Map:  MIT-LCS-TR-13-019.tif  8417048 01234 numbered 13   
Map:  MIT-LCS-TR-13-020.tif  8417048 78901 numbered 14    ; original says page 16
Map:  MIT-LCS-TR-13-021.tif  8417048 45678 numbered 15   
Map:  MIT-LCS-TR-13-022.tif  8417048 12345 numbered 16   
Map:  MIT-LCS-TR-13-023.tif  8417048 89012 numbered 17   
Map:  MIT-LCS-TR-13-024.tif  8417048 56789 spine         
Map:  MIT-LCS-TR-13-025.tif  8417048 23456 supporting     ; instructions to printer
Map:  MIT-LCS-TR-13-026.tif  8417048 90123 doccontrol    
Map:  MIT-LCS-TR-13-027.tif  8417048 67890 calibration IEEE-167a-1987 
Map:  MIT-LCS-TR-13-028.tif  8417048 34567 calibration AIIM-#2
Map:  MIT-LCS-TR-13-029.tif  8417048 01234 agent
Map:  MIT-LCS-TR-13-030.tif  8417048 78789 scancontrol


The following requirements on the field contents allow a computer program to interpret the scanning record unambiguously:

Scanning record version: must contain exactly the string shown in the example. The version number is incremented only for changes in the specification that are incompatible with the practice of the prior version. If a change in the specification does not require a change in the practice, the version number is unchanged.

Publishing department, Report label, Document series, Scanner used, Software used, and Operator: each can contain any string of characters. Conventionally, the report label will begin with the string "MIT-".

Source: Must contain one of the strings "First-generation original", "Later-generation copy", or "PostScript".

Image count: Must contain an integer. (The image count is the number of image files created for this document, including calibration, identification, and blank targets, etc. It matches the number in the image-number component of the highest-numbered Map record.)

Input form and Suggested print form: Must contain either "single-sided" or "double-sided".

Input size and Suggested print size: Must contain two decimal numbers separated by the letter "x". The numbers are assumed to be measurements in inches.

Date scanned: Month/Day/Year, each component being an integer (leading zeros omitted) and the year being four digits.

Resolution and Greyscale depth: Must contain integers.

Scanner settings: At this time the only allowed value for this field is "default".

Note: an optional field containing any desired string. If this field is not empty, a properly constructed browser or print program will display its contents to the reader.

Comment: an optional field containing any desired string. It is ignored by browsers.

Map: Each file that comprises a single scanned document is represented by its own Map: field. This is the only field that may appear more than once in the scanning record. All Map fields must be together at the end of the scanning record. A Map field consists of an image file name, the file size, a five-digit checksum computed using the method of the BSD UNIX "sum" command, and an image content identifier. The checksum value that appears in the map for the scanning record file itself must be zero, and for the moment its value is ignored. The content identifier must be one of the following:

scanrecord    Document scanning record
format        Scan Record specification (this specification)
scancontrol   Scanning project control form
doccontrol    Government control form
calibration   Test target
agent         Logo of the scanning agent
blank         Blank page inserted to maintain duplex printing order
spine         Image intended for the book spine
cover         Image intended for the book cover
unnumbered    Page for which it appears that the publisher did not intend a page number
numbered      Page of text that the publisher intended to number, whether or not that number is printed on the page
supporting    Material not intended for display or printing

The content identifier "calibration" is followed by a string that identifies a calibration test target. The content identifier "numbered" is followed by a string that represents the publisher's originally assigned page number.
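The structure of a Map field described above can be sketched in Python as follows. This is an illustration under stated assumptions, not a definitive implementation: the names MapEntry and parse_map_value are hypothetical, the trailing comment is assumed to have been removed already, and file names are assumed to contain no white space.

```python
from typing import NamedTuple, Optional

# The twelve content identifiers listed in this specification.
CONTENT_IDS = {
    "scanrecord", "format", "scancontrol", "doccontrol", "calibration",
    "agent", "blank", "spine", "cover", "unnumbered", "numbered",
    "supporting",
}
# Identifiers that are followed by an extra string.
NEEDS_ARGUMENT = {"calibration", "numbered"}


class MapEntry(NamedTuple):
    filename: str
    size: int
    checksum: int
    content: str
    argument: Optional[str]   # target name or publisher's page number


def parse_map_value(value):
    """Parse the value part of one Map field (comment already stripped)."""
    parts = value.strip().split(None, 4)
    if len(parts) < 4:
        raise ValueError("Map field needs name, size, checksum, content")
    filename, size, checksum, content = parts[:4]
    argument = parts[4] if len(parts) == 5 else None
    if not (len(checksum) == 5 and checksum.isdigit()):
        raise ValueError("checksum must be five digits: %r" % checksum)
    if content not in CONTENT_IDS:
        raise ValueError("unknown content identifier: %r" % content)
    if content in NEEDS_ARGUMENT and argument is None:
        raise ValueError("%r must be followed by a string" % content)
    # The page number stays a string: the specification does not say
    # that publisher page numbers are always integers.
    return MapEntry(filename, int(size), int(checksum), content, argument)
```

For example, the calibration line of the sample record would yield an entry whose content is "calibration" and whose argument is "IEEE-167a-1987".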

The order of the Map fields is taken to be the order in which the images they describe are intended to be printed or displayed, except when the input form is double-sided. In that case all pages within the numbered region are assumed to have been scanned odd sides first, even sides last and reversed.
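One possible reading of the double-sided rule is sketched below in Python. It assumes that the numbered region holds an even number of images, that the first half are the odd (front) sides in reading order, and that the second half are the even (back) sides in reverse order. The specification does not spell out these details, so both the assumptions and the function name reading_order are hypothetical.

```python
def reading_order(scanned):
    """Restore reading order for a double-sided numbered region.

    Assumption: `scanned` lists the images in scan order, all front
    sides first (1, 3, 5, ...) and then all back sides reversed
    (..., 6, 4, 2), with as many backs as fronts.
    """
    if len(scanned) % 2 != 0:
        raise ValueError("expected equal numbers of front and back sides")
    half = len(scanned) // 2
    fronts = scanned[:half]
    backs = scanned[half:][::-1]      # undo the reversal of the backs
    ordered = []
    for front, back in zip(fronts, backs):
        ordered.extend([front, back])
    return ordered
```

With six images scanned in the order 1, 3, 5, 6, 4, 2, the function returns them in the order 1 through 6.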

A browser should treat any field that it does not recognize as a comment, and ignore it unless the user has asked to be told about unrecognized fields.
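The field requirements above, together with the rule for unrecognized fields, could be checked along the following lines. This Python sketch is illustrative only. It assumes that field names and the Source strings are compared without regard to case (the example and the requirements differ in capitalization), and it accepts without checking the fields that appear in the example but have no stated constraint ("Scanning agent" and "Text quality"). The name validate and the tables it uses are hypothetical.

```python
import re
from datetime import date

SOURCES = {"first-generation original", "later-generation copy", "postscript"}
FORMS = {"single-sided", "double-sided"}


def is_integer(value):
    return value.isdigit()


def is_size(value):
    # Two decimal numbers separated by the letter "x", e.g. "8.5 x 11".
    return re.fullmatch(r"\d+(\.\d+)?\s*x\s*\d+(\.\d+)?", value) is not None


def is_date(value):
    # Month/Day/Year, no leading zeros, four-digit year.
    m = re.fullmatch(r"([1-9]\d?)/([1-9]\d?)/(\d{4})", value)
    if m is None:
        return False
    month, day, year = (int(g) for g in m.groups())
    try:
        date(year, month, day)        # rejects impossible dates
    except ValueError:
        return False
    return True


CHECKS = {
    "scanning record version": lambda v: v == "CSTR 1.3",
    "source": lambda v: v.lower() in SOURCES,      # case-insensitive (assumed)
    "image count": is_integer,
    "input form": lambda v: v in FORMS,
    "suggested print form": lambda v: v in FORMS,
    "input size": is_size,
    "suggested print size": is_size,
    "date scanned": is_date,
    "resolution(dpi)": is_integer,
    "greyscale depth(bits)": is_integer,
    "scanner settings": lambda v: v == "default",
}
FREE_TEXT = {
    "publishing department", "report label", "document series",
    "scanner used", "software used", "operator", "note", "comment",
    "scanning agent", "text quality",   # in the example, no stated constraint
}


def validate(fields):
    """Return (invalid, unrecognized) lists of field names."""
    invalid, unrecognized = [], []
    for name, value in fields.items():
        key = name.lower()
        if key in CHECKS:
            if not CHECKS[key](value.strip()):
                invalid.append(name)
        elif key not in FREE_TEXT:
            unrecognized.append(name)   # treated as a comment by a browser
    return invalid, unrecognized
```

A browser would normally discard the second list, and report it only when the user has asked to be told about unrecognized fields.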


The Map field of the document scanning record is flexible enough to allow any desired file name to be used for any component of the document. However, to minimize confusion, the following conventions are used in naming the files.

Image file names consist of five components separated by hyphens and followed by a suffix:

The various images (in TIFF format), the scanning record (in text format), and the format specification of the scanning record (also in text format) are collected in a directory or folder which is named by the first four components above, separated by hyphens.

Note that upper- and lower-case letters in the file names of "Map" fields must be recorded exactly as they appear in the actual file name, because on some computer systems upper- and lower-case letters are distinct.


1. The "Document series" field is probably not needed, because the information is contained in the document label field.

2. The original document date isn't captured. Should it be?

3. Need a systematic way to insert scanning agent logo on cover page.

4. The content identifiers "scancontrol" and "spine" could be handled as "supporting", with their intended use appearing as comments.

5. There is no support for bringing the reader's attention to the copyright notice. One suspects that some scheme will be needed for a browser to locate that notice and display it. (Since TR's and TM's generally don't have copyright notices, this issue isn't particularly pressing.)

6. It may be appropriate to record observations by the scanning operator more systematically, for example that the original is skewed, off-center, or wrinkled, or that it contains a photograph or other non-textual material.

7. The checksum of the scanning record that is recorded in the scanning record has the value of zero. The specification calls for it to be ignored because simply inserting the current checksum in the record would change its checksum. A procedure for adding a fiddle field at the end of the record whose value is chosen to force the actual checksum of the file to zero or some predictable value should be developed. Alternatively, the checksum for this file should be defined as being the checksum that is obtained when the checksum field for this file is replaced with zero.

8. It might be a good idea to add an "end:" field to make it easier to figure out that the scan record is complete.

9. The information captured about page numbers isn't really sufficient to establish which image is "next" in cases where one page is scanned twice (once in black and white and a second time in color) or is scanned in two parts, as in an oversize foldout.

10. There isn't enough information to figure out how to paste back together two image files to produce a single image (e.g., a separately scanned color photo that belongs in the middle of a page of monochrome text).
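As an illustration of the alternative raised in item 7, the following Python sketch computes a 16-bit rotating checksum in the manner of the BSD "sum" command, and then computes the checksum of a scanning record with its own checksum field replaced by zeros. The function names are hypothetical. The sketch also assumes that the record's own Map line is the one whose content identifier is "scanrecord", and it leaves the file-size field of that line untouched.

```python
import re


def bsd_sum(data: bytes) -> int:
    """16-bit checksum: rotate right one bit, then add each byte."""
    checksum = 0
    for byte in data:
        checksum = (checksum >> 1) | ((checksum & 1) << 15)
        checksum = (checksum + byte) & 0xFFFF
    return checksum


# Matches the Map line of the scanning record itself and isolates its
# checksum digits (assumption: identified by the "scanrecord" identifier).
SELF_MAP = re.compile(
    rb"^(Map:[ \t]*\S+[ \t]+\d+[ \t]+)\d+([ \t]+scanrecord\b)",
    re.MULTILINE,
)


def self_checksum(record: bytes) -> int:
    """Checksum of the record with its own checksum field set to 00000."""
    zeroed, count = SELF_MAP.subn(rb"\g<1>00000\g<2>", record)
    if count != 1:
        raise ValueError("expected exactly one scanrecord Map line")
    return bsd_sum(zeroed)


def format_checksum(value: int) -> str:
    """Render a checksum as the five-digit string used in Map fields."""
    return "%05d" % value
```

Under this definition the value stored in the record does not affect the result, so a program could write the checksum into the record and later verify it without a separate fiddle field.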



This note is expanded from a set of ideas originally developed at a Library 2000 group meeting on March 17, 1994. Discussants: Jack Eisan, Mitchell Charity, Ali Alavi, Sally Richter, Mary Anne Ladd, Jeremy Hylton, Geoff Seyon, Eytan Adar, Greg Anderson, Jerry Saltzer. Since that time additional suggestions and ideas have come from Michael Cook, Gillian Elcock, Yoav Yerushalmi, and Andrew Kass.

For more information please contact Jerry Saltzer,