Appendix C

The CSTR Scanned Document Record

C.1 The CSTR Scanned Document Record, version CSTR 1.3

January 26, 1995 
By Jerry Saltzer, Jack Eisan, and Mike Cook
1. OBJECTIVES
The objective of this document is to specify both the form and content of the information 
that must be captured when a document is scanned, as an on-line record that becomes a 
component of the scanned form of the document.
The objective of the scanned document record is to capture information that is not explicit 
in the scanned image, yet is needed:
  1. to view, display, or print the image properly.
  2. to understand how to interpret the image.
  3. to meet contractual or legal requirements.
2. GENERAL APPROACH
The form of a scanned document record is a series of named fields, one per line. Each field 
line begins with the field name followed by a colon, then the field value. Any line of the 
scanned document record may have a comment at the end, preceded by a semicolon. Any 
program reading the file will ignore all comments. A field line may be of any length, but it 
may NOT include typed carriage returns. This form is intended to be easily created by a 
human operator using a word processor or spreadsheet, but it is to be used by computer 
programs that browse, display, or print documents, so the contents of some of the fields 
must conform to specific standards.
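The line format described above can be read with very little code. The following is a minimal sketch, not part of the specification: the function names `parse_field_line` and `parse_record` are hypothetical, and the sketch assumes that the first semicolon on a line always starts a comment and that the first colon always ends the field name.

```python
def parse_field_line(line):
    """Split one scanned-document-record line into (name, value).

    Returns None for a line that is empty once its comment is removed.
    """
    # Assumption: everything after the first semicolon is a comment.
    body = line.split(";", 1)[0].strip()
    if not body:
        return None
    # Assumption: the field name runs up to the first colon.
    name, sep, value = body.partition(":")
    if not sep:
        raise ValueError(f"no colon in field line: {line!r}")
    return name.strip(), value.strip()


def parse_record(text):
    """Return the fields of a record as (name, value) pairs in file order."""
    fields = []
    for line in text.splitlines():
        parsed = parse_field_line(line)
        if parsed is not None:
            fields.append(parsed)
    return fields
```

A list of pairs is returned rather than a dictionary so that a field name appearing on several lines keeps every occurrence, in order.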
The scanned document record does not specify the format of the image files, but it is 
intended for use with image files that are in the TIFF format.
The content of a scanned document record is most easily explained by first exhibiting a 
complete example, and then describing the requirements on field contents:
3. A COMPLETE EXAMPLE
Scanning record version: CSTR 1.3 
Publishing department: M. I. T. Lab for Computer Science 
Report label: MIT-LCS-TM-13 
Source: first-generation original 
Document series: TM 
Image count: 30 
Scanning agent: M. I. T. Document Services 
Input form: single-sided 
Input size: 8.5 x 11 
Suggested print form: double-sided 
Suggested print size: 6 x 9 
Scanner used: Fujitsu 3096g ADF 
Software used: Optix version 1.01 
Operator: Michael Cook 
Date Scanned: 9/28/1994 
Resolution(dpi): 400 
Greyscale depth(bits): 8 
Scanner settings: default 
Text quality: normal 
Note: 
Comment:  filename  size  cksum  contents   comments
Map:  scanrec.cstr.1.3.txt  45102  67890  format 
Map:  MIT-LCS-TR-13-srec.txt  113076  12345  scanrecord 
Map:  MIT-LCS-TR-13-001.tif  8417048  12345  cover  
Map:  MIT-LCS-TR-13-002.tif  8417048  01234  blank  
Map:  MIT-LCS-TR-13-003.tif  8417048  78912  unnumbered; title page
Map:  MIT-LCS-TR-13-004.tif  8417048  56789  blank  
Map:  MIT-LCS-TR-13-005.tif  8417048  23456  unnumbered; acknowledgment 
Map:  MIT-LCS-TR-13-006.tif  8417048  90123  blank  
Map:  MIT-LCS-TR-13-007.tif  8417048  67890  numbered 1  
Map:  MIT-LCS-TR-13-008.tif  8417048  34567  numbered 2  
Map:  MIT-LCS-TR-13-009.tif  8417048  01234  numbered 3  
Map:  MIT-LCS-TR-13-010.tif  8417048  78901  numbered 4  
Map:  MIT-LCS-TR-13-011.tif  8417048  45678  numbered 5  
Map:  MIT-LCS-TR-13-012.tif  8417048  12345  numbered 6  
Map:  MIT-LCS-TR-13-013.tif  8417048  89012  numbered 7  
Map:  MIT-LCS-TR-13-014.tif  8417048  56789  numbered 8  
Map:  MIT-LCS-TR-13-015.tif  8417048  23456  numbered 9  
Map:  MIT-LCS-TR-13-016.tif  8417048  90123  numbered 10; original skewed  
Map:  MIT-LCS-TR-13-017.tif  8417048  67890  numbered 11  
Map:  MIT-LCS-TR-13-018.tif  8417048  34567  numbered 12  
Map:  MIT-LCS-TR-13-019.tif  8417048  01234  numbered 13  
Map:  MIT-LCS-TR-13-020.tif  8417048  78901  numbered 14; original says page 16
Map:  MIT-LCS-TR-13-021.tif  8417048  45678  numbered 15  
Map:  MIT-LCS-TR-13-022.tif  8417048  12345  numbered 16  
Map:  MIT-LCS-TR-13-023.tif  8417048  89012  numbered 17  
Map:  MIT-LCS-TR-13-024.tif  8417048  56789  spine
Map:  MIT-LCS-TR-13-025.tif  8417048  23456  supporting; instructions to printer
Map:  MIT-LCS-TR-13-026.tif  8417048  90123  doccontrol
Map:  MIT-LCS-TR-13-027.tif  8417048  67890  calibration IEEE-167a-1987  
Map:  MIT-LCS-TR-13-028.tif  8417048  34567  calibration AIIM-#2  
Map:  MIT-LCS-TR-13-029.tif  8417048  01234  agent
Map:  MIT-LCS-TR-13-030.tif  8417048  78789  scancontrol
4. REQUIREMENTS ON THE FIELDS
The following requirements on field contents allow a computer program to interpret the 
scanning record unambiguously:
Scanning record version: must contain exactly the string shown in the example. The version 
number is incremented only for changes in the specification that are incompatible 
with the practice of the prior version. If a change in the specification does not require a 
change in the practice, the version number is unchanged.
Publishing department, Report label, Document series, Scanner used, Software used, and 
Operator: each can contain any string of characters. Conventionally, the report label 
will begin with the string "MIT-".
Source: One of the strings "First-generation original", "Later-generation copy", or 
"PostScript".
Image count: Must contain an integer. (The image count is the number of image files created 
for this document, including calibration, identification, and blank targets, etc. It 
matches the number in the image-number component of the highest-numbered Map 
record.)
Input form and Suggested print form: Must contain either "single-sided" or "double-
sided".
Input size and Suggested print size: Must contain two decimal numbers separated by the 
letter "x". The numbers are assumed to be measurements in inches.
Date scanned: Month/Day/Year, each component being an integer (leading zeros omitted) 
and the year being four digits.
Resolution and Greyscale depth: must be integers.
Scanner settings: At this time the only allowed value for this field is "default".
Note: an optional field containing any desired string. If this field is not empty, a properly 
constructed browser or print program will display its contents to the reader.
Comment: an optional field containing any desired string. It is ignored by browsers.
Map: Each file that comprises a single scanned document is represented by its own Map: 
field. This is the only field that may appear more than once in the scanning record. All 
Map fields must be together at the end of the scanning record. A Map field consists of an 
image file name, the file size, a five-digit checksum computed using the method of the 
BSD UNIX "sum" command, and an image content identifier. The checksum value that 
appears in the map for the scanning record file itself must be zero, and for the moment its 
value is ignored. The content identifier must be one of the following:
  scanrecord    Document scanning record
  format        Scan Record specification (this specification)
  scancontrol   Scanning project control form
  doccontrol    Government control form
  calibration   Test target
  agent         Logo of the scanning agent
  blank         Blank page inserted to maintain duplex printing order
  spine         Image intended for the book spine
  cover         Image intended for the book cover
  unnumbered    Page for which it appears that the publisher did not intend a page number
  numbered      Page of text that the publisher intended to number, whether or not that number is printed on the page
  supporting    Material not intended for display or printing
The content identifier "calibration" is followed by a string that identifies a calibration test 
target. The content identifier "numbered" is followed by a string that represents the 
publisher's originally assigned page number.
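As an illustration of the Map field layout and the checksum it carries, here is a hedged sketch in Python. The names `MapEntry`, `parse_map_value`, `bsd_sum`, and `matches` are hypothetical. The sketch assumes that the trailing comment has already been removed, that the columns are separated by whitespace, and that the checksum is the 16-bit rotate-right-and-add algorithm of the historic BSD "sum" command.

```python
from typing import NamedTuple

CONTENT_IDS = {
    "scanrecord", "format", "scancontrol", "doccontrol", "calibration",
    "agent", "blank", "spine", "cover", "unnumbered", "numbered", "supporting",
}


class MapEntry(NamedTuple):
    filename: str
    size: int
    checksum: int
    content: str  # content identifier, e.g. "numbered"
    detail: str   # page number or calibration target name, else ""


def parse_map_value(value):
    """Parse the value part of a Map field (comment already removed)."""
    parts = value.split(None, 4)
    if len(parts) < 4:
        raise ValueError(f"Map field needs at least four columns: {value!r}")
    filename, size, checksum, content = parts[:4]
    detail = parts[4].strip() if len(parts) > 4 else ""
    if content not in CONTENT_IDS:
        raise ValueError(f"unknown content identifier: {content!r}")
    if not (checksum.isdigit() and len(checksum) == 5):
        raise ValueError(f"checksum must be five digits: {checksum!r}")
    return MapEntry(filename, int(size), int(checksum), content, detail)


def bsd_sum(data):
    """16-bit rotating checksum in the style of the BSD "sum" command."""
    checksum = 0
    for byte in data:
        # Rotate right by one bit within 16 bits, then add the next byte.
        checksum = (checksum >> 1) | ((checksum & 1) << 15)
        checksum = (checksum + byte) & 0xFFFF
    return checksum


def matches(entry, data):
    """True if the file contents agree with the recorded size and checksum."""
    return len(data) == entry.size and bsd_sum(data) == entry.checksum
```

A caller would read each listed file in binary mode and pass its bytes to `matches`, skipping the entry for the scanning record itself, whose checksum is ignored.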
The order of the Map fields is taken to be the order in which the images they describe are 
intended to be printed or displayed, except when the input form is double-sided. In that 
case all pages within the numbered region are assumed to have been scanned odd sides 
first, even sides last and reversed.
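One reading of the double-sided rule is that the first half of the numbered region holds the odd sides in order and the second half holds the even sides in reverse order. Under that assumption, and the further assumption that the region contains an even number of images, the display order could be recovered as sketched below. The function name `reading_order` is hypothetical.

```python
def reading_order(scanned):
    """Reorder images scanned odd sides first, then even sides reversed.

    `scanned` is the list of images of the numbered region in Map order.
    """
    if len(scanned) % 2:
        raise ValueError("expected an even number of sides")
    half = len(scanned) // 2
    odd_sides = scanned[:half]
    even_sides = scanned[half:][::-1]  # undo the reversal of the even sides
    ordered = []
    for odd, even in zip(odd_sides, even_sides):
        ordered.extend([odd, even])
    return ordered
```

For example, images scanned as pages 1, 3, 5, 6, 4, 2 come back in the order 1 through 6.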
A browser should treat any field that it does not recognize as a comment, and ignore it 
unless the user has asked to be told about unrecognized fields.
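The simpler field requirements in this section lend themselves to mechanical checking. The sketch below is one possible checker, not a reference implementation. The name `check_field` is hypothetical, and field names are matched without regard to case, which is an assumption made here because the example writes "Date Scanned" while the requirement writes "Date scanned". Fields the checker does not recognize are accepted, mirroring the rule that a browser treats them as comments.

```python
import re

SIDED = {"single-sided", "double-sided"}
# Two decimal numbers separated by the letter "x", e.g. "8.5 x 11".
SIZE_RE = re.compile(r"^\s*(\d+(?:\.\d+)?)\s*x\s*(\d+(?:\.\d+)?)\s*$")
# Month/Day/Year with no leading zeros and a four-digit year.
DATE_RE = re.compile(r"^([1-9]\d?)/([1-9]\d?)/(\d{4})$")


def check_field(name, value):
    """Return an error string for a bad value, or None if it is acceptable."""
    key = name.strip().lower()
    if key == "scanning record version":
        return None if value == "CSTR 1.3" else "must be exactly 'CSTR 1.3'"
    if key in ("input form", "suggested print form"):
        return None if value in SIDED else "must be single-sided or double-sided"
    if key in ("input size", "suggested print size"):
        return None if SIZE_RE.match(value) else "must be two numbers separated by 'x'"
    if key == "date scanned":
        m = DATE_RE.match(value)
        if not m:
            return "must be Month/Day/Year without leading zeros"
        month, day = int(m.group(1)), int(m.group(2))
        if not (1 <= month <= 12 and 1 <= day <= 31):
            return "month or day out of range"
        return None
    if key in ("image count", "resolution(dpi)", "greyscale depth(bits)"):
        ok = value.isascii() and value.isdigit()
        return None if ok else "must be an integer"
    if key == "scanner settings":
        return None if value == "default" else "only 'default' is allowed"
    return None  # free-form or unrecognized field: nothing to check
```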
5. DOCUMENT NAMING CONVENTIONS
The Map field of the document scanning record is flexible enough to allow any desired file 
name to be used for any component of the document. However, to minimize confusion, 
the following conventions are used in naming the files.
Image file names consist of five components separated by hyphens and followed by a 
suffix:
  1. The first component of all file names of documents originating at MIT is the string "MIT".
  2. The second component of the file name is a string that identifies the department or laboratory that originated the document, e.g. "LCS" or "AI".
  3. The third component of the file name is a string that identifies the document series, e.g. "TM", "TR", "AIM", etc.
  4. The fourth component of the file name is the series tag of the document. This tag is usually an integer such as "417", but it may include letters, as in "418a", and in some cases it may be an arbitrary string of letters.
  5. The fifth component of an image file name is the image number, with enough leading zeros that all image file names for the document are of the same length. Image numbers start with 1 and are consecutive in the order the images were scanned.
  6. The file name of an image file ends with the string ".tif".
  7. The file name of the scanning record ends with the string "srec.txt".
  8. The complete file name of the format specification of the scanning record is simply "format.txt".
The various images (in TIFF format), the scanning record (in text format), and the format 
specification of the scanning record (also in text format) are collected in a directory or 
folder which is named by the first four components above, separated by hyphens.
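The naming conventions above can be expressed as a few small helpers. This is a sketch only: the function names are hypothetical, the default width of three digits is taken from the complete example rather than from any stated rule, and joining "srec.txt" to the label with a hyphen is likewise inferred from the example.

```python
def document_label(dept, series, tag, origin="MIT"):
    """Join the first four components, which also name the directory."""
    return "-".join([origin, dept, series, tag])


def image_file_name(label, image_number, width=3):
    """Build an image file name with a zero-padded image number."""
    if image_number < 1:
        raise ValueError("image numbers start with 1")
    if len(str(image_number)) > width:
        raise ValueError("width too small for this image number")
    return f"{label}-{image_number:0{width}d}.tif"


def scan_record_file_name(label):
    """Build the scanning record file name (hyphen assumed from the example)."""
    return f"{label}-srec.txt"
```

With these helpers, `image_file_name(document_label("LCS", "TR", "13"), 7)` gives "MIT-LCS-TR-13-007.tif", which has the same shape as the names in the complete example.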
Note that upper- and lower-case letters in the file names of "Map" fields must be recorded 
exactly as they appear in the actual file name, because on some computer systems upper- 
and lower-case letters are distinct.
6. OPEN QUESTIONS AND THINGS FOR FURTHER THOUGHT OR FUTURE REVISIONS
1. The "Document series" field is probably not needed, because the information is 
contained in the "Report label" field.
2. The original document date isn't captured. Should it be?
3. Need a systematic way to insert scanning agent logo on cover page.
4. The content identifiers "scancontrol" and "spine" could be handled as "supporting", 
with their intended use appearing as comments.
5. There is no support for bringing the reader's attention to the copyright notice. One 
suspects that some scheme will be needed for a browser to locate that notice and display it. 
(Since TRs and TMs generally don't have copyright notices, this issue isn't particularly 
pressing.)
6. It may be appropriate to record the scanning operator's observations more systematically, 
for example that the original is skewed, off-center, or wrinkled, or that it contains a 
photograph or other non-textual material.
7. The checksum of the scanning record that is recorded in the scanning record has the 
value of zero. The specification calls for it to be ignored because simply inserting the 
current checksum in the record would change its checksum. A procedure for adding a fiddle 
field at the end of the record whose value is chosen to force the actual checksum of the file 
to zero or some predictable value should be developed. Alternatively, the checksum for 
this file should be defined as being the checksum that is obtained when the checksum field 
for this file is replaced with zero.
8. It might be a good idea to add an "end:" field to make it easier to figure out that the scan 
record is complete.
9. The information captured about page numbers isn't really sufficient to establish which 
image is "next" in cases where one page is scanned twice (once in black and white and a 
second time in color) or is scanned in two parts, as in an oversize foldout.
10. There isn't enough information to figure out how to paste back together two image 
files to produce a single image (e.g., a separately scanned color photo that belongs in the 
middle of a page of monochrome text).
REVISION HISTORY
ACKNOWLEDGEMENT
This note is expanded from a set of ideas originally developed at a Library 2000 group 
meeting on March 17, 1994. Discussants: Jack Eisan, Mitchell Charity, Ali Alavi, Sally 
Richter, Mary Anne Ladd, Jeremy Hylton, Geoff Seyon, Eytan Adar, Greg Anderson, Jerry 
Saltzer. Since that time additional suggestions and ideas have come from Michael Cook, 
Gillian Elcock, Yoav Yerushalmi, and Andrew Kass.
For more information contact Jerry Saltzer <Saltzer@mit.edu> 
