The CSTR Scanned Document Record
January 26, 1995
By Jerry Saltzer, Jack Eisan, and Mike Cook
The objective of this document is to specify both the form and content of the information
that must be captured when a document is scanned, as an on-line record that becomes a
component of the scanned form of the document.
The objective of the scanned document record is to capture information that is not explicit
in the scanned image, yet is needed:
- to view, display, or print the image properly.
- to understand how to interpret the image.
- to meet contractual or legal requirements.
2. GENERAL APPROACH
The form of a scanned document record is a series of named fields, one per line. Each field
line begins with the field name followed by a colon, then the field value. Any line of the
scanned document record may have a comment at the end, preceded by a semicolon. Any
program reading the file will ignore all comments. A field line may be of any length, but it
may NOT include typed carriage returns. This form is intended to be easily created by a
human operator using a word processor or spreadsheet, but it is to be used by computer
programs that browse, display, or print documents, so the contents of some of the fields
must conform to specific standards.
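The line format described above can be read with very little code. The following is a minimal sketch, not part of the specification: the function names parse_field_line and parse_record are hypothetical, and the sketch assumes that the first colon on a line ends the field name and that the first semicolon starts the comment.

```python
def parse_field_line(line):
    """Split one record line into (name, value); return None if it holds no field."""
    # Everything from the first semicolon onward is a comment and is dropped.
    body = line.split(";", 1)[0].strip()
    if not body:
        return None
    # Assumption: the field name runs up to the first colon on the line.
    name, sep, value = body.partition(":")
    if not sep:
        return None
    return name.strip(), value.strip()


def parse_record(text):
    """Return the fields of a record as a list of (name, value) pairs in file order."""
    fields = []
    for line in text.splitlines():
        parsed = parse_field_line(line)
        if parsed is not None:
            fields.append(parsed)
    return fields
```

A list of pairs is used rather than a dictionary because one field name may legitimately appear on many lines, and because the order of the lines carries meaning.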
The scanned document record does not specify the format of the image files, but it is
intended for use with image files that are in the TIFF format.
The content of a scanned document record is most easily explained by first exhibiting a
complete example, and then describing the requirements on field contents:
3. A COMPLETE EXAMPLE
Scanning record version: CSTR 1.3
Publishing department: M. I. T. Lab for Computer Science
Report label: MIT-LCS-TM-13
Source: first-generation original
Document series: TM
Image count: 30
Scanning agent: M. I. T. Document Services
Input form: single-sided
Input size: 8.5 x 11
Suggested print form: double-sided
Suggested print size: 6 x 9
Scanner used: Fujitsu 3096g ADF
Software used: Optix version 1.01
Operator: Michael Cook
Date Scanned: 9/28/1994
Greyscale depth(bits): 8
Scanner settings: default
Text quality: normal
Comment: filename size cksum contents comments
Map: scanrec.cstr.1.3.txt 45102 67890 format
Map: MIT-LCS-TR-13-srec.txt 113076 12345 scanrecord
Map: MIT-LCS-TR-13-001.tif 8417048 12345 cover
Map: MIT-LCS-TR-13-002.tif 8417048 01234 blank
Map: MIT-LCS-TR-13-003.tif 8417048 78912 unnumbered; title page
Map: MIT-LCS-TR-13-004.tif 8417048 56789 blank
Map: MIT-LCS-TR-13-005.tif 8417048 23456 unnumbered; acknowledgment
Map: MIT-LCS-TR-13-006.tif 8417048 90123 blank
Map: MIT-LCS-TR-13-007.tif 8417048 67890 numbered 1
Map: MIT-LCS-TR-13-008.tif 8417048 34567 numbered 2
Map: MIT-LCS-TR-13-009.tif 8417048 01234 numbered 3
Map: MIT-LCS-TR-13-010.tif 8417048 78901 numbered 4
Map: MIT-LCS-TR-13-011.tif 8417048 45678 numbered 5
Map: MIT-LCS-TR-13-012.tif 8417048 12345 numbered 6
Map: MIT-LCS-TR-13-013.tif 8417048 89012 numbered 7
Map: MIT-LCS-TR-13-014.tif 8417048 56789 numbered 8
Map: MIT-LCS-TR-13-015.tif 8417048 23456 numbered 9
Map: MIT-LCS-TR-13-016.tif 8417048 90123 numbered 10; original skewed
Map: MIT-LCS-TR-13-017.tif 8417048 67890 numbered 11
Map: MIT-LCS-TR-13-018.tif 8417048 34567 numbered 12
Map: MIT-LCS-TR-13-019.tif 8417048 01234 numbered 13
Map: MIT-LCS-TR-13-020.tif 8417048 78901 numbered 14; original says
Map: MIT-LCS-TR-13-021.tif 8417048 45678 numbered 15
Map: MIT-LCS-TR-13-022.tif 8417048 12345 numbered 16
Map: MIT-LCS-TR-13-023.tif 8417048 89012 numbered 17
Map: MIT-LCS-TR-13-024.tif 8417048 56789 spine
Map: MIT-LCS-TR-13-025.tif 8417048 23456 supporting; instructions to
Map: MIT-LCS-TR-13-026.tif 8417048 90123 doccontrol
Map: MIT-LCS-TR-13-027.tif 8417048 67890 calibration IEEE-167a-1987
Map: MIT-LCS-TR-13-028.tif 8417048 34567 calibration AIIM-#2
Map: MIT-LCS-TR-13-029.tif 8417048 01234 agent
Map: MIT-LCS-TR-13-030.tif 8417048 78789 scancontrol
4. REQUIREMENTS ON THE FIELDS
The following requirements on the field contents allow a computer program to
interpret the scanning record unambiguously:
Scanning record version: must contain exactly the string shown in the example. The version
number is incremented only for changes in the specification that are incompatible
with the practice of the prior version. If a change in the specification does not require a
change in the practice, the version number is unchanged.
Publishing department, Report label, Document series, Scanner used, Software
used, and Operator: each can contain any string of characters. Conventionally, the report
label will begin with the string "MIT-".
Source: One of the strings "First-generation original", "Later-generation copy", or "Post
Image count: Must contain an integer. (The image count is the number of image files created
for this document, including calibration, identification, and blank targets, etc. It
matches the number in the image-number component of the highest-numbered Map
Input form and Suggested print form: Must contain either "single-sided" or "double-sided".
Input size and Suggested print size: Must contain two decimal numbers separated by the
letter "x". The numbers are assumed to be measurements in inches.
Date scanned: Month/Day/Year, each component being an integer (leading zeros omitted)
and the year being four digits.
Resolution and Greyscale depth: Must be integers.
Scanner settings: At this time the only allowed value for this field is "default".
Note: an optional field containing any desired string. If this field is not empty, a properly
constructed browser or print program will display its contents to the reader.
Comment: an optional field containing any desired string. It is ignored by browsers.
Map: Each file that comprises a single scanned document is represented by its own Map:
field. This is the only field that may appear more than once in the scanning record. All
Map fields must be together at the end of the scanning record. A Map field consists of an
image file name, the file size, a five-digit checksum computed using the method of the
BSD UNIX "sum" command, and an image content identifier. The checksum value that
appears in the map for the scanning record file itself must be zero, and for the moment its
value is ignored. The content identifier must be one of the following:
scanrecord    Document scanning record
format        Scan Record specification (this specification)
scancontrol   Scanning project control form
doccontrol    Government control form
calibration   Test target
agent         Logo of the scanning agent
blank         Blank page inserted to maintain duplex printing
spine         Image intended for the book spine
cover         Image intended for the book cover
unnumbered    Page for which it appears that the publisher did
              not intend a page number
numbered      Page of text that the publisher intended to
              number, whether or not that number is printed on
supporting    Material not intended for display or printing
The content identifier "calibration" is followed by a string that identifies a calibration test
target. The content identifier "numbered" is followed by a string that represents the
publisher's originally assigned page number.
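As an illustration of the Map field layout, here is a minimal sketch of a parser for the value of one Map field. The names MapEntry and parse_map_value are hypothetical. The sketch assumes that the items are separated by white space, that any trailing comment has already been removed, and that a checksum of fewer than five digits (for example a lone "0") is acceptable.

```python
from dataclasses import dataclass

CONTENT_IDENTIFIERS = {
    "scanrecord", "format", "scancontrol", "doccontrol", "calibration",
    "agent", "blank", "spine", "cover", "unnumbered", "numbered", "supporting",
}


@dataclass
class MapEntry:
    file_name: str
    size: int
    checksum: str   # kept as text so that leading zeros such as "01234" survive
    content: str
    argument: str   # test target for "calibration", page number for "numbered"


def parse_map_value(value):
    """Parse the value of one Map field (comment already stripped)."""
    parts = value.split(None, 4)
    if len(parts) < 4:
        raise ValueError(f"Map field needs at least four items: {value!r}")
    file_name, size, checksum, content = parts[:4]
    argument = parts[4].strip() if len(parts) > 4 else ""
    if not size.isdigit():
        raise ValueError(f"file size is not an integer: {size!r}")
    # Assumption: one to five decimal digits are accepted as a checksum.
    if not checksum.isdigit() or len(checksum) > 5:
        raise ValueError(f"bad checksum: {checksum!r}")
    if content not in CONTENT_IDENTIFIERS:
        raise ValueError(f"unknown content identifier: {content!r}")
    if content in ("calibration", "numbered") and not argument:
        raise ValueError(f"{content!r} must be followed by a string")
    return MapEntry(file_name, int(size), checksum, content, argument)
```

For example, parse_map_value("MIT-LCS-TR-13-007.tif 8417048 67890 numbered 1") yields an entry whose content is "numbered" and whose argument is "1".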
The order of the Map fields is taken to be the order in which the images they describe are
intended to be printed or displayed, except when the input form is double-sided. In that
case all pages within the numbered region are assumed to have been scanned odd sides
first, even sides last and reversed.
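One possible reading of the double-sided rule is sketched below. It is an interpretation, not a definitive algorithm: it assumes that the numbered region was scanned as all odd sides in order followed by all even sides in reverse order, and that when the region holds an odd number of images the extra image is an odd side. The function name duplex_print_order is hypothetical.

```python
def duplex_print_order(scanned):
    """Reorder images that were scanned odd sides first, then even sides reversed."""
    n_odd = (len(scanned) + 1) // 2        # odd sides occupy the first half
    odd_sides = scanned[:n_odd]
    even_sides = scanned[n_odd:][::-1]     # undo the reversal of the even sides
    ordered = []
    for i, odd in enumerate(odd_sides):
        ordered.append(odd)
        if i < len(even_sides):
            ordered.append(even_sides[i])
    return ordered


# Six sides scanned as 1, 3, 5, then 6, 4, 2 come back in reading order.
print(duplex_print_order(["p1", "p3", "p5", "p6", "p4", "p2"]))
```

Only the Map entries inside the numbered region would be passed to such a function; entries before and after that region keep their order in the file.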
A browser should treat any field that it does not recognize as a comment, and ignore it
unless the user has asked to be told about unrecognized fields.
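Several of the value requirements in this section can be checked mechanically. The sketch below is illustrative only; the names parse_size, parse_date_scanned, and check_field are hypothetical. It compares field names without regard to case, which is an assumption made because the example writes "Date Scanned" while the requirement is listed under "Date scanned". Fields that the sketch does not recognize produce no complaint, in keeping with the rule for unrecognized fields.

```python
import re
from datetime import date

SIZE_RE = re.compile(r"^\s*(\d+(?:\.\d+)?)\s*x\s*(\d+(?:\.\d+)?)\s*$")
DATE_RE = re.compile(r"^([1-9]\d?)/([1-9]\d?)/(\d{4})$")   # no leading zeros


def parse_size(value):
    """Return (width, height) in inches from a value such as '8.5 x 11'."""
    m = SIZE_RE.match(value)
    if not m:
        raise ValueError(f"expected two numbers separated by 'x': {value!r}")
    return float(m.group(1)), float(m.group(2))


def parse_date_scanned(value):
    """Return a date from Month/Day/Year with a four-digit year."""
    m = DATE_RE.match(value.strip())
    if not m:
        raise ValueError(f"expected Month/Day/Year: {value!r}")
    month, day, year = (int(g) for g in m.groups())
    return date(year, month, day)   # raises ValueError for impossible dates


def check_field(name, value):
    """Return a list of problems found in one field (empty if none)."""
    key = name.strip().lower()
    try:
        if key == "image count":
            int(value)
        elif key in ("input form", "suggested print form"):
            if value not in ("single-sided", "double-sided"):
                raise ValueError(f"unexpected form: {value!r}")
        elif key in ("input size", "suggested print size"):
            parse_size(value)
        elif key == "date scanned":
            parse_date_scanned(value)
        elif key == "scanner settings":
            if value != "default":
                raise ValueError(f"only 'default' is allowed: {value!r}")
    except ValueError as err:
        return [f"{name}: {err}"]
    return []
```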
5. DOCUMENT NAMING CONVENTIONS
The Map field of the document scanning record is flexible enough to allow any desired file
name to be used for any component of the document. However, to minimize confusion,
the following conventions are used in naming the files.
Image file names consist of five components separated by hyphens and followed by a suffix:
- The first component of all file names of documents originating at MIT is the string "MIT".
- The second component of the file name is a string that identifies the department or laboratory that originated the document, e.g. "LCS" or "AI".
- The third component of the file name is a string that identifies the document series, e.g. "TM", "TR", "AIM", etc.
- The fourth component of the file name is the series tag of the document. This tag is usually an integer such as "417", but it may include letters, as in "418a", and in some cases it may be an arbitrary string of letters.
- The fifth component of an image file name is the image number, with enough leading zeros that all image file names for the document are of the same length. Image numbers start with 1 and are consecutive in the order the images were scanned.
- The file name of an image file ends with the string ".tif".
- The file name of the scanning record ends with the string "srec.txt".
- The complete file name of the format specification of the scanning record is simply "format.txt".
The various images (in TIFF format), the scanning record (in text format), and the format
specification of the scanning record (also in text format) are collected in a directory or
folder which is named by the first four components above, separated by hyphens.
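The naming conventions above can be expressed as a short sketch. The function names are hypothetical. The minimum width of three digits for the image number is an assumption taken from the example in section 3; the conventions themselves only require that all image file names have the same length. The splitting function also assumes that the first three components contain no hyphens, so any extra hyphens are taken to belong to the series tag.

```python
def image_file_names(department, series, tag, image_count, min_width=3):
    """Return the conventional image file names for one document, in scan order."""
    # Pad with leading zeros so that every name has the same length.
    width = max(min_width, len(str(image_count)))
    stem = f"MIT-{department}-{series}-{tag}"
    return [f"{stem}-{n:0{width}d}.tif" for n in range(1, image_count + 1)]


def split_image_file_name(file_name):
    """Split a conventional image file name into its five components."""
    if not file_name.endswith(".tif"):
        raise ValueError(f"not an image file name: {file_name!r}")
    stem = file_name[: -len(".tif")]
    head, _, number = stem.rpartition("-")
    parts = head.split("-", 3)
    if len(parts) != 4 or not number.isdigit():
        raise ValueError(f"unexpected file name layout: {file_name!r}")
    origin, department, series, tag = parts
    return origin, department, series, tag, int(number)


names = image_file_names("LCS", "TR", "13", 30)
print(names[0], names[-1])
```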
Note that upper- and lower-case letters in the file names of "Map" fields must be recorded
exactly as they appear in the actual file name, because on some computer systems upper-
and lower-case letters are distinct.
6. OPEN QUESTIONS AND THINGS FOR FURTHER THOUGHT OR FUTURE
1. The "Document series" field is probably not needed, because the information is
contained in the document label field.
2. The original document date isn't captured. Should it be?
3. Need a systematic way to insert scanning agent logo on cover page.
4. The content identifiers "scancontrol" and "spine" could be handled as "supporting",
with their intended use appearing as comments.
5. There is no support for bringing the reader's attention to the copyright notice. One
suspects that some scheme will be needed for a browser to locate that notice and display it.
(Since TR's and TM's generally don't have copyright notices, this issue isn't particularly
6. It may be appropriate to more systematically record observations by the scanning
operator, such as that the original is skewed, off-center, or wrinkled, or that it contains
a photograph or other non-textual material.
7. The checksum of the scanning record that is recorded in the scanning record has the
value of zero. The specification calls for it to be ignored because simply inserting the
current checksum in the record would change its checksum. A procedure for adding a fiddle
field at the end of the record whose value is chosen to force the actual checksum of the file
to zero or some predictable value should be developed. Alternatively, the checksum for
this file should be defined as being the checksum that is obtained when the checksum field
for this file is replaced with zero.
8. It might be a good idea to add an "end:" field to make it easier to figure out that the scan
record is complete.
9. The information captured about page numbers isn't really sufficient to establish which
image is "next" in cases where one page is scanned twice (once in black and white and a
second time in color) or is scanned in two parts, as in an oversize foldout.
10. There isn't enough information to figure out how to paste back together two image
files to produce a single image (e.g., a separately scanned color photo that belongs in the
middle of a page of monochrome text).
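The second alternative in item 7 can be sketched as follows. The function bsd_sum follows the rotate-and-add method of the BSD "sum" command. The function self_checksum is a hypothetical illustration of the proposal, not an agreed procedure: it assumes that "replaced with zero" means writing the digits 00000 in place of the checksum item on the Map line that names the scanning record file, and that items on that line are separated by spaces or tabs.

```python
import re


def bsd_sum(data: bytes) -> int:
    """Return the 16-bit checksum computed by the BSD 'sum' method."""
    checksum = 0
    for byte in data:
        # Rotate the 16-bit checksum right by one bit, then add the next byte.
        checksum = (checksum >> 1) | ((checksum & 1) << 15)
        checksum = (checksum + byte) & 0xFFFF
    return checksum


def self_checksum(record: bytes, record_file_name: str) -> int:
    """Checksum a scanning record with its own checksum item set to zeros."""
    pattern = re.compile(
        rb"^(Map:[ \t]*" + re.escape(record_file_name.encode("ascii"))
        + rb"[ \t]+\d+[ \t]+)\d+",
        re.MULTILINE,
    )
    # Assumption: the stored checksum is replaced by five zero digits.
    zeroed = pattern.sub(rb"\g<1>00000", record)
    return bsd_sum(zeroed)


record = b"Operator: Michael Cook\nMap: MIT-LCS-TR-13-srec.txt 113076 12345 scanrecord\n"
print(f"{self_checksum(record, 'MIT-LCS-TR-13-srec.txt'):05d}")
```

Under this definition the result does not depend on the value currently stored in the record, so the computed checksum can be written back into the file without invalidating it.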
This note is expanded from a set of ideas originally developed at a Library 2000 group
meeting on March 17, 1994. Discussants: Jack Eisan, Mitchell Charity, Ali Alavi, Sally
Richter, Mary Anne Ladd, Jeremy Hylton, Geoff Seyon, Eytan Adar, Greg Anderson, Jerry
Saltzer. Since that time additional suggestions and ideas have come from Michael Cook,
Gillian Elcock, Yoav Yerushalmi, and Andrew Kass.
For more information contact Jerry Saltzer <Saltzer@mit.edu>