Scanned document record, version CSTR 1.1 (second draft)
November 9, 1994
By Jerry Saltzer, Jack Eisan, and Mike Cook


OBJECTIVES

The objective of this document is to specify both the form and content of
the information that must be captured when a document is scanned, as an
on-line record that becomes a component of the scanned form of the
document.

The objective of the scanned document record is to capture information that
is not explicit in the scanned image, yet is needed:

1.  to view, display, or print the image properly.

2.  to understand how to interpret the image.

3.  to meet contractual or legal requirements.


GENERAL APPROACH

The form of a scanned document record is a series of named fields, one per
line.  Each field line begins with the field name followed by a colon, then
the field value. Any line of the scanned document record may have a comment
at the end, preceded by a semicolon.  Any program reading the file will
ignore all comments.  A field line may be of any length, but it may NOT
include typed carriage returns.  This form is intended to be easily created
by a human operator using a word processor or spreadsheet, but it is to be
used by computer programs that browse, display, or print documents, so the
contents of some of the fields must conform to specific standards.
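
The line format described above can be read with very little code.  The
following Python sketch is illustrative only and is not part of the
specification.  The function name parse_record is hypothetical, and the
sketch assumes that a semicolon never appears inside a field value and
that the first colon on a line ends the field name.

```python
def parse_record(text):
    """Parse a scanned document record into a list of (name, value) pairs.

    A list is used rather than a dict because the Map field may repeat.
    """
    fields = []
    for raw in text.splitlines():
        # Assumption: everything from the first semicolon on is a comment.
        line = raw.split(";", 1)[0].strip()
        if not line:
            continue  # blank line, or a line holding only a comment
        # Assumption: the first colon separates field name from field value.
        name, sep, value = line.partition(":")
        if not sep:
            raise ValueError(f"not a field line: {raw!r}")
        fields.append((name.strip(), value.strip()))
    return fields
```

An empty field, such as a Note with nothing after the colon, comes back
with an empty string as its value.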

The content of a scanned document record is most easily explained by first
exhibiting a complete example, and then describing the requirements on
field contents:


A COMPLETE EXAMPLE

Scanned document record version: CSTR 1.1
Publishing department: M. I. T. Lab for Computer Science
Report label:  MIT-LCS-TM-13
Document series: TM
Image count:  30
Scanning agent:  M. I. T. Document Services
Input form: single-sided
Input size:  8.5 x 11
Suggested print form: double-sided
Suggested print size:  6 x 9
Scanner used: Fujitsu 3096g ADF
Software used:  Optix version 1.01
Operator:  Michael Cook
Date scanned: 9/28/1994
Resolution(dpi): 400
Greyscale depth(bits): 8
Scanner settings: default
Text quality:  normal
File name:  LCS-TM-13
Scanned document record file name:  scanrecord.txt
Note:
Map:  image-1   cover
Map:  image-2   blank
Map:  image-3   unnumbered        ;  title page
Map:  image-4   blank
Map:  image-5   unnumbered         ;  acknowledgement
Map:  image-6   blank
Map:  image-7   numbered 1
Map:  image-8   numbered 2
Map:  image-9   numbered 3
Map:  image-10  numbered 4
Map:  image-11  numbered 5
Map:  image-12  numbered 6
Map:  image-13  numbered 7
Map:  image-14  numbered 8
Map:  image-15  numbered 9
Map:  image-16  numbered 10            ; original skewed
Map:  image-17  numbered 11
Map:  image-18  numbered 12
Map:  image-19  numbered 13
Map:  image-20  numbered 14            ; original says page 16
Map:  image-21  numbered 15
Map:  image-22  numbered 16
Map:  image-23  numbered 17
Map:  image-24  spine
Map:  image-25  supporting          ; instructions to printer
Map:  image-26  doccontrol
Map:  image-27  calibration IEEE-167a-1987
Map:  image-28  calibration AIIM-#2
Map:  image-29  agent
Map:  image-30  scancontrol


REQUIREMENTS ON THE FIELDS

Here are the requirements that the field contents must meet so that a
computer program can interpret the scanned document record unambiguously:

Scanned document record version:  must contain exactly the string shown in
the example.

Publishing department, Report label, Document series, Scanner used,
Software used, and Operator:  each can contain any string of characters.
Conventionally, the report label will begin with the string "MIT-".

Image count:  Must contain an integer.  (The image count is the number of
image files created for this document, including calibration,
identification, and blank targets, etc.  It matches the number in the
highest-numbered Map record.)
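
A reading program could check this consistency rule as sketched below.
This is only an illustration: the function name check_image_count is
hypothetical, and the sketch assumes that every image file name has the
form "image-N" used in the complete example, which the specification does
not itself require.

```python
import re


def check_image_count(image_count, map_values):
    """Return True if image_count equals the highest image number mapped.

    image_count is the text of the "Image count" field; map_values is the
    list of Map field values, for example "image-7   numbered 1".
    """
    numbers = []
    for value in map_values:
        # Assumption: image file names look like "image-<number>".
        match = re.match(r"image-(\d+)\b", value.strip())
        if match is None:
            raise ValueError(f"unexpected image file name in: {value!r}")
        numbers.append(int(match.group(1)))
    return bool(numbers) and int(image_count) == max(numbers)
```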

Input form and Suggested print form:  Must contain either "single-sided" or
"double-sided".

Input size and Suggested print size:  Must contain two decimal numbers
separated by the letter "x".  The numbers are assumed to be measurements
in inches.
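
As an illustration, a size value might be parsed as in the following
sketch.  The function name parse_size is hypothetical.  The sketch assumes
that optional spaces may surround the "x", as in the complete example, and
it returns the two numbers in the order written, since the specification
does not say which one is the width.

```python
import re

# Two unsigned decimal numbers separated by a lower-case "x".
_SIZE = re.compile(r"\s*(\d+(?:\.\d+)?)\s*x\s*(\d+(?:\.\d+)?)\s*")


def parse_size(value):
    """Return the two measurements, in inches, as a pair of floats."""
    match = _SIZE.fullmatch(value)
    if match is None:
        raise ValueError(f"not a size: {value!r}")
    return float(match.group(1)), float(match.group(2))
```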

Date scanned:  Month/Day/Year, each component being an integer (leading
zeros omitted) and the year being four digits.
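
The date rule can be checked mechanically.  The sketch below is one
possible reading of it, not a required implementation; the function name
parse_date_scanned is hypothetical.  It rejects leading zeros and years
that are not exactly four digits, and it relies on Python's datetime
module to reject impossible dates.

```python
import datetime
import re

# Month and day without leading zeros, then a four-digit year.
_DATE = re.compile(r"([1-9]\d?)/([1-9]\d?)/(\d{4})")


def parse_date_scanned(value):
    """Convert a Month/Day/Year field value into a datetime.date."""
    match = _DATE.fullmatch(value.strip())
    if match is None:
        raise ValueError(f"not a Month/Day/Year date: {value!r}")
    month, day, year = (int(part) for part in match.groups())
    # datetime.date raises ValueError for dates such as 2/30/1994.
    return datetime.date(year, month, day)
```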

Resolution and Greyscale depth:  must be integers.

Scanner settings:  At this time the only allowed value for this field is
"default".

File name:  A string that is used to prefix the names of all image and data
files produced in the course of scanning this document.  Note that upper-
and lower-case letters in this field must appear exactly as they appear in
the actual file names, because on some computer systems upper- and
lower-case letters are distinct.

Scanned document record file name:  The exact file name of the scanning
record itself with the file name prefix omitted.

Note:  an optional field containing any desired string.  If this field is
not empty, a properly constructed browser or print program will display its
contents to the reader.

Map:  Every scanned image is represented by a Map: field.  This is the only
field that may appear more than once in the scanning record.  All Map
fields must be together at the end of the scanning record, in order by
image number.  A Map field consists of an image file name followed by an
image content identifier.  The content identifier must be one of the
following:

     scancontrol              Scanning project control form
     doccontrol               Government control form
     calibration              Test target
     agent                    Logo of the scanning agent
     blank                    Blank page inserted to maintain duplex printing order
     spine                    Image intended for the book spine
     cover                    Image intended for the book cover
     unnumbered               Page for which it appears that
                                 the publisher did not intend a page number
     numbered                 Page of text that the publisher intended to number
     supporting               Material not intended for display or printing

The content identifier "calibration" is followed by a string that
identifies a calibration test target.  The content identifier "numbered" is
followed by a string that represents the publisher's originally assigned
page number.  Except for numbered pages, the order of the Map fields is
taken to be the order in which the images they describe are intended to be
printed or displayed.  Numbered pages may appear in the map in any order.
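
To make the Map field rules concrete, here is a minimal sketch of how a
program might split and check one Map value.  The names parse_map_value,
CONTENT_IDENTIFIERS, and NEEDS_ARGUMENT are hypothetical.  The identifier
set follows the list above, using the spelling "doccontrol" that appears
in the complete example.  The sketch assumes that whitespace separates the
parts, and it tolerates extra text after identifiers that take no
argument, since the specification is silent on that point.

```python
CONTENT_IDENTIFIERS = {
    "scancontrol", "doccontrol", "calibration", "agent", "blank",
    "spine", "cover", "unnumbered", "numbered", "supporting",
}

# Identifiers that must be followed by a target name or a page number.
NEEDS_ARGUMENT = {"calibration", "numbered"}


def parse_map_value(value):
    """Split a Map value into (image file name, identifier, argument)."""
    parts = value.strip().split(None, 2)
    if len(parts) < 2:
        raise ValueError(f"incomplete Map value: {value!r}")
    image, identifier = parts[0], parts[1]
    argument = parts[2] if len(parts) == 3 else ""
    if identifier not in CONTENT_IDENTIFIERS:
        raise ValueError(f"unknown content identifier: {identifier!r}")
    if identifier in NEEDS_ARGUMENT and not argument:
        raise ValueError(f"{identifier!r} requires a following string")
    return image, identifier, argument
```

The page number is kept as a string rather than converted to an integer,
because the publisher's page numbers need not be numeric.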


OPEN QUESTIONS AND THINGS FOR FURTHER THOUGHT OR FUTURE REVISIONS

1.  The "Document series" field is probably not needed, because the
information is contained in the "Report label" field.

2.  Original and input sheet sizes are not separately recorded; perhaps
they should be. (Some older TR's were printed in a 6 x 9-inch format from
8.5 x 11-inch originals, and other TR's may be printed from pages produced
on cut-sheet printers.)

4.  The original document date isn't captured.  Should it be?

5.  Need a systematic way to insert scanning agent logo on cover page.

6.  The content identifiers "scancontrol" and "spine" could be handled as
"supporting", with their intended use appearing as comments.

7.  There is no support for bringing the reader's attention to the
copyright notice.  One suspects that some scheme will be needed for a
browser to locate that notice and display it.  (Since TR's and TM's
generally don't have copyright notices, this issue isn't particularly
pressing.)

8.  It may be appropriate to record observations by the scanning operator
more systematically, for example that the original is skewed, off-center,
or wrinkled, or that it contains a photograph or other non-textual
material.


ACKNOWLEDGEMENT

This note is expanded from a set of ideas originally developed at a Library
2000 group meeting on March 17, 1994.  Discussants:  Jack Eisan, Mitchell
Charity, Ali Alavi, Sally Richter, Mary Anne Ladd, Jeremy Hylton, Geoff
Seyon, Eytan Adar, Greg Anderson, Jerry Saltzer.  Since that time
additional suggestions and ideas have come from Michael Cook, Gillian
Elcock, Yoav Yerushalmi, and Andrew Kass.