Scanned document record, version CSTR 1.1
October 30, 1994
By Jerry Saltzer

The objective of this document is to specify both the form and content of
the information that must be captured when a document is scanned, as an
on-line record that becomes a component of the scanned form of the
document.

The objective of the scanning record is to capture information that is not
explicit in the scanned image, yet is needed:

1.  to view, display, or print the image properly.

2.  to understand how to interpret the image.

3.  to meet contractual or legal requirements.

The form of a scanning record is a series of named fields, one per line.
Each field line begins with the field name followed by a colon, then the
field value. Any line of the scanning record may have a comment at the end,
preceded by a semicolon.  Any program reading the file will ignore all
comments.  A field line may be of any length, but it may NOT include typed
carriage returns.  This form is intended to be easily created by a human
operator using a word processor, but it is to be used by computer programs
that browse, display, or print documents, so the contents of some of the
fields must conform to specific standards.
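
The line format above can be read with a few lines of code. The following is a minimal sketch in Python, not part of the specification. The function names parse_field_line and parse_record are hypothetical, and the sketch assumes that the first semicolon on a line always starts the comment (so a field value cannot itself contain a semicolon) and that blank or comment-only lines may simply be skipped.

```python
def parse_field_line(line):
    """Split one scanning-record line into (name, value, comment)."""
    # Assumption: the first semicolon starts the trailing comment.
    body, semicolon, comment = line.partition(";")
    # The field name ends at the first colon; the rest is the field value.
    name, colon, value = body.partition(":")
    if not colon:
        raise ValueError(f"not a field line: {line!r}")
    return name.strip(), value.strip(), (comment.strip() if semicolon else None)


def parse_record(text):
    """Return the (name, value) pairs of a scanning record, in file order."""
    fields = []
    for line in text.splitlines():
        # Assumption: blank lines and comment-only lines carry no field.
        if not line.partition(";")[0].strip():
            continue
        name, value, _comment = parse_field_line(line)
        fields.append((name, value))
    return fields
```

A list of pairs is used rather than a dictionary because the Map field may appear more than once and its order is significant.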

The content of a scanning record is most easily explained by first
exhibiting a complete example, and then describing the requirements on
field contents:

Scanning record version: CSTR 1.1
Publishing department: M. I. T. Lab for Computer Science
Technical report label:  MIT-LCS-TM-13
Document series: TM
Page count:  24
Image count:  31
Scanning agent:  M. I. T. Document Services
Original form: single-sided
Original size:  8.5 x 11
Intended print form: double-sided
Intended print size:  6 x 9
Scanner used: Fujitsu 3096g ADF
Software used:  Optix version 1.01
Operator:  Michael Cook
Date Scanned: 9/28/1994
Resolution(dpi): 400
Greyscale depth(bits): 8
Scanner settings: default
File name generator:  LCS-TM-13-
Scanning record file name:  LCS-TM-13-data.txt
Note:
Map:  LCS-TM-13-image-01=cover
Map:  LCS-TM-13-image-02=blank
Map:  LCS-TM-13-image-03=unnumbered          ;  title page
Map:  LCS-TM-13-image-04=blank
Map:  LCS-TM-13-image-05=unnumbered          ;  acknowledgement
Map:  LCS-TM-13-image-06=blank
Map:  LCS-TM-13-image-07=page 1
Map:  ...
Map:  LCS-TM-13-image-24=page 18
Map:  LCS-TM-13-other-01=spine
Map:  LCS-TM-13-other-02=supporting          ; instructions to printer
Map:  LCS-TM-13-other-03=doccontrol
Map:  LCS-TM-13-other-04=calibration IEEE-167a-1987
Map:  LCS-TM-13-other-05=calibration AIIM-#2
Map:  LCS-TM-13-other-06=agent
Map:  LCS-TM-13-other-07=scancontrol



The following requirements on the field contents allow a computer program
to interpret the scanning record unambiguously:

Scanning record version:  must contain exactly the string shown in the example.

Publishing department, Technical report label, Document series, Scanner
used, Software used, and Operator:  each can contain any string of
characters.

Page count and Image count:  must contain integers.  (The page count is the
number of different original sides that need to be scanned.  The image
count is the number of image files created for this document, including
calibration, identification, and blank targets, etc.)

Original form and Intended print form:  Must contain either "single-sided"
or "double-sided".

Original size and Intended print size:  Must contain two decimal numbers
separated by the letter "x".  The numbers are assumed to be measurements
in inches.
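
As an illustration, a size field could be checked as sketched below. This is only a sketch: the name parse_size is hypothetical, and it assumes a lower-case "x", optional spaces around it, and plain decimal notation such as "8.5" or "11". The two numbers are returned in the order written, since the field description does not say which is width and which is height.

```python
import re

# Two decimal numbers separated by the letter "x", e.g. "8.5 x 11".
SIZE_RE = re.compile(r"^\s*(\d+(?:\.\d+)?)\s*x\s*(\d+(?:\.\d+)?)\s*$")


def parse_size(value):
    """Return the two measurements of a size field, in inches, as floats."""
    match = SIZE_RE.match(value)
    if match is None:
        raise ValueError(f"not a valid size: {value!r}")
    return float(match.group(1)), float(match.group(2))
```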

Date scanned:  Month/Day/Year, each component being an integer and the year
being four digits.
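
A date in this form could be validated as in the following sketch. The name parse_date_scanned is hypothetical, and the sketch assumes one- or two-digit month and day components. Constructing a datetime.date also rejects impossible dates such as month 13.

```python
import datetime
import re

# Month/Day/Year with a four-digit year, e.g. "9/28/1994".
DATE_RE = re.compile(r"^(\d{1,2})/(\d{1,2})/(\d{4})$")


def parse_date_scanned(value):
    """Convert a Month/Day/Year string into a datetime.date."""
    match = DATE_RE.match(value.strip())
    if match is None:
        raise ValueError(f"not a Month/Day/Year date: {value!r}")
    month, day, year = (int(part) for part in match.groups())
    # datetime.date raises ValueError for out-of-range components.
    return datetime.date(year, month, day)
```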

Resolution and Greyscale depth:  must be integers.

Scanner settings:  At this time the only allowed value for this field is
"default".

File name generator:  A string that is used to prefix the names of all
image and data files produced in the course of scanning this document.
Upper- and lower-case letters must appear in this field exactly as they
appear in the actual file names, because some computer systems treat
upper- and lower-case letters as distinct.
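
A program might use this field to check that every file name in the record carries the expected prefix. The following is a small sketch under that assumption; the name check_file_names is hypothetical, and the comparison is deliberately case-sensitive.

```python
def check_file_names(generator, file_names):
    """Return the file names that do not begin with the generator prefix."""
    # str.startswith compares case-sensitively, so "lcs-" does not match "LCS-".
    return [name for name in file_names if not name.startswith(generator)]
```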

Scanning record file name:  The exact file name of the scanning record itself.

Note:  an optional field containing any desired string.  If this field is
not empty, a properly constructed browser or print program will display its
contents to the reader.

Map:  Every scanned image is represented by a Map: field.  This is the only
field that may appear more than once in the scanning record.  All Map
fields must be together at the end of the scanning record.  With one
exception, every Map field consists of an image file name followed by an
equal sign followed by an image content identifier.  The content identifier
must be one of the following:

     scancontrol              Scanning project control form
     control                  Government control form
     calibration              Test target
     agent                    Logo of the scanning agent
     blank                    Blank page inserted to maintain duplex printing order
     spine                    Image intended for the book spine
     cover                    Image intended for the book cover
     unnumbered               Publisher did not intend a page number
     page                     Publisher's intended page number, even if omitted
     supporting               Material not intended for display or printing

The content identifier "calibration" is followed by a string that
identifies a calibration test target.  The content identifier "page" is
followed by a string that represents a publisher's originally assigned page
number.  A Map field may contain just the string "...", which means that
the preceding and following Map fields are the beginning and the end of a
consecutive series of numbered pages.  The order of the Map fields is taken
to be the order in which the images they describe are intended to be
printed or displayed.
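
The Map rules above, including the "..." shorthand, can be sketched in Python as follows. This is an illustration under stated assumptions, not a definitive reader: the function names are hypothetical, comments are assumed to have been stripped already, page numbers across a "..." are assumed to be decimal integers, and the image file names on either side of it are assumed to end in a number of fixed width, as in the example record.

```python
import re


def parse_map_value(value):
    """Split 'file=identifier [argument]' into (file_name, identifier, argument)."""
    file_name, equals, content = value.partition("=")
    if not equals:
        raise ValueError(f"missing '=' in Map value: {value!r}")
    # "calibration" and "page" are followed by a string; others have none.
    identifier, _, argument = content.strip().partition(" ")
    return file_name.strip(), identifier, argument.strip()


def split_numbered_name(file_name):
    """Split 'LCS-TM-13-image-07' into ('LCS-TM-13-image-', '07')."""
    match = re.fullmatch(r"(.*?)(\d+)", file_name)
    if match is None:
        raise ValueError(f"file name has no trailing number: {file_name!r}")
    return match.group(1), match.group(2)


def expand_maps(values):
    """Return Map entries in display order, expanding each '...' into pages."""
    entries = []
    for i, value in enumerate(values):
        if value.strip() != "...":
            entries.append(parse_map_value(value))
            continue
        if i == 0 or i == len(values) - 1:
            raise ValueError("'...' needs a Map field before and after it")
        # The neighbours of '...' bound a run of consecutively numbered pages.
        first_file, first_id, first_page = parse_map_value(values[i - 1])
        last_file, last_id, last_page = parse_map_value(values[i + 1])
        if first_id != "page" or last_id != "page":
            raise ValueError("'...' must sit between two 'page' entries")
        stem, first_num = split_numbered_name(first_file)
        _, last_num = split_numbered_name(last_file)
        gap = int(last_num) - int(first_num)
        if gap != int(last_page) - int(first_page):
            raise ValueError("image and page numbers disagree across '...'")
        # Fill in only the entries strictly between the two neighbours.
        for k in range(1, gap):
            name = f"{stem}{int(first_num) + k:0{len(first_num)}d}"
            entries.append((name, "page", str(int(first_page) + k)))
    return entries
```

The returned list preserves the order of the Map fields, which is the order in which the images are intended to be printed or displayed.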


-------------------------

Open questions:

1.  The "Document series" field is probably not needed, because the
information is contained in the document label field.

2.  The "File name generator" field may not be needed, because all file
names are explicitly listed in the map.

3.  Original and printed sheet sizes are not recorded; they probably should
be, because some older TR's are printed in a 6 x 9-inch format from 8.5 x
11-inch originals, and other TR's may be printed from pages produced on
cut-sheet printers.

4.  The original document date isn't captured.  Should it be?

5.  A systematic way is needed to insert the scanning agent's logo on the
cover page.

6.  The content identifiers "scancontrol" and "spine" could be handled as
"supporting", with their intended use appearing as comments.

7.  The conventional file names suggested here are a little different from
those currently in use, but they are completely arbitrary.  The form of the
map defined here eliminates any need for a program to gather information
from a file name except when using consecutive image numbers to deduce
consecutive page numbers.

8.  There is no support for bringing the reader's attention to the
copyright notice.  One suspects that some scheme will be needed for a
browser to locate that notice and display it.  (Since TR's and TM's
generally don't have copyright notices, this issue isn't particularly
pressing.)

------------------------------------------------


Acknowledgement:  This note is expanded from a set of ideas originally
developed at a Library 2000 group meeting on March 17, 1994.  Discussants:
Jack Eisan, Mitchell Charity, Ali Alavi, Sally Richter, Mary Anne Ladd,
Jeremy Hylton, Geoff Seyon, Eytan Adar, Greg Anderson, Jerry Saltzer.
Since that time additional suggestions and ideas have come from Michael
Cook, Gillian Elcock, Yoav Yerushalmi, and Andrew Kass.