[The title and date appearing in the header below are incorrect.
 This document appears to be the third draft of version 1.1, created
November 18, 1994, with the addition of an unspecified checksum field.]

Scanned document record, version CSTR 1.1 (second draft)
November 9, 1994
By Jerry Saltzer, Jack Eisan, and Mike Cook

[A list of changes since the previous draft will be found at the end.]

OBJECTIVES

The objective of this document is to specify both the form and content of
the information that must be captured when a document is scanned, as an
on-line record that becomes a component of the scanned form of the
document.

The objective of the scanned document record is to capture information that
is not explicit in the scanned image, yet is needed:

1.  to view, display, or print the image properly.

2.  to understand how to interpret the image.

3.  to meet contractual or legal requirements.


GENERAL APPROACH

The form of a scanned document record is a series of named fields, one per
line.  Each field line begins with the field name followed by a colon, then
the field value. Any line of the scanned document record may have a comment
at the end, preceded by a semicolon.  Any program reading the file will
ignore all comments.  A field line may be of any length, but it may NOT
include typed carriage returns.  This form is intended to be easily created
by a human operator using a word processor or spreadsheet, but it is to be
used by computer programs that browse, display, or print documents, so the
contents of some of the fields must conform to specific standards.
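
The line format just described can be read with very little code.  The
following Python sketch is illustrative only and is not part of the
specification.  It assumes that a semicolon never appears inside a field
value, that white space around names and values is not significant, and
that blank lines may be skipped; the function names are hypothetical.

```python
def parse_field_line(line):
    """Split one record line into (field name, field value).

    Returns None for a line that is empty once its comment is removed.
    """
    # Everything from the first semicolon onward is a comment and is ignored.
    # Assumption: field values never contain a semicolon themselves.
    body = line.split(";", 1)[0].strip()
    if not body:
        return None
    # The field name is everything before the first colon.
    name, colon, value = body.partition(":")
    if not colon:
        raise ValueError(f"no colon in field line: {line!r}")
    return name.strip(), value.strip()


def parse_record(text):
    """Return the fields of a record, in order, as (name, value) pairs."""
    fields = []
    for line in text.splitlines():
        parsed = parse_field_line(line)
        if parsed is not None:
            fields.append(parsed)
    return fields
```

A list of pairs is used rather than a dictionary because one field name,
Map, may legitimately appear many times in a single record.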

The content of a scanned document record is most easily explained by first
exhibiting a complete example, and then describing the requirements on
field contents:


A COMPLETE EXAMPLE

Scanned document record version: CSTR 1.1
Publishing department: M. I. T. Lab for Computer Science
Report label:  MIT-LCS-TM-13
Source:  first-generation original
Document series: TM
Image count:  30
Scanning agent:  M. I. T. Document Services
Input form: single-sided
Input size:  8.5 x 11
Suggested print form: double-sided
Suggested print size:  6 x 9
Scanner used: Fujitsu 3096g ADF
Software used:  Optix version 1.01
Operator:  Michael Cook
Date Scanned: 9/28/1994
Resolution(dpi): 400
Greyscale depth(bits): 8
Scanner settings: default
Text quality:  normal
Scanned document record file name:  MIT-TR-13-scanrecord.txt
Note:
Map:  MIT-TR-13-image-1   1234567 cover
Map:  MIT-TR-13-image-2   8901234 blank
Map:  MIT-TR-13-image-3   5678912 unnumbered        ;  title page
Map:  MIT-TR-13-image-4   3456789 blank
Map:  MIT-TR-13-image-5   0123456 unnumbered         ;  acknowledgement
Map:  MIT-TR-13-image-6   7890123 blank
Map:  MIT-TR-13-image-7   4567890 numbered 1
Map:  MIT-TR-13-image-8   1234567 numbered 2
Map:  MIT-TR-13-image-9   8901234 numbered 3
Map:  MIT-TR-13-image-10  5678901 numbered 4
Map:  MIT-TR-13-image-11  2345678 numbered 5
Map:  MIT-TR-13-image-12  9012345 numbered 6
Map:  MIT-TR-13-image-13  6789012 numbered 7
Map:  MIT-TR-13-image-14  3456789 numbered 8
Map:  MIT-TR-13-image-15  0123456 numbered 9
Map:  MIT-TR-13-image-16  7890123 numbered 10          ; original skewed
Map:  MIT-TR-13-image-17  4567890 numbered 11
Map:  MIT-TR-13-image-18  1234567 numbered 12
Map:  MIT-TR-13-image-19  8901234 numbered 13
Map:  MIT-TR-13-image-20  5678901 numbered 14          ; original says page 16
Map:  MIT-TR-13-image-21  2345678 numbered 15
Map:  MIT-TR-13-image-22  9012345 numbered 16
Map:  MIT-TR-13-image-23  6789012 numbered 17
Map:  MIT-TR-13-image-24  3456789 spine
Map:  MIT-TR-13-image-25  0123456 supporting           ; instructions to printer
Map:  MIT-TR-13-image-26  7890123 doccontrol
Map:  MIT-TR-13-image-27  4567890 calibration IEEE-167a-1987
Map:  MIT-TR-13-image-28  1234567 calibration AIIM-#2
Map:  MIT-TR-13-image-29  8901234 agent
Map:  MIT-TR-13-image-30  5678789 scancontrol


REQUIREMENTS ON THE FIELDS

The following requirements on the field contents allow a computer program
to interpret the scanning record unambiguously:

Scanned document record version:  must contain exactly the string shown in
the example.

Publishing department, Report label, Document series, Scanner used,
Software used, and Operator:  each can contain any string of characters.
Conventionally, the report label will begin with the string "MIT-".

Source:  One of the strings "First-generation original", "Later-generation
copy", or "PostScript".

Image count:  Must contain an integer.  (The image count is the number of
image files created for this document, including calibration,
identification, and blank targets, etc.  It matches the number in the
highest-numbered Map record.)
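
The consistency rule in parentheses can be checked mechanically.  The
sketch below is an illustration, not part of the specification.  It assumes
that every image file name ends in "image-" followed by the image number,
as in the example; the function names are hypothetical.

```python
import re


def image_number(file_name):
    """Return the image number at the end of an image file name."""
    # Assumption: names end in "image-<number>", as in the example record.
    match = re.search(r"image-(\d+)$", file_name)
    if match is None:
        raise ValueError(f"no image number in file name: {file_name!r}")
    return int(match.group(1))


def image_count_matches(image_count_value, map_file_names):
    """True if the Image count equals the highest Map image number."""
    count = int(image_count_value)  # raises ValueError if not an integer
    highest = max((image_number(n) for n in map_file_names), default=0)
    return count == highest
```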

Input form and Suggested print form:  Must contain either "single-sided" or
"double-sided".

Input size and Suggested print size:  Must contain two decimal numbers
separated by the letter "x".  The numbers are assumed to be measurements
in inches.
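
As an illustration, a size value such as "8.5 x 11" might be read as
follows.  This is a sketch under assumptions the specification does not
state: white space around the "x" is optional, only a lower-case "x" is
accepted, and each number is written with digits and at most one decimal
point.  The function name is hypothetical.

```python
import re

# Two decimal numbers separated by the letter "x".
SIZE_PATTERN = re.compile(r"^\s*(\d+(?:\.\d+)?)\s*x\s*(\d+(?:\.\d+)?)\s*$")


def parse_size(value):
    """Return the two measurements, in inches, in the order written."""
    match = SIZE_PATTERN.match(value)
    if match is None:
        raise ValueError(f"not a valid size: {value!r}")
    return float(match.group(1)), float(match.group(2))
```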

Date scanned:  Month/Day/Year, each component being an integer (leading
zeros omitted) and the year being four digits.
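
A date in this form can be checked and converted as in the sketch below,
which is illustrative only.  It takes "leading zeros omitted" strictly, so
that "09/28/1994" is rejected; a more forgiving reader might accept it.
The function name is hypothetical.

```python
import re
from datetime import date

# Month/Day/Year: month and day without leading zeros, four-digit year.
DATE_PATTERN = re.compile(r"^([1-9]\d?)/([1-9]\d?)/(\d{4})$")


def parse_date_scanned(value):
    """Convert a Month/Day/Year string to a datetime.date."""
    match = DATE_PATTERN.match(value.strip())
    if match is None:
        raise ValueError(f"not a valid scan date: {value!r}")
    month, day, year = (int(group) for group in match.groups())
    # date() itself raises ValueError for impossible dates such as 2/30.
    return date(year, month, day)
```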

Resolution and Greyscale depth:  Must be integers, giving dots per inch and
bits respectively, as the field names in the example indicate.

Scanner settings:  At this time the only allowed value for this field is
"default".

File name:  A string that is used to prefix the names of all image and data
files produced in the course of scanning this document.  Note that upper-
and lower-case letters in the file names appear in this field exactly as
they appear in the actual file name, because on some computer systems upper-
and lower-case letters are distinct.

Scanned document record file name:  The exact file name of the scanning
record itself with the file name prefix omitted.

Note:  an optional field containing any desired string.  If this field is
not empty, a properly constructed browser or print program will display its
contents to the reader.

Map:  Every scanned image is represented by a Map: field.  This is the only
field that may appear more than once in the scanning record.  All Map
fields must be together at the end of the scanning record, in order by
image number.  A Map field consists of an image file name, a checksum, and
an image content identifier.  The content identifier must be one of the
following:

     scancontrol              Scanning project control form
     control                  Government control form
     calibration              Test target
     agent                    Logo of the scanning agent
     blank                    Blank page inserted to maintain duplex printing order
     spine                    Image intended for the book spine
     cover                    Image intended for the book cover
     unnumbered               Page for which it appears that
                                 the publisher did not intend a page number
     numbered                 Page of text that the publisher intended to number
     supporting               Material not intended for display or printing

The content identifier "calibration" is followed by a string that
identifies a calibration test target.  The content identifier "numbered" is
followed by a string that represents the publisher's originally assigned
page number.  Except for numbered pages, the order of the Map fields is
taken to be the order in which the images they describe are intended to be
printed or displayed.  Numbered pages may appear in the map in any order.
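
The value of a Map field can be taken apart as in the following sketch,
which is illustrative and not part of the specification.  It assumes the
three parts are separated by white space and that the file name and
checksum contain none.  The checksum format is not specified, so it is
kept as an uninterpreted string.  The identifier list above names
"control" while the complete example uses "doccontrol"; the sketch accepts
both, which is an assumption.  All names in the code are hypothetical.

```python
from typing import NamedTuple, Optional

# Identifiers from the list above, plus "doccontrol" as used in the example.
CONTENT_IDENTIFIERS = {
    "scancontrol", "control", "doccontrol", "calibration", "agent",
    "blank", "spine", "cover", "unnumbered", "numbered", "supporting",
}
# Identifiers that must be followed by a further string.
NEEDS_ARGUMENT = {"calibration", "numbered"}


class MapEntry(NamedTuple):
    file_name: str
    checksum: str            # format unspecified; kept as text
    identifier: str
    argument: Optional[str]  # test target name or publisher's page number


def parse_map_value(value):
    """Parse the value of one Map field, with any comment already removed."""
    parts = value.split(None, 3)
    if len(parts) < 3:
        raise ValueError(f"incomplete Map field: {value!r}")
    file_name, checksum, identifier = parts[:3]
    argument = parts[3].strip() if len(parts) == 4 else None
    if identifier not in CONTENT_IDENTIFIERS:
        raise ValueError(f"unknown content identifier: {identifier!r}")
    if identifier in NEEDS_ARGUMENT and argument is None:
        raise ValueError(f"{identifier!r} must be followed by a string")
    return MapEntry(file_name, checksum, identifier, argument)
```

The page number is kept as a string rather than converted to an integer,
since the specification describes it only as a string that represents the
publisher's page number.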


OPEN QUESTIONS AND THINGS FOR FURTHER THOUGHT OR FUTURE REVISIONS

1.  The "Document series" field is probably not needed, because the
information is contained in the document label field.

2.  Original and input sheet sizes are not separately recorded; perhaps
they should be. (Some older TR's were printed in a 6 x 9-inch format from
8.5 x 11-inch originals, and other TR's may be printed from pages produced
on cut-sheet printers.)

4.  The original document date isn't captured.  Should it be?

5.  Need a systematic way to insert scanning agent logo on cover page.

6.  The content identifiers "scancontrol" and "spine" could be handled as
"supporting", with their intended use appearing as comments.

7.  There is no support for bringing the reader's attention to the
copyright notice.  One suspects that some scheme will be needed for a
browser to locate that notice and display it.  (Since TR's and TM's
generally don't have copyright notices, this issue isn't particularly
pressing.)

8.  It may be appropriate to record observations by the scanning operator
more systematically, for example that the original is skewed, off-center,
or wrinkled, or that it contains a photograph or other non-textual material.


ACKNOWLEDGEMENT

This note is expanded from a set of ideas originally developed at a Library
2000 group meeting on March 17, 1994.  Discussants:  Jack Eisan, Mitchell
Charity, Ali Alavi, Sally Richter, Mary Anne Ladd, Jeremy Hylton, Geoff
Seyon, Eytan Adar, Greg Anderson, Jerry Saltzer.  Since that time
additional suggestions and ideas have come from Michael Cook, Gillian
Elcock, Yoav Yerushalmi, and Andrew Kass.


-------------------------------------------------------------------

Changes in the second draft:

1.  The page count field is gone, on the basis that the concept is
ill-defined.  The only count field is now the image count, which is a
precise concept.

2.  The names of the fields "Original form" and "Original size" are now
"Input form" and "Input size" to reflect the observation that these are the
form and size of the pages actually scanned.  The first-generation
originals may have had a different form or size.

3.  The names of the fields "Intended print form" and "Intended print size"
are now "Suggested print form" and "Suggested print size" to reflect the
possibility that the original intent may not be known; this is the best
guess of the current publisher.

4.  There is a new field "Text quality" with the allowed values "light",
"dark" and "normal", to capture observations by the scanning operator. 
This field is a first cut at communicating this class of information from
the scanning operator to the display process and it is likely to evolve.

5.  The field "file name generator" is now simply "File name".  In addition
to using this string as the prefix for each image file name, it will also
be used conventionally as the name of the folder or directory that contains
all the image and related files for a single report.  Also, the
conventional value of the field no longer ends with a hyphen; the hyphen is
understood to be inserted whenever the name is used as a prefix.

6.  The value of the scanning record file name field no longer includes the
file name prefix, but the file name itself does carry the prefix.

7.  The names of files in Map fields no longer include the filename prefix,
but the file names themselves do carry the prefix.

8.  All file names in Map fields now begin with the string "image-".  The
original thought of distinguishing image- from other- is better handled by
the content descriptor associated with each image.

9.  Image numbers in image filenames no longer have leading zeros.

10.  The "..." convention is gone; it is replaced with a plan that there
will be one Map field for every image of the file.  (This change is coupled
with a plan to use Excel to create the scanned document record.  Excel, given
two consecutive examples of image-to-page number correspondences, can
easily fill in any number of similar such correspondences.)

11.  Problems with particular pages, such as the original being skewed or
wrinkled or containing a photo, will be recorded as a comment on the Map
entry for that page.

12.  The "page" content identifier is renamed "numbered" and is defined as
containing the publisher's intended page number (assuming the intent can be
determined) whether or not the page number actually appears on the page.