Scanned document record, version CSTR 1.1

October 30, 1994

By Jerry Saltzer

The objective of this document is to specify both the form and content of the information that must be captured when a document is scanned, as an on-line record that becomes a component of the scanned form of the document.

The objective of the scanning record is to capture information that is not explicit in the scanned image, yet is needed:

1. to view, display, or print the image properly.
2. to understand how to interpret the image.
3. to meet contractual or legal requirements.

The form of a scanning record is a series of named fields, one per line. Each field line begins with the field name followed by a colon, then the field value. Any line of the scanning record may have a comment at the end, preceded by a semicolon. Any program reading the file will ignore all comments. A field line may be of any length, but it may NOT include typed carriage returns. This form is intended to be easily created by a human operator using a word processor, but it is to be used by computer programs that browse, display, or print documents, so the contents of some of the fields must conform to specific standards.

The content of a scanning record is most easily explained by first exhibiting a complete example, and then describing the requirements on field contents:

Scanning record version: CSTR 1.1
Publishing department: M. I. T. Lab for Computer Science
Technical report label: MIT-LCS-TM-13
Document series: TM
Page count: 24
Image count: 31
Scanning agent: M. I. T. Document Services
Original form: single-sided
Original size: 8.5 x 11
Intended print form: double-sided
Intended print size: 6 x 9
Scanner used: Fujitsu 3096g ADF
Software used: Optix version 1.01
Operator: Michael Cook
Date Scanned: 9/28/1994
Resolution(dpi): 400
Greyscale depth(bits): 8
Scanner settings: default
File name generator: LCS-TM-13-
Scanning record file name: LCS-TM-13-data.txt
Note:
Map: LCS-TM-13-image-01=cover
Map: LCS-TM-13-image-02=blank
Map: LCS-TM-13-image-03=unnumbered ; title page
Map: LCS-TM-13-image-04=blank
Map: LCS-TM-13-image-05=unnumbered ; acknowledgement
Map: LCS-TM-13-image-06=blank
Map: LCS-TM-13-image-07=page 1
Map: ...
Map: LCS-TM-13-image-24=page 18
Map: LCS-TM-13-other-01=spine
Map: LCS-TM-13-other-02=supporting ; instructions to printer
Map: LCS-TM-13-other-03=doccontrol
Map: LCS-TM-13-other-04=calibration IEEE-167a-1987
Map: LCS-TM-13-other-05=calibration AIIM-#2
Map: LCS-TM-13-other-06=agent
Map: LCS-TM-13-other-07=scancontrol

The following requirements on the field contents allow a computer program to interpret the scanning record unambiguously:

Scanning record version: must contain exactly the string shown in the example.

Publishing department, Technical report label, Document series, Scanner used, Software used, and Operator: each can contain any string of characters.

Page count and Image count: must contain integers. (The page count is the number of different original sides that need to be scanned. The image count is the number of image files created for this document, including calibration, identification, and blank targets, etc.)

Original form and Intended print form: must contain either "single-sided" or "double-sided".

Original size and Intended print size: must contain two decimal numbers separated by the letter "x". The numbers are assumed to be measurements in inches.

Date scanned: Month/Day/Year, each component being an integer and the year being four digits.

Resolution and Greyscale depth: must be integers.

Scanner settings: At this time the only allowed value for this field is "default".

File name generator: A string that is used to prefix the names of all image and data files produced in the course of scanning this document. Note that upper- and lower-case letters in the file names appear in this field exactly as they appear in the actual file name, because on some computer systems upper- and lower-case letters are distinct.

Scanning record file name: The exact file name of the scanning record itself.

Note: an optional field containing any desired string. If this field is not empty, a properly constructed browser or print program will display its contents to the reader.

Map: Every scanned image is represented by a Map: field. This is the only field that may appear more than once in the scanning record. All Map fields must be together at the end of the scanning record. With one exception, every Map field consists of an image file name followed by an equal sign followed by an image content identifier. The content identifier must be one of the following:

    scancontrol   Scanning project control form
    control       Government control form
    calibration   Test target
    agent         Logo of the scanning agent
    blank         Blank page inserted to maintain duplex printing order
    spine         Image intended for the book spine
    cover         Image intended for the book cover
    unnumbered    Publisher did not intend a page number
    page          Publisher's intended page number, even if omitted
    supporting    Material not intended for display or printing

The content identifier "calibration" is followed by a string that identifies a calibration test target. The content identifier "page" is followed by a string that represents a publisher's originally assigned page number.

A Map field may contain just the string "...", which means that the preceding and following Map fields are the beginning and the end of a consecutive series of numbered pages.

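As an illustration of how a reading program might apply the rules above, here is a minimal Python sketch. It is not part of the specification. The function names parse_scanning_record and expand_ellipsis are hypothetical, and the sample is abridged from the example record. Expanding "..." rests on two assumptions that the text does not state outright: that image file names end in a decimal sequence number, and that the page numbers on either side of "..." are plain integers.

```python
import re

def parse_scanning_record(text):
    """Split a scanning record into ordinary fields and ordered Map entries."""
    fields = {}
    maps = []  # (file name, content identifier, detail), in print order
    for raw in text.splitlines():
        line = raw.split(";", 1)[0].strip()   # programs ignore all comments
        if not line:
            continue
        name, _, value = line.partition(":")  # field name ends at first colon
        name, value = name.strip(), value.strip()
        if name != "Map":
            fields[name] = value
        elif value == "...":
            maps.append(("...", None, None))  # placeholder for a page run
        else:
            file_name, _, content = value.partition("=")
            identifier, _, detail = content.strip().partition(" ")
            maps.append((file_name.strip(), identifier, detail.strip()))
    return fields, maps

# Assumption: an image file name ends in a decimal sequence number.
_TRAILING_NUMBER = re.compile(r"^(.*?)(\d+)$")

def expand_ellipsis(maps):
    """Replace each "..." entry with the numbered pages it stands for."""
    expanded = []
    for i, entry in enumerate(maps):
        if entry[0] != "...":
            expanded.append(entry)
            continue
        prev_name, _, prev_page = maps[i - 1]
        next_name = maps[i + 1][0]
        prefix, first = _TRAILING_NUMBER.match(prev_name).groups()
        last = int(_TRAILING_NUMBER.match(next_name).group(2))
        # Assumption: page numbers in the run are consecutive integers.
        for offset, n in enumerate(range(int(first) + 1, last), start=1):
            file_name = f"{prefix}{n:0{len(first)}d}"  # keep zero padding
            expanded.append((file_name, "page", str(int(prev_page) + offset)))
    return expanded

SAMPLE = """Scanning record version: CSTR 1.1
Page count: 24
Map: LCS-TM-13-image-05=unnumbered ; acknowledgement
Map: LCS-TM-13-image-06=blank
Map: LCS-TM-13-image-07=page 1
Map: ...
Map: LCS-TM-13-image-24=page 18
Map: LCS-TM-13-other-04=calibration IEEE-167a-1987
"""

fields, maps = parse_scanning_record(SAMPLE)
images = expand_ellipsis(maps)
for file_name, identifier, detail in images:
    print(file_name, identifier, detail)
```

The sketch keeps field values as strings and does not check them against the requirements listed earlier. A real reader would also need to reject a "..." that is not surrounded by two "page" entries.
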
The order of the Map fields is taken to be the order in which the images they describe are intended to be printed or displayed.

-------------------------

Open questions:

1. The "Document series" field is probably not needed, because the information is contained in the document label field.

2. The "File name generator" field may not be needed, because all file names are explicitly listed in the map.

3. Original and printed sheet sizes are not recorded; they probably should be, because some older TR's are printed in a 6 x 9-inch format from 8.5 x 11-inch originals, and other TR's may be printed from pages produced on cut-sheet printers.

4. The original document date isn't captured. Should it be?

5. A systematic way is needed to insert the scanning agent logo on the cover page.

6. The content identifiers "scancontrol" and "spine" could be handled as "supporting", with their intended use appearing as comments.

7. The conventional file names suggested here are a little different from those currently in use, but they are completely arbitrary. The form of the map defined here eliminates any need for a program to gather information from a file name except when using consecutive image numbers to deduce consecutive page numbers.

8. There is no support for bringing the reader's attention to the copyright notice. One suspects that some scheme will be needed for a browser to locate that notice and display it. (Since TR's and TM's generally don't have copyright notices, this issue isn't particularly pressing.)

------------------------------------------------

Acknowledgement: This note is expanded from a set of ideas originally developed at a Library 2000 group meeting on March 17, 1994. Discussants: Jack Eisan, Mitchell Charity, Ali Alavi, Sally Richter, Mary Anne Ladd, Jeremy Hylton, Geoff Seyon, Eytan Adar, Greg Anderson, Jerry Saltzer. Since that time additional suggestions and ideas have come from Michael Cook, Gillian Elcock, Yoav Yerushalmi, and Andrew Kass.