[The title and date appearing in the header below are incorrect. This document appears to be the third draft of version 1.1, created November 18, 1994, with the addition of an unspecified checksum field.]

Scanned document record, version CSTR 1.1 (second draft)
November 9, 1994
By Jerry Saltzer, Jack Eisan, and Mike Cook

[A list of changes since the previous draft will be found at the end.]

OBJECTIVES

The objective of this document is to specify both the form and content of the information that must be captured when a document is scanned, as an on-line record that becomes a component of the scanned form of the document. The objective of the scanned document record is to capture information that is not explicit in the scanned image, yet is needed:

1. to view, display, or print the image properly.
2. to understand how to interpret the image.
3. to meet contractual or legal requirements.

GENERAL APPROACH

The form of a scanned document record is a series of named fields, one per line. Each field line begins with the field name followed by a colon, then the field value. Any line of the scanned document record may have a comment at the end, preceded by a semicolon. Any program reading the file will ignore all comments. A field line may be of any length, but it may NOT include typed carriage returns.

This form is intended to be easily created by a human operator using a word processor or spreadsheet, but it is to be used by computer programs that browse, display, or print documents, so the contents of some of the fields must conform to specific standards.

The content of a scanned document record is most easily explained by first exhibiting a complete example, and then describing the requirements on field contents:

A COMPLETE EXAMPLE

Scanned document record version: CSTR 1.1
Publishing department: M. I. T. Lab for Computer Science
Report label: MIT-LCS-TM-13
Source: first-generation original
Document series: TM
Image count: 30
Scanning agent: M. I. T. Document Services
Input form: single-sided
Input size: 8.5 x 11
Suggested print form: double-sided
Suggested print size: 6 x 9
Scanner used: Fujitsu 3096g ADF
Software used: Optix version 1.01
Operator: Michael Cook
Date Scanned: 9/28/1994
Resolution(dpi): 400
Greyscale depth(bits): 8
Scanner settings: default
Text quality: normal
Scanned document record file name: MIT-TR-13-scanrecord.txt
Note:
Map: MIT-TR-13-image-1 1234567 cover
Map: MIT-TR-13-image-2 8901234 blank
Map: MIT-TR-13-image-3 5678912 unnumbered ; title page
Map: MIT-TR-13-image-4 3456789 blank
Map: MIT-TR-13-image-5 0123456 unnumbered ; acknowledgement
Map: MIT-TR-13-image-6 7890123 blank
Map: MIT-TR-13-image-7 4567890 numbered 1
Map: MIT-TR-13-image-8 1234567 numbered 2
Map: MIT-TR-13-image-9 8901234 numbered 3
Map: MIT-TR-13-image-10 5678901 numbered 4
Map: MIT-TR-13-image-11 2345678 numbered 5
Map: MIT-TR-13-image-12 9012345 numbered 6
Map: MIT-TR-13-image-13 6789012 numbered 7
Map: MIT-TR-13-image-14 3456789 numbered 8
Map: MIT-TR-13-image-15 0123456 numbered 9
Map: MIT-TR-13-image-16 7890123 numbered 10 ; original skewed
Map: MIT-TR-13-image-17 4567890 numbered 11
Map: MIT-TR-13-image-18 1234567 numbered 12
Map: MIT-TR-13-image-19 8901234 numbered 13
Map: MIT-TR-13-image-20 5678901 numbered 14 ; original says page 16
Map: MIT-TR-13-image-21 2345678 numbered 15
Map: MIT-TR-13-image-22 9012345 numbered 16
Map: MIT-TR-13-image-23 6789012 numbered 17
Map: MIT-TR-13-image-24 3456789 spine
Map: MIT-TR-13-image-25 0123456 supporting ; instructions to printer
Map: MIT-TR-13-image-26 7890123 doccontrol
Map: MIT-TR-13-image-27 4567890 calibration IEEE-167a-1987
Map: MIT-TR-13-image-28 1234567 calibration AIIM-#2
Map: MIT-TR-13-image-29 8901234 agent
Map: MIT-TR-13-image-30 5678789 scancontrol

REQUIREMENTS ON THE FIELDS

The following requirements on the field contents allow a computer program to interpret the scanning record unambiguously:

Scanned document record version:
must contain exactly the string shown in the example.

Publishing department, Report label, Document series, Scanner used, Software used, and Operator: Each can contain any string of characters. Conventionally, the report label will begin with the string "MIT-".

Source: One of the strings "First-generation original", "Later-generation copy", or "PostScript".

Image count: Must contain an integer. (The image count is the number of image files created for this document, including calibration, identification, and blank targets. It matches the number in the highest-numbered Map record.)

Input form and Suggested print form: Must contain either "single-sided" or "double-sided".

Input size and Suggested print size: Must contain two decimal numbers separated by the letter "x". The numbers are assumed to be measurements in inches.

Date scanned: Month/Day/Year, each component being an integer (leading zeros omitted) and the year being four digits.

Resolution and Greyscale depth: Must be integers (dots per inch and bits, respectively, as the field names in the example indicate).

Scanner settings: At this time the only allowed value for this field is "default".

File name: A string that is used to prefix the names of all image and data files produced in the course of scanning this document. Note that upper- and lower-case letters in the file names appear in this field exactly as they appear in the actual file name, because on some computer systems upper- and lower-case letters are distinct.

Scanned document record file name: The exact file name of the scanning record itself with the file name prefix omitted.

Note: An optional field containing any desired string. If this field is not empty, a properly constructed browser or print program will display its contents to the reader.

Map: Every scanned image is represented by a Map field. This is the only field that may appear more than once in the scanning record. All Map fields must be together at the end of the scanning record, in order by image number.
A Map field consists of an image file name, a checksum, and an image content identifier. The content identifier must be one of the following:

    scancontrol   Scanning project control form
    control       Government control form
    calibration   Test target
    agent         Logo of the scanning agent
    blank         Blank page inserted to maintain duplex printing order
    spine         Image intended for the book spine
    cover         Image intended for the book cover
    unnumbered    Page for which it appears that the publisher did not intend a page number
    numbered      Page of text that the publisher intended to number
    supporting    Material not intended for display or printing

The content identifier "calibration" is followed by a string that identifies a calibration test target. The content identifier "numbered" is followed by a string that represents the publisher's originally assigned page number.

Except for numbered pages, the order of the Map fields is taken to be the order in which the images they describe are intended to be printed or displayed. Numbered pages may appear in the map in any order.

OPEN QUESTIONS AND THINGS FOR FURTHER THOUGHT OR FUTURE REVISIONS

1. The "Document series" field is probably not needed, because the information is contained in the document label field.

2. Original and input sheet sizes are not separately recorded; perhaps they should be. (Some older TR's were printed in a 6 x 9-inch format from 8.5 x 11-inch originals, and other TR's may be printed from pages produced on cut-sheet printers.)

4. The original document date isn't captured. Should it be?

5. Need a systematic way to insert scanning agent logo on cover page.

6. The content identifiers "scancontrol" and "spine" could be handled as "supporting", with their intended use appearing as comments.

7. There is no support for bringing the reader's attention to the copyright notice. One suspects that some scheme will be needed for a browser to locate that notice and display it.
(Since TR's and TM's generally don't have copyright notices, this issue isn't particularly pressing.)

8. It may be appropriate to record observations by the scanning operator more systematically, for example that the original is skewed, off-center, or wrinkled, or contains a photograph or other non-textual material.

ACKNOWLEDGEMENT

This note is expanded from a set of ideas originally developed at a Library 2000 group meeting on March 17, 1994. Discussants: Jack Eisan, Mitchell Charity, Ali Alavi, Sally Richter, Mary Anne Ladd, Jeremy Hylton, Geoff Seyon, Eytan Adar, Greg Anderson, Jerry Saltzer. Since that time additional suggestions and ideas have come from Michael Cook, Gillian Elcock, Yoav Yerushalmi, and Andrew Kass.

-------------------------------------------------------------------

Changes in the second draft:

1. The page count field is gone, on the basis that the concept is ill-defined. The only count field is now the image count, which is a precise concept.

2. The names of the fields "Original form" and "Original size" are now "Input form" and "Input size" to reflect the observation that these are the form and size of the pages actually scanned. The first-generation originals may have had a different form or size.

3. The names of the fields "Intended print form" and "Intended print size" are now "Suggested print form" and "Suggested print size" to reflect the possibility that the original intent may not be known; this is the best guess of the current publisher.

4. There is a new field "Text quality" with the allowed values "light", "dark" and "normal", to capture observations by the scanning operator. This field is a first cut at communicating this class of information from the scanning operator to the display process and it is likely to evolve.

5. The field "file name generator" is now simply "File name".
In addition to using this string as the prefix for each image file name, it will also be used conventionally as the name of the folder or directory that contains all the image and related files for a single report. Also, the conventional value of the field no longer ends with a hyphen; the hyphen is understood to be inserted whenever the name is used as a prefix.

6. The value of the scanning record file name field no longer includes the file name prefix, but the file name itself does carry the prefix.

7. The names of files in Map fields no longer include the file name prefix, but the file names themselves do carry the prefix.

8. All file names in Map fields now begin with the string "image-". The original thought of distinguishing image- from other- is better handled by the content descriptor associated with each image.

9. Image numbers in image file names no longer have leading zeros.

10. The "..." convention is gone; it is replaced with a plan that there will be one Map field for every image of the file. (This change is coupled with a plan to use Excel to create the scanned image record. Excel, given two consecutive examples of image-to-page number correspondences, can easily fill in any number of similar correspondences.)

11. Problems with particular pages, such as the original being skewed or wrinkled or containing a photo, will be recorded as a comment on the Map entry for that page.

12. The "page" content identifier is renamed "numbered" and is defined as containing the publisher's intended page number (assuming the intent can be determined) whether or not the page number actually appears on the page.
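
-------------------------------------------------------------------

Illustrative sketch of a record reader:

The field-line format (GENERAL APPROACH) and several of the rules under REQUIREMENTS ON THE FIELDS can be sketched in Python as follows. This is a minimal sketch under stated assumptions, not part of the specification: the names MapEntry, split_field, parse_record, and check_record are hypothetical; field names are compared case-insensitively because the example writes "Date Scanned" while the requirements write "Date scanned"; both "control" and "doccontrol" are accepted because the identifier list and the example differ; and the checksum is kept as text because its algorithm is not specified. The sample record is abridged and renumbered from the complete example.

```python
import re
from dataclasses import dataclass
from typing import Dict, List, Optional, Tuple

# Content identifiers from the specification. "doccontrol" is added as an
# assumption because the complete example uses it in place of "control".
CONTENT_IDS = {
    "scancontrol", "control", "doccontrol", "calibration", "agent", "blank",
    "spine", "cover", "unnumbered", "numbered", "supporting",
}


@dataclass
class MapEntry:
    file_name: str
    checksum: str                 # kept as text; the algorithm is unspecified
    content: str
    detail: Optional[str] = None  # page number or calibration target name


def split_field(line: str) -> Optional[Tuple[str, str]]:
    """Return (name, value) for one record line, or None if it holds no field."""
    body = line.split(";", 1)[0].strip()  # everything after ';' is a comment
    if not body:
        return None
    name, colon, value = body.partition(":")
    if not colon:
        raise ValueError(f"no field name in line: {line!r}")
    # Assumption: field names are compared case-insensitively.
    return name.strip().lower(), value.strip()


def parse_record(text: str) -> Tuple[Dict[str, str], List[MapEntry]]:
    """Split a record into its single-valued fields and its Map entries."""
    fields: Dict[str, str] = {}
    maps: List[MapEntry] = []
    for line in text.splitlines():
        field = split_field(line)
        if field is None:
            continue
        name, value = field
        if name != "map":
            if maps:
                raise ValueError("Map fields must be together at the end")
            fields[name] = value
            continue
        parts = value.split(None, 3)
        if len(parts) < 3 or parts[2] not in CONTENT_IDS:
            raise ValueError(f"malformed Map field: {line!r}")
        detail = parts[3] if len(parts) == 4 else None
        if parts[2] in ("numbered", "calibration") and detail is None:
            raise ValueError(f"missing page number or target name: {line!r}")
        maps.append(MapEntry(parts[0], parts[1], parts[2], detail))
    return fields, maps


def check_record(fields: Dict[str, str], maps: List[MapEntry]) -> List[str]:
    """Return a list of problems found; an empty list means none were found."""
    problems = []
    if fields.get("scanned document record version") != "CSTR 1.1":
        problems.append("unexpected record version")
    # With one Map field per image, the image count equals the number of Map fields.
    if fields.get("image count") != str(len(maps)):
        problems.append("image count does not match the Map fields")
    for key in ("input form", "suggested print form"):
        if fields.get(key) not in ("single-sided", "double-sided"):
            problems.append(f"bad value for {key}")
    size = re.compile(r"\d+(\.\d+)?\s*x\s*\d+(\.\d+)?")
    for key in ("input size", "suggested print size"):
        if not size.fullmatch(fields.get(key, "")):
            problems.append(f"bad value for {key}")
    if not re.fullmatch(r"[1-9]\d?/[1-9]\d?/\d{4}", fields.get("date scanned", "")):
        problems.append("bad value for date scanned")
    return problems


# Abridged sample adapted from the complete example (image numbers compacted).
SAMPLE = """\
Scanned document record version: CSTR 1.1
Image count: 3
Input form: single-sided
Input size: 8.5 x 11
Suggested print form: double-sided
Suggested print size: 6 x 9
Date Scanned: 9/28/1994
Note:
Map: MIT-TR-13-image-1 1234567 cover
Map: MIT-TR-13-image-2 5678912 unnumbered ; title page
Map: MIT-TR-13-image-3 4567890 numbered 1
"""

if __name__ == "__main__":
    sample_fields, sample_maps = parse_record(SAMPLE)
    print(check_record(sample_fields, sample_maps))
```

Treating a line that contains only a comment as empty is a further assumption; the specification says only that comments may appear at the end of a line. The sketch does not verify checksums, the Source or Text quality values, or the ordering of Map fields by image number.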