Library2000 and Imaged LCS and AI Reports

LCS and AI technical reports and memos are being scanned in a joint venture between the Library 2000 group of LCS at MIT, the MIT Libraries (in particular MIT Document Services), and the publications offices of AI and LCS. (AI also has its own scanning program.)

How are we funded? The project is funded by the Research on Linking Electronic Libraries of Scientific and Technical Information project of the Computing Systems Technology Office (CSTO) of the Advanced Research Projects Agency (ARPA) of the US Department of Defense, through a grant (MDA972-92-J1029) to the Corporation for National Research Initiatives (CNRI), as part of its Computer Science Technical Report (CSTR) project. Support for Library2000 has also been received from IBM and DEC.

Why are we scanning? To make old paper reports available online, and to preserve reports in a somewhat archival form. One may expect images, because of their brute-force simplicity, to be understandable years after the layout language of the moment is forgotten, and to avoid the variations in rendering which often accompany the more powerful languages.

What are we scanning? Our objective is to have images of all reports and memos after 1993, and eventually all AI and LCS reports and memos back through Project MAC, begun in the 1960s. With paper documents, we are scanning the earliest-generation originals we can find. These are accompanied by DoD coversheets and quality control test patterns. With documents for which authoritative PostScript exists, we are generating the images directly. Both AI and LCS publications now require that submissions be entirely in PostScript, so future paper scanning is not anticipated.

What is the result? We collect 600 bpi (bits-per-inch) black&white images, stored as CCITT G4 compressed TIFF files. These can be shrunk to lower resolution grayscale for display, and embedded in PostScript for printing. The result printed on a 600 bpi printer is comparable to good photocopying. We are currently working to make these resources easily available to the public.

How are we scanning? We set out to explore the issues of production archival scanning of originals. As one generally wishes to touch originals only once, we try to capture as detailed an image as possible. Thus we are scanning at 400 bpi 8-bit grayscale, having been unable to find a production scanner capable of more. This produces 15 MByte page images, which we archive to tape, and filter down to 600 bpi black&white for service. This filtering is similar to that which occurs within many scanners, which optically scan at 400 bpi grayscale and then deliver a processed 600 bpi black&white image. The advantage of capturing the grayscale is that it permits going back later and applying different reduction algorithms. The disadvantage is the volume of raw data (gigabytes per day). We have mainly been using a Fujitsu 3091G scanner on a Mac Quadra 840AV. Among our greatest difficulties have been getting a clean grayscale response, and simply moving the gigabytes off the Mac.
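The 15 MByte figure can be checked with a little arithmetic. The sketch below assumes a US Letter page (8.5 x 11 inches), which the text does not state, and ignores TIFF headers and row padding; the function name and constants are illustrative only.

```python
# Assumption: US Letter page size; the source does not give page dimensions.
PAGE_WIDTH_IN = 8.5
PAGE_HEIGHT_IN = 11.0


def raw_image_bytes(dpi: int, bits_per_pixel: int,
                    width_in: float = PAGE_WIDTH_IN,
                    height_in: float = PAGE_HEIGHT_IN) -> int:
    """Uncompressed size in bytes of one page image (no headers, no row padding)."""
    width_px = round(width_in * dpi)
    height_px = round(height_in * dpi)
    total_bits = width_px * height_px * bits_per_pixel
    return (total_bits + 7) // 8  # round up to whole bytes


gray_400 = raw_image_bytes(400, 8)  # archival grayscale scan
bw_600 = raw_image_bytes(600, 1)    # service image before G4 compression
print(gray_400, bw_600)
```

Under these assumptions the grayscale page comes to 14,960,000 bytes, consistent with the 15 MByte figure, while the 600 bpi black&white page is a little over 4 MBytes before G4 compression.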

How are we imaging PostScript? We are using Aladdin Ghostscript. Being non-commercial, we can easily use the newer versions available directly from Aladdin, rather than the delayed versions they make available through GNU.

What quality control test patterns are being used? We are currently using the IEEE Std 167A-1987 Facsimile Test Chart (75bpi 8bit, 120KB; 600bpi 1bit, 200KB; 400bpi 8bit, 9MB, 14MB uncompressed) and the AIIM Scanner Test Chart #2 (75bpi 8bit, 120KB; 600bpi 1bit, 200KB; 400bpi 8bit, 9MB, 14MB uncompressed). Neither is particularly well adapted for our purposes. Our search continues, as does work in standards bodies to produce better patterns (called "targets").

What is a DoD cover sheet?

What about covers?

What are these publication series? [Find a copy of my email!] Project MAC had three series, Technical Report, Memo, and an internal Memorandum. AI has two, Technical Report, and Memo. LCS has three, Technical Report, Memo, and Research Seminar Series. When the Project MAC AI group spun off to become the AI Lab, its MAC TRs became the first AI TRs. Its MAC Memos ...? When Project MAC became LCS, the MAC TRs and MAC Memos became LCS TRs and LCS Memos, with the numbering preserved.

What are report ids? Report ids are the identifiers assigned by the publications offices. Their form has been fairly stable over the years, the largest change accompanying the breakup of Project MAC. We use the current forms with all reports. These are "AI-TR-#", "AIM-#", "MIT/LCS/TR-#", and "MIT/LCS/TM-#". AIMs have a variant form "AI Memo #". Revisions are identified (sometimes) by appending a letter. The starting letter (a or b), means of attachment (adjacent, space, period), and case (a or A) all vary. Within the project, there has been an attempt to standardize on adjacent and lowercase, rather than period, which is used as a filename component separator (as in "LCS-TM-10.600x1cG4.pg0010.tif").
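The revision convention described above (letter adjacent to the number, lowercase) might be applied along these lines. This is a hypothetical helper, not the project's own tooling; it assumes a revision is a single trailing letter directly after the number, optionally attached with a space or period, and the example ids are made up.

```python
import re

# Hypothetical pattern: an id ending in a digit, an optional space or period,
# then exactly one revision letter.
_REVISION = re.compile(r"^(?P<base>.*\d)[ .]?(?P<rev>[A-Za-z])$")


def normalize_revision(report_id: str) -> str:
    """Attach a trailing revision letter adjacent and lowercase; otherwise return the id unchanged."""
    text = report_id.strip()
    match = _REVISION.match(text)
    if match is None:
        return text  # no revision letter found
    return match.group("base") + match.group("rev").lower()


print(normalize_revision("AIM-123.A"))      # period-attached, uppercase
print(normalize_revision("AIM-123 b"))      # space-attached
print(normalize_revision("MIT/LCS/TR-45"))  # no revision letter
```

Ids without a trailing letter pass through untouched, so the function can be applied to a whole listing without first separating out the revised reports.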

What reports have been published?

How is the MIT CSTR ftp site organized? The site has directories for series (AIM, AITR, LCS-RSS, LCS-TM, LCS-TR), with subdirectories for number ranges. AITR should probably have been AI-TR. Ah well. As AIM and AITR share a number space, it might be useful to create a combined subdir. Perhaps. There is no MAC-TR or MAC-TM at the moment because the LCS-TR and LCS-TM series absorbed them.

AIM:- 0099, 0199, 0299, 0399, 0499, 0599, 0699, 0799, 0899, 0999, 1099, 1199, 1299, 1399, 1499, 1500.
AITR:- 0299, 0399, 0499, 0599, 0699, 0799, 0899, 0999, 1099, 1199, 1299, 1399, 1499, 1599.
LCS-RSS:- 0099.
LCS-TM:- 0099, 0199, 0299, 0399, 0499, 0599, 0699.
LCS-TR:- 0099, 0199, 0299, 0399, 0499, 0599, 0699.
Range directories contain individual report directories, named with the report id (or in the case of AI-TR, with AITR replacing AI-TR).
Report directories contain files of the form (id).(format). Formats include bib, ps, and 600x1cG4. When the document representation in a particular form requires more than one file, the files are moved into a similarly named directory. A type suffix (.ps, .tif) is sometimes added to helpfully convey type information. The current image characterization is (dpi-resolution)x(depth-in-bits)c(compression-scheme). A format should probably also be added. As we start to accumulate reports with multiple versions of a single flavor, something like "d2" (dataset 2) needs to be added. So, perhaps "LCS-TM-10.600x1cG4d2.pg0001.tif".
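As an illustration of the naming scheme, the sketch below splits a page-image file name into its parts. The pattern is inferred only from the two example names in the text; the optional "d2" dataset marker is the proposal above rather than an established convention, and the class and function names are made up for this example.

```python
import re
from typing import NamedTuple, Optional


class ImageFileName(NamedTuple):
    report_id: str
    dpi: int
    depth_bits: int
    compression: str
    dataset: Optional[int]  # proposed "dN" marker, may be absent
    page: Optional[int]     # "pgNNNN" component, absent for single-file forms
    suffix: Optional[str]   # optional type suffix such as "tif"


# (id).(dpi)x(depth)c(compression)[d(dataset)][.pg(page)][.(suffix)]
_PATTERN = re.compile(
    r"^(?P<id>[^.]+)"
    r"\.(?P<dpi>\d+)x(?P<depth>\d+)c(?P<comp>[A-Za-z0-9]+?)"
    r"(?:d(?P<dataset>\d+))?"
    r"(?:\.pg(?P<page>\d+))?"
    r"(?:\.(?P<suffix>[A-Za-z]+))?$"
)


def parse_image_name(name: str) -> Optional[ImageFileName]:
    """Return the parsed components, or None if the name is not an image form."""
    m = _PATTERN.match(name)
    if m is None:
        return None
    return ImageFileName(
        report_id=m.group("id"),
        dpi=int(m.group("dpi")),
        depth_bits=int(m.group("depth")),
        compression=m.group("comp"),
        dataset=int(m.group("dataset")) if m.group("dataset") else None,
        page=int(m.group("page")) if m.group("page") else None,
        suffix=m.group("suffix"),
    )


print(parse_image_name("LCS-TM-10.600x1cG4.pg0010.tif"))
print(parse_image_name("LCS-TM-10.600x1cG4d2.pg0001.tif"))
```

Names in other formats, such as a ".bib" or ".ps" file, do not match the image pattern and yield None, which keeps the image-specific parsing separate from the other representations.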

What are scan records? doc.
