From:       MIT component of CS-TR project
            (Jerry Saltzer and Greg Anderson)

To:         Bob Kahn, CNRI

Subject:    Quarterly Progress report, April 1--June 30, 1996

A.  Scanning and other operational activities
B.  Technical and research activities
C.  Publications

-------------------------------


A.  Scanning and other operational activities:


A.1  High-resolution scanning

Since this project is nearing its end, we shifted priorities somewhat
this quarter.  Regular scanning was discontinued on May 30 and
attention shifted to preservation copying.  Section A.3, below,
describes the details of the preservation copying plan.

Production Statistics                   reports       pages

Scanning operations:

  Cumulative total, March 31, 1996        997        79,935

  Scanning in current quarter:

       LCS Reports                         35         5,063
        AI Reports                         18         3,183
  total for quarter                        53         8,246

  Cumulative total, June 30, 1996        1050        88,181


Reports placed on-line:

  post-scan processing                    101        15,733
  cumulative total, June 30, 1996         667        61,523

Preservation copying, begun June 10, 1996:

  DAT to CD conversion                     16            58 CD's
  cumulative total, June 30, 1996          16            58 CD's

The 1,050 reports scanned represent about half of the 2000 technical
reports and technical memoranda that make up the total accumulated
output of the Artificial Intelligence Laboratory and the Laboratory
for Computer Science.  We had originally hoped to scan all of those
2000 reports,
but the process of learning how to manage the workflow of
production-quality high-resolution scanning meant that our scanning
rate started out low and grew slowly.  If the scanning rate achieved
during the last three quarters of the project had been possible from
the beginning, the original goal would have been accomplished with
capacity to spare.


A.2  Hardware and operations

All of the outstanding hardware problems mentioned in the previous
quarterly report have been resolved.  As planned, a PowerPC 8500 and a
4 Gbyte disk were acquired and placed in service to increase our
ability to restore archive tapes for post-scan processing.  In
addition, a Yamaha/APS CD recorder was acquired and placed into
service as part of the preservation copying project described in
Section A.3, below.


A.3  Preservation copying

During the lifetime of the project, we have accumulated off-line some
1.3 Terabytes of raw, high-resolution, grey-scale image data.
(Because of our use of 400 pixel per inch grey-scale scanning, this is
probably a substantially larger repository than the other CS-TR sites
accumulated.)  One of our concerns is to assure that this data is
preserved for future use.

In our standard scanning workflow, the high-resolution image data is
processed down to a lower-resolution form for on-line delivery,
display, and printing with current technology, and the original
high-resolution image data is copied to Digital Audio Tape (DAT) for
future use when on-line storage capacities and processing speeds
increase. Since DAT has an expected media lifetime of only a few
years, and a probable technology lifetime of no more than a decade,
longer-term preservation of the data requires transfer to a more
stable medium.

As mentioned in the previous quarterly report, we have recently
concluded that recordable CD (CD-R) has a suitable balance of
properties for preservation:  the increasingly popular use of CD-ROM's
for computer applications ensures that reading technology will be
available for a long time, and studies suggest that the media itself
has a possible lifetime of 70 to 100 years.  The lifetime requirement
at hand is probably only a decade or two, but no other off-line medium
provides an intermediate capability, CD-R costs are relatively low,
and the projected lifetime provides a margin of safety. We have
therefore begun transferring the data from DAT to CD-R, using a
Yamaha/APS 4X CD recorder attached to the Macintosh PowerPC 8500.  For
recording software we are using Astarte "Toast CD-ROM Pro", version
3.0.

A.3.1  Blank media

Three recent white papers have explored issues of media compatibility
and longevity.  A paper from Los Alamos Scientific Laboratory examined
error rates of disk blanks from different vendors, of different types,
and using different recorders. The primary conclusion from that
study is that disks manufactured using cyanine dye (sometimes called
gold/green disks) have lower error rates when written at slow
recording speeds, and that disks manufactured using phthalocyanine dye
(sometimes called gold/gold disks) have fewer problems when written at
higher recording speeds.

A second white paper, from TDK, describes heat-accelerated life tests
on cyanine-dye disks, and the third white paper, from Kodak, describes
heat-accelerated life tests on phthalocyanine-dye disks.  These
studies suggest that cyanine-dye disks may have a lifetime of 70 years
or more if they are stored in suitably dark conditions, and that
phthalocyanine-dye disks may have a lifetime of 100 years or more and
are less sensitive to bright light.

Anecdotal evidence from on-line news groups suggests that some vendors
of cyanine-dye blanks have had runs of bad media.  Taking all these
considerations into account, we have chosen to adopt
phthalocyanine-dye disks as a standard.  This choice currently
restricts us to two original equipment vendors, Mitsui and Kodak.
Since Kodak disks are manufactured with an extra layer of protective
lacquer and with a machine-readable bar-code containing the disk
sequence number, we have specified the use of Kodak blank media.

A.3.2  Data format

We have chosen to record data on the CD in the ISO 9660 level 1 format,
an interoperable format that can be mounted as a file system by most
current computer systems equipped with a CD reader.  We are recording
image files in TIFF, a self-describing format that is in widespread
use.  Text files are each recorded three times using three different
end-of-line conventions: one for the Macintosh (carriage-return
characters separate lines), one for UNIX platforms (line-feed
characters separate lines), and one for DOS/Windows platforms
(carriage-return and line-feed characters separate lines).  In
addition to the original scan record for each document and the scan
record specification, a text file describing the background of the
project, the workflow and software that we used, and the format and
naming conventions used on the CD is written on every CD as a file
named "README".
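
As an illustration of the three end-of-line conventions, the following
Python sketch produces the three variants of a text file's contents.
It is not the script used in the project; the function name
eol_variants and the dictionary keys are assumptions made only for
this example.

```python
# Sketch only: names below are hypothetical, not the project's own.

EOL_CONVENTIONS = {
    "MAC": "\r",     # Macintosh: carriage return separates lines
    "UNIX": "\n",    # UNIX: line feed separates lines
    "DOS": "\r\n",   # DOS/Windows: carriage return plus line feed
}


def eol_variants(text):
    """Return the text rendered under each end-of-line convention."""
    # First normalize any mixture of CR+LF, CR, or LF to plain LF.
    unix = text.replace("\r\n", "\n").replace("\r", "\n")
    # Then substitute the separator required by each platform.
    return {name: unix.replace("\n", eol)
            for name, eol in EOL_CONVENTIONS.items()}
```

When writing the variants to disk, the files would need to be opened
in binary mode (or with newline translation disabled) so that the
chosen separators are stored unchanged.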

The goal of these details is that each CD be completely
self-describing, and we have in addition attempted to provide enough
information that if an error in workflow or a bug in the software has
muddled the data, a future reader has a reasonable chance of figuring
out what happened and possibly compensating.  We have verified that
the CD's are readable, the file systems are mountable, the text files
can be read with any standard editor, and the image files can be
opened and displayed properly on a Macintosh, an AIX/UNIX system, and
a DOS/Windows 3.1 system.

A.3.3  Workflow

The overall workflow involves restoring the contents of one or more
Digital Audio Tapes to the hard disk of the Macintosh, then
rearranging and renaming the files in accordance with ISO 9660
standards.  Mitchell Charity developed a script that does this file
rearrangement and renaming, taking into account the problem that most
technical reports occupy more than one CD.  One CD can hold 43 of our
high-resolution images (we chose not to compress the images, to avoid
the requirement that a future reader figure out how to decompress
them).
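
The splitting and renaming step can be sketched as follows.  This is
not Mitchell Charity's script, whose details are not given in this
report; the P0001.TIF naming pattern and the function names are
hypothetical.  The only facts taken from the text above are the
capacity of 43 images per CD and the use of ISO 9660 level 1, which
restricts file names to upper-case letters, digits, and underscore in
an 8.3 pattern.

```python
import re

IMAGES_PER_CD = 43  # capacity per CD stated in this report


def iso9660_level1_name(stem, ext):
    """Return an upper-case 8.3 name using only A-Z, 0-9 and underscore."""
    stem = re.sub(r"[^A-Z0-9_]", "_", stem.upper())
    ext = re.sub(r"[^A-Z0-9_]", "_", ext.upper())
    if not 1 <= len(stem) <= 8 or len(ext) > 3:
        raise ValueError(f"cannot fit {stem!r}.{ext!r} into an 8.3 name")
    return f"{stem}.{ext}"


def plan_volumes(page_count, per_cd=IMAGES_PER_CD):
    """Assign page images 1..page_count to consecutive CD volumes.

    Returns a list of volumes; each volume is a list of
    (page_number, file_name) pairs.  The naming pattern is hypothetical.
    """
    volumes = []
    for first in range(1, page_count + 1, per_cd):
        last = min(first + per_cd - 1, page_count)
        volumes.append([(page, iso9660_level1_name(f"P{page:04d}", "TIF"))
                        for page in range(first, last + 1)])
    return volumes
```

Under these assumptions, plan_volumes(100) yields three volumes
holding 43, 43, and 14 images.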

Approximately 2,600 blank CD's will be required to complete the
preservation project.  Our plan is that the Document Services office
will continue to perform restores from DAT and recording to CD as a
background activity in the months after the contract with CNRI has
expired.  To this end, we attempted, in the last months of the
project, to use our remaining funds to acquire blank media.  As it
happens, a world-wide shortage of blank CD-R media developed just
before we began that acquisition, and it was necessary to do
considerable shopping to locate supplies.  We now have commitments
from suppliers that appear to be sufficient to exhaust available
funds.  Approximately 1,600 CD blanks are either in hand or scheduled
for delivery; we intend to ask the Laboratory for Computer Science
for funds to acquire the remaining blanks.

A.3.4  CD labeling

Labeling of CD's used for preservation turned out to be a problem.
There are two techniques commonly used in the industry, but both are
questionable when it comes to long-term archiving.  Stick-on labels
are available, but they have not been developed with lifetimes
comparable to that of the CD media, and there is risk that they (or
their glue) will deteriorate.  Writing directly on the surface of the
CD with a felt-tip pen is the other commonly-used approach, but there
are reports of damage to the CD (and consequent loss of data) from
some ink solvents.  In another white paper on the topic of data
preservation, Kodak recommends that only pens specifically recommended
for the purpose by the CD blank manufacturer be used.  But when asked
for a recommendation of a pen compatible with its own disk blanks,
Kodak hesitated to identify any specific pen that it considers safe.

In response to these labeling problems, we chose to leave the CD's
unlabeled.  Instead, we use CD blanks that come with unique
manufacturer-supplied serial numbers in both human-readable and
bar-code forms, and create a paper label for the CD jewel box that
identifies the technical report and connects it with the serial number
on the CD.  For this purpose, Mitchell Charity developed a web page
form that accepts a report number, recording date, and CD serial
number, and returns an appropriate PostScript label with matching
human-readable and bar-code serial numbers for printout and insertion
in the CD jewel box.  These labels are printed on acid-free paper, and
because they do not touch the CD blank, we anticipate that they will
not compromise its longevity.
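
To give a sense of what such a label generator involves, here is a
minimal Python sketch that turns the three form fields into a
one-page PostScript document.  It is an assumption-laden
illustration, not the web form described above: the function names,
font, and page coordinates are invented for the example, and the
bar-code rendering is omitted entirely.

```python
# Sketch only: layout, font, and names are hypothetical; no bar code.

def ps_escape(text):
    """Escape backslashes and parentheses for a PostScript string."""
    return (text.replace("\\", "\\\\")
                .replace("(", "\\(")
                .replace(")", "\\)"))


def label_postscript(report_number, recording_date, cd_serial):
    """Return PostScript source for a plain-text jewel-box label."""
    lines = ["%!PS", "/Helvetica findfont 12 scalefont setfont"]
    y = 700  # starting baseline, in points from the bottom of the page
    fields = (("Report", report_number),
              ("Recorded", recording_date),
              ("CD serial", cd_serial))
    for caption, value in fields:
        entry = ps_escape(f"{caption}: {value}")
        lines.append(f"72 {y} moveto ({entry}) show")
        y -= 18  # move down one line
    lines.append("showpage")
    return "\n".join(lines) + "\n"
```

Printing the human-readable serial number from the same input that
would drive the bar code is what keeps the two forms matched.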


A.4  Continued dissemination

The second important area of transition planning is to arrange for
continuing dissemination over the Internet, via Dienst and FTP, of
about 50 Gigabytes of processed, display-ready and print-ready page
images.

As described in the previous quarterly report, we have proposed that
the dedicated computer that currently does this job be transferred to
MIT Information Services (IS), and that the library and IS jointly
take over its operation and the care of the data so as to provide
continued dissemination of the page images. This proposal is still
being reviewed by the two organizations; the recent appointment of a
new director of the MIT Libraries has slowed down the review, and at
this time we are still uncertain of the fate of this proposal.


B.  Technical and research activities:

This quarter was occupied heavily with the operational and
preservation planning activities described in part A, so there is only
one item to report on technical and research topics.


B.1  METHOD AND APPARATUS FOR RETRIEVING INFORMATION FROM A DATABASE.

Mitchell Charity and Eytan Adar filed a patent application with the
above title on the technique that they jointly developed to do an
approximation of nearest-match search using repeated invocations of a
boolean search engine with randomly-chosen terms from the original
search string.
Because the patent is in the area of Information Retrieval, in which
none of the group has special expertise, there was considerable
skepticism that the idea could possibly be new.  However, consultation
with a few experts in that field suggested that it may actually be a
novel idea, so after some agonizing, M. I. T. proceeded with a patent
application.
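
The patent application is not reproduced in this report, so the
following Python sketch is only one plausible reading of the idea as
summarized above.  It assumes a toy in-memory boolean AND engine, a
fixed number of rounds, a fixed number of randomly chosen terms per
round, and a ranking by how often each document is returned; all of
these choices and all of the names are ours, made for illustration.

```python
import random
from collections import Counter


def build_index(docs):
    """Map each term to the set of document ids that contain it."""
    index = {}
    for doc_id, text in docs.items():
        for term in set(text.lower().split()):
            index.setdefault(term, set()).add(doc_id)
    return index


def boolean_and(index, terms):
    """Toy boolean engine: ids of documents containing every term."""
    postings = [index.get(term, set()) for term in terms]
    return set.intersection(*postings) if postings else set()


def approximate_nearest(index, query, rounds=200, sample_size=2, seed=0):
    """Rank documents by how often random sub-queries retrieve them."""
    rng = random.Random(seed)
    terms = sorted(set(query.lower().split()))
    k = min(sample_size, len(terms))
    if k == 0:
        return []
    votes = Counter()
    for _ in range(rounds):
        # Each round is an ordinary boolean query on a random term subset.
        votes.update(boolean_and(index, rng.sample(terms, k)))
    return [doc_id for doc_id, _ in votes.most_common()]
```

In this sketch, a document that matches more of the original search
string is retrieved by more of the random sub-queries, so it
accumulates more votes and ranks higher, even though the underlying
engine only answers exact boolean queries.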


C.  Publications

Gregory Anderson, Rebecca Lasher, and Vicky Reich.  The Computer
Science Technical Report (CS-TR) Project: Considerations from the
Library Perspective. This report, previously published as Stanford
University Technical Report CS-TR-95-1554, was in the current quarter
also made available as M. I. T. Laboratory for Computer Science
Technical Report 693.  (In a previous quarter, a revised version of
this report was also published in the on-line journal Public Access
Computer Systems Review 7(2), 1996.)
