From:       MIT component of CS-TR project
            (Jerry Saltzer and Greg Anderson)

To:         Bob Kahn, CNRI

Subject:    Quarterly Progress report, January 1--March 31, 1996

A.  Scanning and other operational activities
B.  Technical and research activities
C.  Talks and Meetings


A.  Scanning and other operational activities:

A.1.  High-resolution scanning

Production Statistics                   reports       pages

  Cumulative total, December 31, 1995      851       62,955

  Scanning in current quarter:

       LCS Reports                         93        11,003
        AI Reports                         53         5,977
  total for quarter                       146        16,980

  Cumulative total, March 31, 1996        997        79,935

Reports placed on-line:

  post-scan processing                    135        15,500
  directly converted from postscript        4           400
  cumulative total, March 31, 1996        570        46,200

We believe that by the end of the quarter we had identified and found ways
past most of the bottlenecks that have plagued post-scan processing, and
that, in the absence of network problems and human errors, it is now in
principle capable of running at something close to 60,000 pages/quarter.

A.2  Hardware and operations

Last quarter we reported that use of a PowerPC borrowed from LCS for
restoration of files from DAT and subsequent file transfer for post-scan
processing has been quite successful with the exception of FTP performance,
which was about half that of the Quadra 840AV, which is used both for
scanning and FTP.  We have now established that upgrade of the operating
system to a release that just became available and that contains a new
network software package fixes this problem.  On that basis, and as part of
the transition and preservation plan described below, we are acquiring a
PowerPC 8500 for this purpose.  We also acquired an additional 4 GByte disk
to increase our buffer capacity for archive DAT -> Disk -> FTP transfers.

A series of intermittent problems on one of the 4 GByte disks on the Quadra
840AV and on one of the DAT drives on the PowerPC led us to return those
devices under warranty for repair or replacement.

A.3 Quality Control

We are currently depending on student help for reviewing the quality of
scanned images.  The student who had been doing it ran out of time, and a
different student is now picking up that job.

A.4  Preservation

Transition planning for the end of the CS-TR contract has continued at the
next level of detail.  One major area of planning is for the preservation
of the 1.2 Terabytes of raw data that we have accumulated, consisting of 400
pixel/inch 8-bits/pixel greyscale images of near-photographic quality.
Immediately following scanning, this data was transferred to Digital Audio
Tape as the first step in preservation.  Unfortunately, DAT has an expected
media lifetime of only a few years, and a probable technology lifetime of
no more than a decade, perhaps less.

We have been looking for a long time for suitable media for longer-term
preservation, and have recently concluded that recordable CD has a suitable
balance of properties:  the increasingly popular use of CD-ROM's for
computer applications ensures that reading technology will be available for
a long time, and recordable CD media now have an estimated lifetime of 70
to 100 years.  For that reason we are gearing up to begin transfer of the
data from DAT to CD-R, and we have ordered a Yamaha/APS 4X CD recorder to
attach to the Macintosh PowerPC 8500 mentioned earlier in connection with
FTP operations.

Unfortunately, that technology did not become convincingly viable until
just now, and the media transfer will take some time, so there is no way
that it can be completed before the end of the research contract on May 31,
1996.  However, we believe that with the equipment and procedures in place,
the cost of completing this transfer is actually quite low:  it consists
mostly of inserting media, launching the appropriate program, and checking
the results half an hour later.  Our intention is to acquire the equipment
and blank media while the contract is still in place, and then continue the
media transfer activity on a low priority after the contract is over.
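A back-of-the-envelope sketch makes the scale of this job concrete. The
650 MB per CD-R and the roughly 600 KB/s rate of a 4X recorder are standard
figures for the media of that era, assumed here rather than stated in the
report; the 1.2 Terabyte total is from the report.

```python
# Rough estimate of the DAT -> CD-R transfer job (assumed figures noted below).
raw_data_mb = 1.2e6      # 1.2 Terabytes of greyscale images, in MB (from report)
cd_capacity_mb = 650     # standard CD-R capacity (assumption)
write_rate_kb_s = 600    # 4X recorder, approximately 600 KB/s (assumption)

discs = raw_data_mb / cd_capacity_mb
minutes_per_disc = cd_capacity_mb * 1024 / write_rate_kb_s / 60

print(round(discs))             # prints 1846
print(round(minutes_per_disc))  # prints 18
```

Roughly 1,850 discs at under twenty minutes of writing each, which is
consistent with the report's description of inserting media, launching the
program, and checking the results half an hour later.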

A.5  Continued dissemination

The second primary area of transition planning is to arrange for continuing
dissemination over the Internet, via Dienst and FTP, of about 50 Gbytes
of processed, display-ready and print-ready page images.

The server that provides public dissemination of the reports consists of an
IBM RS-6000 computer acquired for this project; it contains 512 MBytes of
RAM and something over 50 Gbytes of disk storage.  We have been working on
a plan under which this computer would be transferred to MIT Information
Services (IS), and the library and IS would jointly take over its operation
and care of the data so as to provide continued dissemination of the page
images.  This proposal is under consideration by both organizations.

In connection with both preservation and continued dissemination, we
inquired about the plans of the other four CS-TR project participants.  We
have not yet heard from Carnegie-Mellon, but Stanford and Berkeley have
both been working out arrangements to transfer these two jobs to their
respective libraries.  At Cornell, the Computer Science department is
taking responsibility for both jobs following the end of the CS-TR funding.

B.  Technical/Research

As predicted, Jeremy Hylton completed his Master of Engineering Thesis in
February.  The thesis was described in some detail in the previous
quarterly report; here is the abstract of the finished thesis:

Identifying and Merging Related Bibliographic Records, by Jeremy A. Hylton.
Master of Engineering Thesis, MIT Department of Electrical Engineering and
Computer Science, February 13, 1996.

Abstract.   Bibliographic records freely available on the Internet can be
used to construct a high-quality, digital finding aid that provides the
ability to discover paper and electronic documents. The key challenge to
providing such a service is integrating mixed-quality bibliographic
records, coming from multiple sources and in multiple formats. This thesis
describes an algorithm that automatically identifies records that refer to
the same work and clusters them together; the algorithm clusters records
for which both author and title match. It tolerates errors and cataloging
variations within the records by using a full-text search engine and an
$n$-gram-based approximate string matching algorithm to build the clusters.
The algorithm identifies more than 90 percent of the related records and
includes incorrect records in less than 1 percent of the clusters. It has
been used to construct a 250,000-record collection of the computer science
literature. This thesis also presents preliminary work on automatic linking
between bibliographic records and copies of documents available on the
Internet.

The thesis is available in Postscript, gzipped Postscript, and text
(without the figures) and it will appear as LCS Technical Report
MIT/LCS/TR-678.  Links to each of these forms can be found under
"publications" and then "theses" on the Library 2000 home page.  The
Digital Index for Works in Computer Science demonstrates the results of the
thesis. It is an index of
255,000 bibliographic records, which have been processed to identify
records that describe the same work.  A link to that demonstration will be
found in the same place.
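The clustering approach the abstract describes, matching records on both
author and title while tolerating cataloging variations, can be illustrated
with a minimal sketch. This is not the thesis code: the Dice-coefficient
similarity over character trigrams, the 0.75 threshold, and the greedy
first-match clustering are all illustrative assumptions (the thesis uses a
full-text search engine to find candidate matches rather than comparing
every pair).

```python
# Illustrative sketch (not the thesis implementation): cluster bibliographic
# records whose author and title both match approximately.

def ngrams(text, n=3):
    """Return the set of character n-grams of a normalized string."""
    t = " ".join(text.lower().split())  # normalize case and whitespace
    return {t[i:i + n] for i in range(len(t) - n + 1)}

def similarity(a, b, n=3):
    """Dice coefficient over character n-grams; 1.0 means identical."""
    ga, gb = ngrams(a, n), ngrams(b, n)
    if not ga or not gb:
        return 0.0
    return 2 * len(ga & gb) / (len(ga) + len(gb))

def cluster(records, threshold=0.75):
    """Greedy clustering: a record joins the first cluster whose
    representative matches on both author and title."""
    clusters = []
    for rec in records:
        for cl in clusters:
            rep = cl[0]
            if (similarity(rec["author"], rep["author"]) >= threshold and
                    similarity(rec["title"], rep["title"]) >= threshold):
                cl.append(rec)
                break
        else:
            clusters.append([rec])
    return clusters

recs = [
    {"author": "Hylton, Jeremy A.",
     "title": "Identifying and Merging Related Bibliographic Records"},
    {"author": "Hylton, Jeremy",   # cataloging variation: no middle initial
     "title": "Identifying and merging related bibliographic records"},
    {"author": "Saltzer, J. H.",
     "title": "End-to-End Arguments in System Design"},
]
print(len(cluster(recs)))  # prints 2: the two Hylton records merge
```

The n-gram comparison is what lets the second record join the first
cluster despite the missing middle initial and the case difference in the
title, which is the kind of variation the thesis algorithm tolerates.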

We have also performed a significant act of technology transfer; after
completing his thesis, Jeremy moved to the Baltimore/Washington area and is
now on the technical staff at CNRI.

C.  Papers and Meetings

C.1  Papers

Gregory Anderson, Rebecca Lasher, and Vicky Reich.  The Computer Science
Technical Report (CS-TR) Project: A Pioneering Digital Library Project
Viewed from a Library Perspective.  Public Access Computer Systems Review
7(2) (1996).  

Here is a review of that paper in Current Cites:

  -- Don't let the apparent focus of this article on computer science
technical reports prevent you from reading this very informative
description of a cutting-edge digital library project. The down-to-earth
advice and information regarding production scanning, copyright issues, and
other topics make this much more than simply another case history.  Another
thing that makes this article a must read is that it describes a working
model for simultaneous searching of a physically distributed archive -- a
model that is an essential one for creating digital libraries that
encompass many collections and yet appear as one to the user. If you wish
to know more, the article references have a number of good pointers, to
which I would add the article in D-Lib Magazine "Creating a Networked
Computer Science Technical Report Library" by James Davis, which provides
more technical detail. -- RT

Jeremy A. Hylton. Identifying and Merging Related Bibliographic Records.
Master of Engineering thesis, M. I. T. Department of EECS, June 1996
(actually completed in February 1996). To appear as LCS Technical Report
MIT/LCS/TR-678.

Jeremy Hylton.  Access and Discovery: Issues and Choices in Designing
DIFWICS.  D-Lib Magazine (March 1996 issue), Corporation for National
Research Initiatives.

C.2  Meetings

Greg Anderson attended the DLIB Repository Interoperability Workshop, CNRI,
Reston, VA, March 11-12, 1996.

Greg Anderson also attended ACM Digital Libraries '96, Bethesda, MD, March
20-23, 1996, where he participated in a panel presentation on the
Repository Interoperability Workshop.
