From:       MIT component of CS-TR project
            (Jerry Saltzer and Greg Anderson)

To:         Bob Kahn, CNRI

Subject:    Quarterly Progress report, October 1--December 31, 1995

A.  Scanning and other operational activities
B.  Technical and research activities
C.  Talks and Meetings

-------------------------------

A.  Scanning and other operational activities:


A.1.  High-resolution scanning

Production Statistics                   reports       pages

  Cumulative total, September 30, 1995*   605         43417

  Scanning in current quarter:

       LCS Reports                         97         13183
        AI Reports                        149          6355
  total for quarter                       246         19538

  Cumulative total, December 31, 1995     851         62955

Please note that the Cumulative total for September 30 has been adjusted
down from the previously reported total by 10 reports and 100 pages--we
found a counting error.

The last several quarterly reports have discussed various changes in
equipment, software, and procedures intended to increase the volume of
high-resolution scanning.  The overall ramp-up in scanning volume has been
a little more than a factor of six.  Here are the quarterly page totals
over the last year (rounded to the nearest 500):

     Oct-Dec, 94        3,000
     Jan-Mar, 95        8,500
     Apr-Jun, 95       12,000
     Jul-Sep, 95       14,000
     Oct-Dec, 95       19,500

We believe that we are now running close to the maximum that can be handled
by one operator using a Fujitsu scanner at this resolution.

Reports placed on-line:

                                        reports       pages
  post-scan processing                    145          9000
  cumulative total, December 31, 1995     430         30000

The obvious mismatch between scanning volume and post-scan processing is
being addressed by reconfiguring the post-scan facilities, as described in
the first and last paragraphs of the next section.

A.2  Hardware and operations

Last quarter we reported that the PowerPC 6600/60 belonging to the Library
2000 group at LCS had been temporarily moved into the Document Services
area to test its effectiveness in restoring images from DAT that had to be
deleted from scanning station disks before they could be processed into
the smaller form needed for on-line delivery.  Use of the PowerPC for
restoration of files from DAT has been quite successful, with the exception
of FTP performance, which is about half that of the Quadra 840AV.  We are
currently investigating the possibility of acquiring a used Quadra, for
which the available FTP code seems to be better tuned.  We added another 4
Gigabyte disk to the PowerPC to accommodate restores, giving it a total of
8 Gigabytes.  The total number of DATs restored and FTPd during the quarter
was approximately 100, representing about 15% of the accumulated backlog.
Errors in 3 tapes (3 images) have been detected to date.

Most of the paper handling difficulties reported last quarter on the
Fujitsu scanner have been resolved.  The only exception is a relatively
small number of old materials (first-generation mimeograph copies from the
1960's) which require hand feeding.

Since scanning operations have reached full stride, near the end of the
quarter we decided to turn our attention to a problem we had deliberately
deferred:  scanning the oldest reports in the LCS and AI archives, dating
from the mid-1960's.  These reports are challenging for several reasons.
First, they have been in storage longest and have endured more traumatic
experiences than have recent materials.  Second, they are generally not
printed on acid-free paper, so they are beginning to turn yellow.
(Fortunately, deterioration is not so advanced that they crumble when
handled.)  Finally, they are printed with older technologies (hectograph,
mimeograph, and early offset) so there is a wide variation in appearance.
Early experiments suggest that in some cases individual touchup will be
required to obtain readable images.  Next quarter we should have more to
report on this topic.  (Note:  this effort will undoubtedly have the effect
of reducing the coming quarter's scanning volume.)

The computer used for post-scan processing at LCS was moved to a dedicated
10 Mb/s switched Ethernet line with the intent of increasing its FTP
throughput and reducing time spent in debates about whether other network
traffic is interfering with that throughput.  Also, several other services
that were sharing that machine have been relocated to other servers to
reduce competition for its cycles.  These changes should have a payoff in
increased rate of flow of scanned images into the on-line repository.


A.3 Quality Control

Jack Eisan and Mike Cook wrote a problem/resolution document. They are
currently refining and enlarging it by careful observation and analysis of
new problems as they arise.  In addition, this document will serve as a
major piece in the overall documentation of the production workflow which
will eventually appear as a Technical Report.

A.4  Publication plans

In addition to the technical report mentioned in the previous paragraph,
Keith Glavash and Jack Eisan began work on an article entitled "High
resolution image capture in a production scanning environment", which will
describe our experience in this project.

A.5  Transition planning

Building upon the discussions since the earliest days of the project, we
have drafted a transition plan intended to ensure continuity of the CS-TR
service after the experiment ends.  The plan is for the project to move
from its research domain into an operational environment shared by the MIT
Libraries and Information Services.  Keith Glavash, head of the Libraries'
Document Services Department, and Greg Anderson have written a strategy
document that seeks to support the transition of the service based on
near-term contributions by the Library 2000 group, the Libraries, and
Information Systems.  For the longer term, a framework has been described
to sustain and grow the service beyond technical reports and to include
other page image data available at MIT.  Plans call for the strategy to
receive broad discussion and for preliminary work on the transition steps
to begin in the January-March quarter.

As locally-developed subsystems mature, and we near the scheduled end of
the project, responsibility for the operation of those subsystems is being
transferred from the research group to more appropriate centers.  This
quarter, Mitchell Charity transformed scan post-processing activity (file
transport, image reduction, and deposit into the on-line repository) from a
labor-intensive prototype into a production system, and transferred
responsibility for its operation from LCS to Document Services.  Similarly,
he transferred operational responsibility for the LCS/AI Reading Room
catalog to the staff of the reading room.


B.  Technical/Research

B.1  Citation accumulation, deduplication, and automatic linking.

Jeremy Hylton's Master of Engineering thesis is beginning to take a very
interesting shape.  He has been completing implementation of the citation
deduplication system described last quarter.  Deduplication is done in two
passes, a gross comparison into potential clusters using a full-text search
engine, followed by a fine comparison of items within each potential
cluster using a string adjacency algorithm.  The collection of materials
being used as a test consists of about 250,000 citations to computer
science research, gathered from various sources around the Internet.
Jeremy has made this system available on the World-Wide Web, at the URL

    http://cslib.lcs.mit.edu/cslib/
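
The two-pass structure described above can be summarized in a short
sketch.  The Python sketch below is illustrative only: coarse_search
stands in for a query against the full-text search engine that returns
the indices of candidate records, and trigram_similar stands in for the
fine comparison described later in this section; neither is the actual
Library 2000 interface.

    # Illustrative sketch of two-pass citation deduplication.
    # coarse_search and trigram_similar are assumed, hypothetical interfaces.
    def deduplicate(citations, coarse_search, trigram_similar):
        """Group raw citation records into clusters of near-duplicates."""
        clusters = []
        assigned = set()
        for i, record in enumerate(citations):
            if i in assigned:
                continue
            cluster = [i]
            assigned.add(i)
            # Pass 1: gross comparison -- ask the full-text search engine for
            # records that share enough terms with this one to be candidates.
            for j in coarse_search(record):
                if j in assigned or j == i:
                    continue
                # Pass 2: fine comparison -- approximate string match.
                if trigram_similar(record, citations[j]):
                    cluster.append(j)
                    assigned.add(j)
            clusters.append(cluster)
        return clusters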

As the collection of citations grew in size to its current 100 MB, it
became necessary to switch the search engine used in gross record
comparison from Glimpse to the heavier-duty Library 2000 search engine,
which is designed to deal with bodies of text in the Gigabyte size range.

For the fine matching pass, simple string comparison was too intolerant of
spelling and typographical errors, so Jeremy wrote a string comparison
program that uses n-grams (with n set to 3, so it is actually using
trigrams) to do approximate comparisons of strings.  The program computes a
vector magnitude for the difference between any two strings by making a
list of all trigrams that appear in either string and deleting from that
list all trigrams that appear in both strings.  The residual list (which
may include repeated trigrams) is treated as a vector with one dimension
for each different trigram and length along that dimension equal to the
number of times the trigram appears.  The program calculates the length of
this difference vector and compares it with a threshold that itself is
calculated from the vector lengths of the original strings.  If the
difference magnitude is less than the threshold, the two strings are
declared to be similar; otherwise not.
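
A minimal Python sketch may make the computation concrete.  The function
names and the 0.5 scaling factor used to derive the threshold from the
lengths of the original strings are illustrative assumptions; the report
does not give the actual formula.

    from collections import Counter
    from math import sqrt

    def trigrams(s):
        """Multiset of the overlapping three-character substrings of s."""
        return Counter(s[i:i + 3] for i in range(len(s) - 2))

    def magnitude(vec):
        """Euclidean length of a trigram-count vector."""
        return sqrt(sum(count * count for count in vec.values()))

    def similar(a, b, scale=0.5):
        """True if the trigram difference vector of a and b is short enough."""
        ta, tb = trigrams(a), trigrams(b)
        # Difference vector: trigrams left after cancelling those that appear
        # in both strings, with one dimension per distinct trigram.
        difference = (ta - tb) + (tb - ta)
        # The scale factor is assumed; the report says only that the threshold
        # is computed from the vector lengths of the original strings.
        threshold = scale * (magnitude(ta) + magnitude(tb))
        return magnitude(difference) < threshold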

The result of the two passes is an automatic grouping of the raw citations
into about 160,000 clusters of closely related citations (the same paper
cited several different ways, or published in different places).  The
result of a query to this system is a list of composite records, one for
each cluster that contains items satisfying the query.  Each composite
record merges information from the various underlying citations, with
different merge strategies for different fields.  In case the user decides
the item described in the composite record is of interest, the record also
contains links to whichever repositories are likely to hold an on-line
version (based on the types of documents in the underlying raw records) and
to nearby library catalogs that may lead to physical copies.  The system
also provides the ability to look at the underlying raw records, if there
is any question.
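
The following hedged sketch suggests how one composite record might be
assembled from a cluster; the actual per-field merge strategies are not
described here, so the field names and rules below (longest title, union
of authors and links, raw records retained) are assumptions for
illustration only.

    def composite_record(cluster):
        """Merge a cluster of raw citation dicts into one composite record.
        Field names and merge rules are illustrative, not the actual system's."""
        return {
            # Keep the most complete-looking title among the raw citations.
            "title": max((r.get("title", "") for r in cluster), key=len),
            # Union of author names drawn from every raw record.
            "authors": sorted({a for r in cluster for a in r.get("authors", [])}),
            # Repository and catalog links gathered from the raw records.
            "links": sorted({x for r in cluster for x in r.get("links", [])}),
            # Retain the raw records so a user can inspect them if in doubt.
            "raw-records": list(cluster),
        }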

Jeremy also studied false matches made by the duplicate detection
algorithm. False matches occur when the matching algorithms place two
actually unrelated records in the same cluster. Since most uses of clusters
would merge clustered records together, such false matches would
effectively hide one or the other of the falsely matched records, or lead
to creation of a misleading composite record. To mitigate this problem, he
investigated various methods of identifying those sets of duplicate records
that contain an unusually wide variation in the source records.

The thesis is scheduled to be finished by the beginning of the Spring term;
we anticipate turning it into a technical report.


C.  Talks and Meetings

C.1  Talks

Jerome H. Saltzer.  Replication for long-term persistence.  Talk given at
Stanford University Digital Library group (October 25, 1995), Xerox Palo
Alto Research Center (October 24, 1995), and Digital Equipment Corporation
Systems Research Center, Palo Alto (October 26, 1995).

Gregory Anderson. The Library and the Network:  Barriers, benefits, and
outcomes.  Seminar presented at the University of Massachusetts, Amherst,
MA., December 6, 1995.

C.2  Meetings

Jerry Saltzer spent the month of October as a visitor to the Digital
Library group at Stanford University, as the guest of Hector Garcia and
Rebecca Lasher.  Most of the time was spent discussing research work with
students and staff of the group, and giving talks about work at M.I.T.

At the October, 1995, Educom conference, Greg Anderson met with Bill Arms
and Mary Berghaus Levering of the Copyright Office to discuss MIT
participation in CNRI's efforts for electronic copyright registration.

Jeremy Hylton and Mitchell Charity attended the Vannevar Bush Symposium at
MIT on October 12 and 13, 1995.
