Library 2000 Quarterly Progress Report

From:       MIT component of CS-TR project
            (Jerry Saltzer and Greg Anderson)

To:         Bob Kahn, CNRI

Subject:    Quarterly Progress report, July 1--September 30, 1995

A.  Scanning and other operational activities
B.  Technical and research activities
C.  Talks, papers, and meetings

-------------------------------

A.  Scanning and other operational activities:


A.1.  High-resolution scanning

Production Statistics                   reports       pages
 
  Cumulative total, June 30, 1995         499         29430

  Scanning in current quarter:

       LCS Reports                         61          7302
        AI Reports                         45          6685
  total for quarter                       106         13987

  Cumulative total, September 30, 1995    605         43417

Reports placed on-line:
                                        reports       pages
  post-scan processing                     64          7000
  postscript conversion                     2           170
  quarter total                            66          7200
  cumulative total                        300         23000


A.2  Hardware and operations

The Hewlett-Packard 4SiMx 600 dpi printer acquired at the end of last
quarter to support scanning quality control was installed and put into
service. Mitchell Charity made the necessary software changes to
accommodate printing.

The PowerPC 6600/60 belonging to the Library 2000 group at LCS was
temporarily moved into the Document Services area to test its effectiveness
in restoring from DAT those images that had to be deleted from scanning
station disks before they could be processed into the smaller form needed
for on-line delivery.  The external disk drive on this PowerPC, an APS-provided
Seagate 4 Gigabyte unit, developed trouble and is being replaced.  While
waiting for the replacement, one of the hard disks from the scanning
station has been transferred to the PowerPC.  Tests will also be made to
determine whether or not a PowerPC might help reduce other scanning
bottlenecks, such as scanner manipulation or checksumming.

Minor modifications were made to the Fujitsu 3096G scanner document feed
mechanism to help eliminate static and provide better page separation.
Early results show improvement in both areas.

Last quarter we reported that one DAT drive was failing and two others were
providing unexpectedly low performance.  All three have now been replaced
by the vendor (APS) under warranty and reinstalled, and all are now
operating properly.

A.3 Quality Control

Mitchell Charity and Jack Eisan have begun a review of the FTP and
processing logs to improve controls and actions based on errors reported.
Currently in place is an e-mail messaging system which reports on any
activity from the logs. All messages generated are being archived at
Document Services.

Quality control reviews of scanned materials are now part of the standard
workflow; the review process is finding a small number of problems with the
scanned images, but for the most part it appears that the procedures in use
produce materials that pass review.


B.  Technical/Research

B.1  New work in progress:  Citation accumulation, deduplication, and
automatic linking.

Jeremy Hylton, who has been working on deduplication, is building a
collection of citations of computer science literature from existing bibtex
files, as well as a variety of web pages, including authors' homepages,
which often contain bibliographic citations.  His system is intended to be
a search/discovery system that allows queries across those citations; the
results of those queries will be citations that include information
accumulated from duplicate records -- and access to the papers when they
are available on-line.

By assembling a large collection of citations, he hopes to achieve
relatively comprehensive coverage of the computer science literature
without a coordinated effort.  One of the goals of the system is to
minimize the need for coordination and standardization among information
providers.  Free text search is one of the key mechanisms that avoids the
need for standardization.

To support this project, in the current quarter Jeremy:

1.  Implemented a deduplicator that recognizes records with the same author
and title and creates union records that organize the duplicates by
publication type (conference paper, journal article, technical report,
etc.).  The union records contain some fields unified across all duplicate
records (like title) and some fields only across records of a particular
type, like the date or publisher.

2.  Completed a draft implementation of a Web walker that identifies and
indexes Web pages that have pointers to computer science papers and
articles.  The system selects pages by following links from computer
science department home pages; the current strategy for choosing links to
follow is weak, so the index includes many irrelevant pages.  A second
draft implementation is intended for next quarter.

3.  Added a link from his search system to the LCS/AI reading room catalog.
When the search turns up a paper that appeared in a conference proceedings
or journal, the system formulates a reading room query to see whether that
proceedings or journal is in the reading room.  The goal here is to provide
integrated access to physical and digital documents, because so much
information does not exist in digital form.

4.  Explored how to handle the abbreviations that appear in bibtex records,
and how to impose some uniformity for things like journal names.  It
appears that the best approach may be to create an authority list for
journals, publishers, and conference proceedings; the system would link the
journal name in a bibtex record to a particular entry in the authority
list.
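
The deduplication step in item 1 can be illustrated with a minimal sketch.
It is written in Python for illustration only and is not Jeremy's
implementation; the record layout, the field names, and the normalize()
rule for deciding that two authors or titles are "the same" are all
assumptions.

```python
from collections import defaultdict

def normalize(text):
    """Lowercase and drop punctuation so near-identical strings compare equal."""
    cleaned = "".join(c if c.isalnum() else " " for c in text.lower())
    return " ".join(cleaned.split())

def build_union_records(records):
    """Group records by (author, title); organize duplicates by publication type.

    Each record is a dict with hypothetical keys "author", "title", "type",
    plus any type-specific fields such as "date" or "publisher".
    """
    unions = {}
    for rec in records:
        key = (normalize(rec["author"]), normalize(rec["title"]))
        union = unions.setdefault(key, {
            "author": rec["author"],   # unified across all duplicates
            "title": rec["title"],     # unified across all duplicates
            "by_type": defaultdict(list),
        })
        # Fields such as date or publisher stay with their publication type.
        details = {k: v for k, v in rec.items()
                   if k not in ("author", "title", "type")}
        union["by_type"][rec["type"]].append(details)
    return list(unions.values())
```

Given a technical report and a conference paper with the same author and
title, this produces one union record with two entries under "by_type",
each carrying its own date and publisher.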


B.2  Document Tracking System

Over the last year Jeremy Hylton, with much input from the scanning
operations team, has developed a Document Tracking System with a 
World-Wide Web interface.  This system helps us manage the scanning
operation by identifying major milestones in the scanning process,
providing a shared database that records when those milestones have been
reached, and giving World-Wide Web browsers access to the current state of
the database. The two currently-used views of the database are a detailed
log of action on a particular technical report and a table of all
milestones achieved for each TR in a particular series.

The summary of milestones for an entire series is a particularly valuable
way to track scanning progress and identify potential problems.  Each
milestone, e.g. scanning completed, original document returned, etc., is
displayed horizontally across the table, along with the date the last
milestone was reached.  This view helps identify reports that have stalled
somewhere in the scanning process -- for example, a report that was scanned
but never processed for distribution.  The table summary also makes it
possible to identify which reports have been scanned -- and which reports
have been skipped for one reason or another. 
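
As a rough illustration of the series summary view, the sketch below
renders one row per report and lists reports that reached one milestone but
not a later one.  It is in Python for illustration only; the short
milestone names, the row layout, and both function names are assumptions,
not the system's actual interface.

```python
from datetime import date

# Hypothetical short column names; the real system's labels may differ.
MILESTONES = ["located", "delivered", "scanned", "returned",
              "processed", "qc1", "qc2"]

def summary_row(report_id, checkoffs):
    """One table row: a mark per milestone, then the latest checkoff date.

    `checkoffs` maps a milestone name to the date it was reached.
    """
    marks = ["X" if m in checkoffs else "." for m in MILESTONES]
    last = max(checkoffs.values()).isoformat() if checkoffs else "-"
    return f"{report_id:<12}" + " ".join(marks) + "  " + last

def stalled(series, reached, missing):
    """Ids of reports that reached one milestone but not a later one."""
    return sorted(rid for rid, c in series.items()
                  if reached in c and missing not in c)
```

Calling stalled(series, "scanned", "processed") would list the reports that
were scanned but never processed for distribution.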

As reported in earlier quarters, the tracking system has been in
production use at Document Services since January, and we incorporated it
into our processing programs this spring, but we have not described it in
any detail until now.

When we first designed the system, we identified 7 milestones that each
document would reach as it was moved through the scanning operation:

1. Original located
2. Original delivered to Document Services
3. Report scanned and taped
4. Original returned
5. FTP from Document Services and processed for quality control
6. Quality control #1: The scanned images were compared with
   a printed copy of the document to verify they were scanned
   correctly.
7. Quality control #2: The image processing routine correctly
   created the images for distribution and placed them in our
   on-line delivery site.

When each milestone is reached, a user (or user program) must check off
that entry in the database. The database records who checked off a
particular milestone and when they did so.

We also included some extra fields in the database:
- an error checkoff that would flag the report as being a problem
- a comment field to describe errors or other issues more fully
- a rough page count (from the bibliographic citation) for use in
generating reports
- the id of the tape the report was stored on
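
One way to picture a tracking entry with these fields is sketched below in
Python.  This is an illustration under assumptions, not the actual schema:
the class name, the field names, and the rule that a repeated checkoff
leaves the first one in place are all hypothetical.

```python
from dataclasses import dataclass, field
from datetime import datetime
from typing import Dict, Optional, Tuple

@dataclass
class TrackingRecord:
    """Hypothetical layout for one report's entry in the tracking database."""
    report_id: str
    # milestone name -> (who checked it off, when they did so)
    checkoffs: Dict[str, Tuple[str, datetime]] = field(default_factory=dict)
    error: bool = False                 # flags the report as a problem
    comment: str = ""                   # fuller description of the issue
    page_count: Optional[int] = None    # rough count from the citation
    tape_id: Optional[str] = None       # tape the report was stored on

    def check_off(self, milestone, user, when=None):
        """Record who reached a milestone and when; keep an existing entry."""
        if milestone not in self.checkoffs:
            self.checkoffs[milestone] = (user, when or datetime.now())
        return self.checkoffs[milestone]
```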

In practice, we have not updated all of the checkoff fields. (It is not
clear if we identified milestones that did not need to be checked off or if
we have failed to follow our intended process completely.) One of the
problems may be that each milestone is an item that must be marked as
completed, but some steps may only be interesting as steps that cannot be
completed:  If we assume that all originals can be located, then perhaps
the "Original located" milestone should actually be a field that is only
marked when an original can *not* be located. 

The database and the Web interface were implemented in Perl, and use both a
DBM file for fast access and a transaction log for long-term reliability. 
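
The pairing of a DBM file with a transaction log can be sketched as
follows.  The real system is written in Perl; this Python version, using
the standard dbm and json modules, only illustrates the general idea.  The
key format, the JSON encoding, and the function names are assumptions.

```python
import dbm
import json

def check_off(db_path, log_path, report_id, milestone, user, when):
    """Append the change to the log first, then update the DBM file."""
    entry = {"report": report_id, "milestone": milestone,
             "user": user, "when": when}
    with open(log_path, "a") as log:
        log.write(json.dumps(entry) + "\n")
    with dbm.open(db_path, "c") as db:   # "c": create the file if needed
        db[f"{report_id}/{milestone}"] = json.dumps(
            {"user": user, "when": when})

def rebuild(db_path, log_path):
    """Recreate the DBM file by replaying every entry in the log."""
    with open(log_path) as log, dbm.open(db_path, "n") as db:  # "n": new, empty
        for line in log:
            e = json.loads(line)
            db[f"{e['report']}/{e['milestone']}"] = json.dumps(
                {"user": e["user"], "when": e["when"]})
```

In this arrangement the log is the long-term record and the DBM file is a
fast index that can be regenerated from it.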

The system includes a library of Perl functions for performing queries
across the database, updating entries, and displaying results.  The library
is intended to make it easier to develop custom reports based on the
database contents -- for example a list of all the reports that have
reached the "scanned and taped" milestone along with their page counts.
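
The example report mentioned above might look like the following sketch.
It is in Python rather than Perl, and the in-memory record layout and both
function names are hypothetical; the actual library's functions are not
described in this report.

```python
def reports_at_milestone(records, milestone):
    """Return (report id, page count) pairs for reports at a milestone.

    `records` maps a report id to a dict with hypothetical keys
    "page_count" and "checkoffs" (milestone name -> checkoff details).
    """
    rows = [(rid, rec.get("page_count"))
            for rid, rec in records.items()
            if milestone in rec.get("checkoffs", {})]
    return sorted(rows, key=lambda row: row[0])

def total_pages(rows):
    """Sum the page counts, treating a missing count as zero."""
    return sum(pages or 0 for _, pages in rows)
```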

We did not use an off-the-shelf database program because we were not
familiar with any affordable and simple systems that would allow multiple
users to read from and write to the database concurrently.  Our own
implementation was relatively straightforward because of the nature of the
database:

- each milestone is checked off only once, and nothing is deleted (this
database requires just append-only or monotonic-change semantics, which
simplifies coordination)
- writes to the database occur infrequently (about 100 per week)
- it is unlikely that more than one user would try to check off a single
milestone
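
The following Python sketch illustrates why append-only, check-once updates
simplify coordination: if two users do check off the same milestone, the
result does not depend on the order in which the updates arrive.  The
tie-breaking rule used here (keep the earliest checkoff) is an assumption
made for illustration, not a description of the actual system.

```python
def merge_checkoff(state, update):
    """Merge one checkoff into the state and return the state.

    `state` maps (report, milestone) to the checkoff that is kept.  Nothing
    is ever deleted, and the earliest checkoff wins, so applying the same
    updates in any order yields the same state.
    """
    key = (update["report"], update["milestone"])
    current = state.get(key)
    if current is None or ((update["when"], update["user"])
                           < (current["when"], current["user"])):
        state[key] = update
    return state
```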


C.  Papers and Meetings

On September 26, Greg Anderson met with Vicky Reich and Rebecca Lasher at
Stanford University to work on the final version of their technical report
on librarianship issues raised by the CS-TR project.  They have submitted a
version of this TR to Public Access Review, an electronic journal for
electronic library issues, and it has been accepted.

On September 28, 1995, Jerry Saltzer began a month-long visit to the
Digital Library group at Stanford University, led by Hector Garcia-Molina
and Terry Winograd.  He will be participating in working
sessions, discussing projects with students, and describing work at M.I.T.