Library 2000 Quarterly Progress Report

From:       MIT component of CS-TR project
            (Jerry Saltzer and Greg Anderson)

To:         Bob Kahn, CNRI

Subject:    Quarterly progress report, July 1--September 30, 1994

A.  Scanning and other operational activities
B.  Technical and research activities
C.  Papers, Theses, Talks, and Meetings

-------------------------------

A.  Scanning and other operational activities:

Scanning of LCS and AI technical reports moved from experimental
trials into full-scale production at MIT Document Services during the
quarter, though the production rate is lower than we hoped for at this
stage.

The hardware compatibility problems with the new Fujitsu 3096G scanner
mentioned in the previous quarterly progress report have now all been
resolved, thanks to considerable effort by Andrew Kass in pestering
the Fujitsu and Blue Ridge technical support staffs.  Most of the
problems turned out to be bugs in the Blue Ridge scanning software
package, OPTIX, which we are using in a configuration that they have
little experience with.  At this time the Fujitsu scanner, run under
OPTIX software, is being used for single-sided originals, while the HP
IICX, using Cirrus software driven by an AppleScript developed by
Eytan Adar, is used for two-sided documents.  During the quarter, 62
technical reports and memos were scanned, totalling 2,161 pages.  The
cumulative total is now 102 documents and 6,104 pages.

The work flow of the scanning process consists of preparation of a
scanning record, scanning at 400 dpi 8 bits/pixel, immediate
checksumming, compression, archive to DAT, FTP to a UNIX server,
decompression, duplicate-archive to an Exabyte tape, conversion to 600
dpi one bit/pixel, and printout for quality control check, at which
point the image finally comes to rest in a storage server where it is
available to the world.  The large physical size of the raw images (16
MBytes/page) produces a bottleneck at every stage of this process, and
development effort is going into finding ways around and through the
bottlenecks.  One bottleneck is disk buffer space at the scanner.  To
accommodate the volumes anticipated from scanner throughput
capabilities, quotes have been obtained for an additional DAT drive
and 8 GB of additional disk space, which would bring the total disk
complement to 12 GB.
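
As a rough check on the storage figures above, the following Python
sketch computes raw image sizes.  The 8.5 x 11 inch page size is an
assumption (the report does not state it), and the function names are
illustrative only.  Under that assumption a 400 dpi, 8 bits/pixel page
comes to about 15 MBytes, the same order as the 16 MBytes/page quoted
above, and 12 GB of disk holds about 750 raw pages at 16 MBytes each.

```python
# Rough storage arithmetic for raw scanned pages (illustrative only).
# Assumption: an 8.5 x 11 inch original; the report does not state page size.

MB = 10**6
GB = 10**9


def raw_page_bytes(width_in=8.5, height_in=11.0, dpi=400, bits_per_pixel=8):
    """Uncompressed size in bytes of one scanned page."""
    pixels = round(width_in * dpi) * round(height_in * dpi)
    return pixels * bits_per_pixel // 8


def pages_that_fit(disk_bytes, page_bytes):
    """Number of whole raw page images that fit in a disk buffer."""
    return disk_bytes // page_bytes


if __name__ == "__main__":
    grayscale = raw_page_bytes()                         # 400 dpi, 8 bits/pixel
    bitonal = raw_page_bytes(dpi=600, bits_per_pixel=1)  # 600 dpi, 1 bit/pixel
    print(f"400 dpi grayscale page: {grayscale / MB:.1f} MB")
    print(f"600 dpi bitonal page:   {bitonal / MB:.1f} MB")
    print(f"16 MB pages in 12 GB:   {pages_that_fit(12 * GB, 16 * MB)}")
```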

In addition, physical paper transport considerations have been
formalized.  Forms were developed for document and process control,
and after some refining they are in use between the Laboratory for
Computer Science, the Artificial Intelligence Laboratory, and Document
Services.  There is also now a regular schedule for document pickup
and transport.

A group of key individuals concerned with scanning issues has been
identified and is now meeting on a regular basis.  This group,
composed of LCS and Document Services (DS) personnel, will focus
primarily on issues related to the activities of DS, the hardware and
software, and all process control and quality control topics.


B.  Technical and research activities:

B.1 Scanned image record and the browser.  A series of discussions
explored how to browse a document in scanned image form.  The outcome
was a conclusion that information in addition to the scanned images
(such as image-number-to-page-number correspondences, identity of
calibration targets, original document size, and whether the document
was intended to be printed one-sided or two-sided) is required to do a
good job of either display for browsing or printing, and that a
standard form is needed to record this information.  The scanned image
record mentioned in the previous quarterly progress report is a first
approximation to the required information, but some work on it will be
needed before it can be used as a standard basis for browsing
image-form documents.  Development of a standard scanned image record
is an ongoing project.
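
To make the kind of information discussed above concrete, here is a
minimal Python sketch of what such a record might hold.  It is not the
project's actual scanned image record format; the class name, field
names, and the example document ID are all hypothetical.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class ScannedImageRecord:
    """Hypothetical per-document record; all field names are illustrative."""
    document_id: str
    original_width_in: float
    original_height_in: float
    two_sided: bool
    # Image number -> printed page label ("iii", "1", ...); None if unnumbered.
    page_labels: Dict[int, Optional[str]] = field(default_factory=dict)
    # Image numbers that show calibration targets rather than document pages.
    calibration_images: List[int] = field(default_factory=list)

    def image_for_page(self, label: str) -> Optional[int]:
        """Return the lowest image number carrying the given page label."""
        for image_no, page_label in sorted(self.page_labels.items()):
            if page_label == label:
                return image_no
        return None

    def content_images(self) -> List[int]:
        """Image numbers a browser would show, skipping calibration targets."""
        return [n for n in sorted(self.page_labels)
                if n not in self.calibration_images]


if __name__ == "__main__":
    rec = ScannedImageRecord(
        document_id="EXAMPLE-TR-1",  # hypothetical ID
        original_width_in=8.5,
        original_height_in=11.0,
        two_sided=True,
        page_labels={1: None, 2: "i", 3: "1", 4: "2"},
        calibration_images=[1],
    )
    print(rec.image_for_page("1"), rec.content_images())
```

A browser could use such a record to jump to a numbered page or to
skip calibration targets when displaying or printing.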

B.2 Unified CS-TR interface.  Gillian Elcock implemented an extension
to the testbed system that allows it to interface more gracefully with
the World-Wide Web.  We previously reported building a World-Wide Web
interface to the reading-room catalog of technical reports and also a
WAIS source for our union catalog of CS-TR bibliographic records.  The
unified interface fits under these two services; whenever a search
turns up a technical report of interest, the unified interface checks
the navigation system described in section B.6 to see whether
scanned images are available for that report.  If so, it fabricates on
the fly a WWW page that contains one button for each page image, thus
providing a kind of simple browser for use from the Web.
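
The page fabrication step can be sketched in a few lines of Python.
This is only an illustration of the idea, not the testbed code: the
function name, the example URL, and the image file naming scheme
(0001.tif, 0002.tif, ...) are assumptions, and plain links stand in
for the buttons.

```python
from html import escape


def page_index_html(report_id, base_url, page_count):
    """Build a simple HTML page with one link per page image.

    The file naming scheme (0001.tif, 0002.tif, ...) is hypothetical.
    """
    title = escape(f"Page images for {report_id}")
    lines = [
        "<html>",
        f"<head><title>{title}</title></head>",
        "<body>",
        f"<h1>{title}</h1>",
    ]
    for n in range(1, page_count + 1):
        href = escape(f"{base_url}/{n:04d}.tif", quote=True)
        lines.append(f'<a href="{href}">Page image {n}</a>')
    lines += ["</body>", "</html>"]
    return "\n".join(lines)


if __name__ == "__main__":
    # Hypothetical report ID and image location.
    print(page_index_html("EXAMPLE-TR-1",
                          "http://example.org/images/EXAMPLE-TR-1", 3))
```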

B.3 Tulip Viewer.  A cooperative project with MIT Information Services
has been underway to develop a fast TIFF document browser for G4 FAX
images.  Work on the project has been led by Bill Cattey of MIT IS,
with ideas and input from Mitchell Charity.  This quarter, Yoav
Yerushalmi modified the viewer to make it compatible with the Library
2000 testbed images from the CS-TR project.

B.4 Circulation system.  Geoff Lee Seyon developed and deployed an
on-line document circulation system for the Library 2000 testbed
system in use at the LCS/AI reading room.  The system records an
e-mail address when materials go out, and it sends overdue notices to
that address.  When this new system was put into operation, it was
discovered to be alarmingly effective.  The number of checked-out
materials that were returned was so large that the reading room had to
order additional shelves--it had previously been depending on patrons
to hold a substantial part of its collection.  (This system is not a
main-line component of the research project; we did it to maintain the
ability to use the LCS/AI reading room as a trial venue for main-line
projects.)
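
The core of such a system is small.  The Python sketch below shows one
way the overdue check might look; it is not the deployed code, and the
14-day loan period, the names, and the message wording are all
assumptions made for illustration.

```python
from dataclasses import dataclass
from datetime import date, timedelta

LOAN_PERIOD = timedelta(days=14)  # assumption; the report gives no loan period


@dataclass
class Loan:
    item: str          # title or ID of the borrowed material
    email: str         # address recorded when the item went out
    checked_out: date


def overdue_notices(loans, today, loan_period=LOAN_PERIOD):
    """Return (address, message) pairs for every loan past its due date."""
    notices = []
    for loan in loans:
        due = loan.checked_out + loan_period
        if today > due:
            message = (f"{loan.item} was due on {due.isoformat()}; "
                       "please return it to the reading room.")
            notices.append((loan.email, message))
    return notices


if __name__ == "__main__":
    loans = [
        Loan("EXAMPLE-TR-A", "a@example.org", date(1994, 9, 1)),
        Loan("EXAMPLE-TR-B", "b@example.org", date(1994, 9, 25)),
    ]
    for address, message in overdue_notices(loans, date(1994, 9, 30)):
        print(address, "-", message)
```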

B.5 Signatures as names.  A current hypothesis of our replication
research is that the underlying stored objects should be thought of as
immutable.  Immutability gives the opportunity to use a long checksum
of the data itself as the unique name of the object, for example as
the opaque string that appears in its URN.  Mitchell Charity developed
an experimental system that names objects in this way.  In addition,
this experimental system provides a testbed in which we can launch
various bit-rotting demons and watch how well the detection and repair
algorithms cope.
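
The naming idea, and the way it supports detection and repair, can be
illustrated with a short Python sketch.  This is not the experimental
system itself: the report does not say which checksum is used, so
SHA-256 appears here only as a stand-in, and the ReplicaStore class
and its methods are hypothetical.

```python
import hashlib


def content_name(data: bytes) -> str:
    """Name an immutable object by a long checksum of its own bytes.

    SHA-256 is a stand-in; the report does not name the checksum used.
    """
    return hashlib.sha256(data).hexdigest()


class ReplicaStore:
    """Hypothetical in-memory replica, keyed by content-derived names."""

    def __init__(self):
        self._objects = {}

    def put(self, data: bytes) -> str:
        name = content_name(data)
        self._objects[name] = data
        return name

    def is_intact(self, name: str) -> bool:
        """An object is intact if its bytes still hash to its name."""
        data = self._objects.get(name)
        return data is not None and content_name(data) == name

    def corrupt(self, name: str, byte_index: int = 0) -> None:
        """Simulate a bit-rotting demon by flipping one bit."""
        damaged = bytearray(self._objects[name])
        damaged[byte_index] ^= 0x01
        self._objects[name] = bytes(damaged)

    def repair_from(self, name: str, other: "ReplicaStore") -> bool:
        """Copy the object from another replica if that copy verifies."""
        if other.is_intact(name):
            self._objects[name] = other._objects[name]
            return True
        return False


if __name__ == "__main__":
    a, b = ReplicaStore(), ReplicaStore()
    name = a.put(b"page image bytes")
    b.put(b"page image bytes")
    a.corrupt(name)
    print("intact after rot:   ", a.is_intact(name))
    print("repaired from peer: ", a.repair_from(name, b))
    print("intact after repair:", a.is_intact(name))
```

Because the name is derived from the data, any replica can verify an
object without consulting a separate catalog of checksums.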

B.6 Navigation system.  Andrew Kass completely overhauled and
upgraded the data management service that underlies our prototype
CS-TR navigation system, which is implemented on top of the Domain
Name Service.  There are currently two navigation services in
operation.  Each one takes a CS-TR ID as its input argument:

1.  ID.url.ltt-ns.lcs.mit.edu returns a URL for the place where images
of the specified document can be found, if images exist.

2.  ID.formats.ltt-ns.lcs.mit.edu returns a list of formats in which
the document is available.
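
The two name forms above can be built mechanically from a document ID.
The Python sketch below only constructs the names; it performs no
lookup, since the report does not say what kind of DNS record carries
the answer.  The function name and the example ID are hypothetical.

```python
NAV_DOMAIN = "ltt-ns.lcs.mit.edu"
SERVICES = ("url", "formats")  # the two services listed in the report


def navigation_name(doc_id: str, service: str) -> str:
    """Build the domain name to query for a document and service."""
    if service not in SERVICES:
        raise ValueError(f"unknown navigation service: {service!r}")
    return f"{doc_id}.{service}.{NAV_DOMAIN}"


if __name__ == "__main__":
    # "EXAMPLE-ID" is a placeholder, not a real CS-TR ID.
    for service in SERVICES:
        print(navigation_name("EXAMPLE-ID", service))
```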

The navigation management system now polls the five CS-TR sites for
new documents once a week.  If a particular site is down at the time
of the poll, it keeps the previous week's data for that site and skips
updating for that week.  Retaining the old data also happens if a
server suddenly "loses" many documents; that is, if the new list of
documents is significantly smaller than the previous list.
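
The retention rule described above amounts to a small decision
function.  The Python sketch below is one possible reading of it, not
the actual implementation: the report says only "significantly
smaller", so the 50 percent threshold is an assumed value, and the
function and parameter names are hypothetical.

```python
def updated_listing(previous, polled, shrink_threshold=0.5):
    """Choose the document list to keep for one site after a weekly poll.

    previous: list of document IDs retained from the last good poll
    polled:   newly fetched list, or None if the site was down
    shrink_threshold: assumed cutoff for "significantly smaller"
    """
    if polled is None:
        # Site was down: keep last week's data and skip the update.
        return previous
    if previous and len(polled) < shrink_threshold * len(previous):
        # Server appears to have "lost" many documents: keep the old data.
        return previous
    return polled


if __name__ == "__main__":
    last_week = ["a", "b", "c", "d"]
    print(updated_listing(last_week, None))                       # site down
    print(updated_listing(last_week, ["a"]))                      # big shrink
    print(updated_listing(last_week, ["a", "b", "c", "d", "e"]))  # normal
```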


C.  Papers, Theses, Talks, and Meetings

Papers:

Marilyn McMillan and Greg Anderson.  "The Prototyping Tank at MIT:
Come on in, the Water's fine", CAUSE/EFFECT, vol. 17, number 3 (Fall
1994), 51-54.

Meetings:

Greg Anderson and Mitchell Charity attended a meeting of the CS-TR
project participants at CNRI in Reston, Virginia, on August 16, 1994.

Jerry Saltzer visited Stanford University on July 14, 1994, where he
met with Hector Garcia-Molina, Rebecca Lasher, and other Stanford
participants to discuss the CS-TR project.

During this quarter, the CS-TR Interoperability Task Force began
holding regular weekly teleconferences.  The group consists of John
Garrett (CNRI), Michael Stonebraker (UC/Berkeley), Jim Davis
(Cornell), Jerry Saltzer (MIT), and Bill Arms (CMU), moderator; Hector
Garcia-Molina (Stanford) joined the group in August.  During the
quarter, progress was made on defining areas of agreement on
interoperability, on starting to define an interchange data model, and
on developing a reference architecture.  The weekly teleconferences
are continuing.