From: MIT component of CS-TR project
(Jerry Saltzer and Greg Anderson)
To: Bob Kahn, CNRI
Subject: Quarterly Progress report, July 1--Sept 30, 1993
A. Mostly "administrative" things, relating to startup of on-line
technical report services:
1. We completed negotiations with CNRI on details of the contract.
M.I.T. signed the contract and we are now making plans to ramp up
activities that have been held in abeyance.
2. Digital Equipment Corporation did not formally respond to our
proposal for grant hardware. In informal contacts we learned that
DEC's External Research Program is in considerable disarray and that
we should not expect any help from that source soon. We have
therefore changed our strategy for acquisition of storage hardware.
We have located one older RS-6000 server that M.I.T. Information
Services may declare surplus and that could then be reallocated to our
project. In addition, we are reducing the scale of our indexing plans
and reallocating one of our indexing servers to storage service.
Finally, while we waited for contract negotiations to complete,
"street" disk prices dropped somewhat, so careful expenditure of the
originally allocated funds may allow us to acquire almost as much
storage capacity as we had originally planned.
3. We received the first shipment of additional hardware under a
grant from IBM to expand the RAM of our three RS-6000 index servers to
the maximum that can be configured, 512 Mbytes, as well as a processor
upgrade for the two smaller (RS-6000/520) machines. As hoped, IBM did
announce a memory price reduction, so we were able to obtain the full
configuration needed while leaving a small part of the grant available
for future purchases. The new hardware, though received fully two
months ago, is not yet installed. And the second hardware shipment
has not yet arrived. (IBM is experiencing some of the same internal
turmoil as is DEC; in this case it is showing up in the sales and
delivery process.)
5. Our scanning station design is now almost complete. Jack Eisan
visited one of the prospective vendors to get a demonstration and a
clearer picture of their product's capabilities. We have decided to
obtain the scanning station by purchasing components from several
vendors rather than by contracting with a single supplier for a
complete system. This approach allows more flexibility in choosing
components, and it also makes it easier to do the acquisition in
stages; we expect to learn things in the first stages of acquisition
that clarify which options to follow in later stages.
B. Mostly technical/research.
1. In the course of designing the scanning station, we uncovered a
large number of interesting system organization issues:
Technical report images may be obtained from several sources (archive
or shelf copies, guillotined cut pages, microfilm or fiche, PostScript
source), and may involve any of a large number of exceptions (color,
photographs, oversize sheets and foldouts, yellowing paper and fading
ink). Scanning can be done at various resolutions (300, 600, 1200
pixels per inch), at various depths (black & white, gray-scale,
color), and with various adjustments (brightness, contrast, linearity,
threshold). In a production setting, the resulting images need to be
aligned, cropped, cleaned up, quality checked, labeled and archived.
Finally, each distinct prospective use of an image (storage, screen
display, printing, optical character recognition) seems to require
distinct processing of the image bits. This latter observation leads
to many possible architectural variations involving tradeoffs between
on-the-fly conversion and storage of multiple forms of the same image.
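The tradeoff between on-the-fly conversion and storage of multiple
forms can be made concrete with a small sketch. The form names and
size ratios below are illustrative assumptions for the sake of the
arithmetic, not measurements or decisions from the project:

```python
# Sketch: one master image per page, plus per-use derived forms.
# All form names and size ratios are illustrative assumptions.

MASTER_MB = 10.0  # assumed size of one raw gray-scale page scan

# Approximate size of each derived form, as a fraction of the master.
DERIVED_FORMS = {
    "archive": 0.01,   # compressed archival copy
    "screen":  0.003,  # downsampled form for screen display
    "print":   0.02,   # high-resolution form for printing
    "ocr":     0.01,   # binarized form for character recognition
}

def storage_per_page(precomputed):
    """MBytes stored per page if the given forms are kept on disk."""
    return sum(MASTER_MB * DERIVED_FORMS[f] for f in precomputed)

# Store every form: pay disk space, answer each request instantly.
store_all = storage_per_page(DERIVED_FORMS)
# Store only the archive copy: convert the other forms on the fly.
store_min = storage_per_page(["archive"])
```

Everything between these two extremes (for example, precomputing only
the screen form, which is requested most often) is a candidate
architecture; the sketch just makes the disk-space side of the
tradeoff explicit.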
One of the single biggest problems is that the resulting data is
rather voluminous, and it is easier to think up work flows that
contain bottlenecks than ones that do not. The initial data from a
good-quality gray-scale-capable scanner consists of about 10
MBytes/page, or about 100 KBytes/page after compression. The scanner
can read 40 pages/minute, yielding about 400 MBytes/minute of raw
data. For
comparison, SCSI-2 drives can read or write 240 MBytes/minute, DAT
drives 1 MByte/minute, and Ethernet transfer typically takes place at
about 0.5 MByte/min. The implication is that one must do some form of
compression very close to the scanner; unfortunately the computer
scanning platform that has been most developed is the Macintosh, but
that is not the easiest environment in which to develop experimental
prototypes. Fortunately, a single human operator can apparently
physically handle, track, coordinate, and check quality on only about
400 pages per hour, an order of magnitude less than the mechanical
capability of the scanner. Thus one can reasonably consider options
such as buffering a day's work on disk and transferring the resulting
data dump over the network or to tape over night.
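The bottleneck arithmetic above can be checked in a few lines, using
the rough figures quoted in the text:

```python
# Back-of-the-envelope check of the scanning bottleneck figures.
# All figures are the approximate ones quoted in the text above.

MB_PER_PAGE_RAW = 10           # raw gray-scale scan, MBytes/page
MB_PER_PAGE_COMPRESSED = 0.1   # about 100 KBytes/page after compression
SCANNER_PAGES_PER_MIN = 40
OPERATOR_PAGES_PER_HOUR = 400  # human handling/QA limit

# Data rate off the scanner, before and after compression (MBytes/min):
raw_rate = MB_PER_PAGE_RAW * SCANNER_PAGES_PER_MIN
compressed_rate = MB_PER_PAGE_COMPRESSED * SCANNER_PAGES_PER_MIN

# A day's work at the human-limited pace, compressed, in MBytes --
# the amount to buffer on disk and move over the network overnight:
day_mb = OPERATOR_PAGES_PER_HOUR * 8 * MB_PER_PAGE_COMPRESSED
```

The raw rate far exceeds every transfer path listed above, while the
compressed rate does not, which is the arithmetic behind the
conclusion that compression must happen close to the scanner.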
2. We implemented an X-Mosaic service gateway to the library 2000
search engine. We began considering various mix-and-match
combinations of the L2000 image browser, the L2000 search interface,
the X-Mosaic image browser, and the X-Mosaic search interface.
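The translation job such a gateway performs can be sketched as
follows. Everything here (the query syntax, the stand-in search
function, the HTML result list) is a hypothetical illustration of the
pattern, not the actual gateway code:

```python
# Hypothetical sketch of a Web-to-search-engine gateway: parse the
# query string from a Web client, call the search engine, and wrap
# the hits in HTML for the browser to display.

from urllib.parse import parse_qs
from html import escape

def gateway(query_string, search):
    """Translate a Web query into a search call; return an HTML list."""
    fields = parse_qs(query_string)
    terms = fields.get("q", [""])[0].split()
    hits = search(terms)  # `search` stands in for the real engine
    items = "".join(f"<li>{escape(h)}</li>" for h in hits)
    return f"<ul>{items}</ul>"

# A toy stand-in for the search engine: substring match over titles.
TITLES = ["Storage, the Unnoticed Revolution", "Electronic Publishing"]

def toy_search(terms):
    return [t for t in TITLES if all(w.lower() in t.lower() for w in terms)]
```

The point of the sketch is that the gateway owns only the translation
in each direction; the search engine and the browser need not know
about each other, which is what makes the mix-and-match combinations
above possible.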
3. We developed a first-draft specification for the interface
semantics of a storage service. A full description will appear in our
next quarterly report.
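To give the flavor of the kind of interface the draft specification
addresses, here is a minimal sketch of a storage service holding
write-once objects named by opaque identifiers. The method names and
semantics are illustrative assumptions, not the draft itself:

```python
# Hypothetical sketch of a minimal storage-service interface:
# write-once objects, named by opaque identifiers.

class StorageService:
    def __init__(self):
        self._objects = {}

    def put(self, name, data):
        """Store an immutable object; refuse to overwrite a name."""
        if name in self._objects:
            raise KeyError(f"object {name!r} already exists")
        self._objects[name] = bytes(data)

    def get(self, name):
        """Return the stored bytes for `name`; KeyError if absent."""
        return self._objects[name]

    def exists(self, name):
        return name in self._objects
```

The write-once restriction in the sketch is one plausible design
choice for archival material: it makes caching and replication safe,
since a named object can never change underneath a copy.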
4. We have participated in on-line discussions via the CS-TR list on
the subjects of image management, readability and screen-display
capabilities, and obtaining text from images. (Primarily with Robert
5. Mitchell Charity has been exploring the image retrieval architecture.
This topic includes:
* overall architecture
What division of labor between publishers and third-party services?
Who provides bibliographic information, awareness service, index service,
and a variety of image, fulltext, and document description information?
How much functionality can be pushed down to the publisher, and how
much should third parties be expected to provide, especially in the
short term where economic incentives are lacking?
Where are standards needed for a viable system? Can third-party services
fill the gaps where standards are unclear? To what extent are function
specifications (i.e., "it must be possible, indeed straightforward,
to do X") sufficient?
* retrieval information
Image resolution, preprocessing and administrative characteristics,
page->image mapping, extended URLs.
* image transport
Can/should existing protocols be used?
* image formats and document structure information
Use of multiple resolutions. Retrieval strategy (clumping, caching, prefetch).
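The "retrieval information" item above can be illustrated with a
small record that maps the pages of one technical report to image
files at several resolutions. The field names, host, and URL scheme
are assumptions for illustration, not a project format:

```python
# Illustrative sketch of retrieval information: a record mapping the
# pages of one technical report to image files at several resolutions.
# Field names, hostname, and URL scheme are hypothetical.

def page_to_image(doc, page, dpi):
    """Return the retrieval URL for one page at a given resolution."""
    images = doc["pages"][page]  # {dpi: filename}
    if dpi not in images:
        raise KeyError(f"no {dpi}-dpi image for page {page}")
    return f"{doc['base_url']}/{images[dpi]}"

TR_EXAMPLE = {
    "base_url": "ftp://host.example.edu/tr-123",  # hypothetical host
    "pages": {
        1: {100: "p001.100.gif", 300: "p001.300.tif"},
        2: {100: "p002.100.gif", 300: "p002.300.tif"},
    },
}
```

A client that knows only the record can fetch the right image over
whatever protocol the base URL names, which is the sense in which the
page-to-image mapping and the extended URL carry the retrieval
information.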
To this end, he has begun:
- shadowing the Stanford bibliographic records
- correcting their format errors
- combining them with retrieval information,
for both Stanford's archival images and local preprocessed copies,
over various protocols (FTP, HTTP (the World Wide Web), WAIS).
- serving them as a wais database
- caching images, with format conversion, for a few of their trs
(A variety of formats have been, and are being, experimented with.)
- using these to experiment with a variety of clients,
including standard, widely available ones.
This collection of activities illustrates: shadowing a publisher who
lacks an awareness service, value-added processing of bibliographic
information (shades of the UMI discussion), third-party maintenance and
service of retrieval information in a locally specified format,
third-party index service via an arbitrary protocol, and third-party image
storage service in a variant format.
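The caching-with-format-conversion activity can be sketched as a
cache keyed by report and format that converts from the publisher's
archival format on a miss. The fetch and conversion functions below
are toy stand-ins, not the actual mechanisms used:

```python
# Sketch of third-party image caching with format conversion: the
# cache holds a locally preferred format and converts from the
# publisher's archival copy only on a miss.

def make_cache(fetch_archival, convert):
    cache = {}
    def get(tr_id, fmt):
        key = (tr_id, fmt)
        if key not in cache:
            cache[key] = convert(fetch_archival(tr_id), fmt)
        return cache[key]
    return get

# Toy stand-ins for the publisher fetch and the format conversion:
def fetch(tr_id):
    return f"raw-bits-of-{tr_id}"

def convert(data, fmt):
    return f"{data}-as-{fmt}"
```

The design point is that the publisher is contacted (and the
conversion paid for) at most once per report and format; repeat
requests are served entirely from the third party's cache.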
The next step is to push some of these functions back onto the
publishers. This can be done where there is a clearly reasonable
standard, and also where there are merely plausible standards but the
function is too expensive for one to expect multiple third-party
solutions to exist.
C. Talks and papers given during the quarter:
Saltzer, J. H. "Storage, the Unnoticed Revolution." Open session
of the Computer Science & Telecommunications Board, National Research
Council, Washington, D. C., May 24, 1993.
Anderson, T. Gregory. "Electronic Publishing." Harvard Library
Automation Planning Committee, Harvard University, Cambridge,
Massachusetts, June 4, 1993.