Library 2000 Quarterly Progress Report

From:     M. I. T. component of CS-TR project
          (Jerome H. Saltzer and T. Gregory Anderson)

To:       Robert E. Kahn, CNRI

Subject:  Quarterly Progress Report, April 1--June 30, 1993

A.  Administrative
B.  Technical/research
C.  Talks and papers

------------------------------

A.  Mostly "administrative" things, relating to startup of on-line
technical report services:

1.  We have continued negotiations with CNRI on details of the contract.
The primary topic of discussion has been the intellectual property
issues of distributing technical reports on-line.

2.  Upon invitation, we prepared a proposal to DEC for hardware to build a
triply-replicated storage server, initially with 20 Gbytes of disk memory. 
(We have recently heard that the funds to grant this proposal may not be
available.)

3.  We ordered hardware under a grant from IBM to expand the RAM of our
three RS-6000 index servers to the maximum that can be configured, 512
Mbytes.  We also ordered a processor upgrade for the two smaller
(RS-6000/520) machines.  This equipment is expected to be installed during
the next quarter.  The grant available was slightly short of the amount
needed for the full complement of memory, so we held back a portion of the
funds in hope of a price reduction (which materialized shortly after the
end of the quarter!).

4.  We identified a suitable path to print bit-mapped images moderately
quickly (30 seconds/page) on the latest generation of HP laser printers.

5.  We designed and specified a scanning station and a workflow for scanned
images, with bottleneck analysis.  We have received the first pass of
quotes from vendors, reviewed them, identified key elements and criteria
for a revised quote, and prepared a request for revised quotes.  We are
currently waiting for contract signature with CNRI, after which we will
submit the configuration for authorization to proceed with procurement.

6.  The MicroReproduction Laboratory has now received the first group of 85
TRs from LCS, and has logged them in, tracking:  TR#, author(s), date,
pages, photographic content, and copyright information.  The log will also
identify missing TRs that must be located elsewhere.

7.  The MicroReproduction Laboratory has initiated weekly departmental
meetings to discuss the project.






------------------------------

B.  Mostly technical/research.

1.  In the course of designing the scanning station, we uncovered a large
number of interesting system organization issues:

Technical report images may be obtained from several sources (archive
or shelf copies, guillotined cut pages, microfilm or fiche, PostScript
source), and may involve any of a large number of exceptions (color,
photographs, oversize sheets and foldouts, yellowing paper and fading
ink).  Scanning can be done at various resolutions (300, 600, 1200
pixels per inch), at various depths (black & white, gray-scale,
color), and with various adjustments (brightness, contrast, linearity,
threshold).  In a production setting, the resulting images need to be
aligned, cropped, cleaned up, quality checked, labeled and archived.
Finally, each distinct prospective use of an image (storage, screen
display, printing, optical character recognition) seems to require
distinct processing of the image bits.  This latter observation leads
to many possible architectural variations involving tradeoffs between
on-the-fly conversion and storage of multiple forms of the same image.

One of the biggest problems is that the resulting data is
rather voluminous, and it is easier to think up work flows that
contain bottlenecks than ones that do not.  The initial data from a
good-quality gray-scale-capable scanner consists of about 10
MBytes/page, or about 100 Kbytes/page after compression.  The scanner
can read 40 pages/minute, yielding 500 MBytes/minute of raw data.  For
comparison, SCSI-2 drives can read or write 240 MBytes/minute, DAT
drives 1 MByte/minute, and Ethernet transfer typically takes place at
about 0.5 MByte/min.  The implication is that one must do some form of
compression very close to the scanner; unfortunately, the most
developed computer scanning platform is the Macintosh, which is not
the easiest environment in which to develop experimental compression
prototypes.  Fortunately, a single human operator can apparently
physically handle, track, coordinate, and check quality on only about
400 pages per hour, an order of magnitude less than the mechanical
capability of the scanner.  Thus one can reasonably consider options
such as buffering a day's work on disk and transferring the resulting
data dump over the network or to tape overnight.
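
The bottleneck arithmetic above can be checked with a short sketch.
It uses only the approximate figures quoted in this report; the 8-hour
shift length and all function and constant names are assumptions made
for illustration, not part of the scanning station design.  Note that
the rounded figures of 10 MBytes/page and 40 pages/minute multiply out
to 400 MBytes/minute, somewhat below the 500 quoted, which suggests the
per-page figure is itself rounded down.

```python
# Approximate figures quoted in the report.
RAW_MB_PER_PAGE = 10.0          # gray-scale scan, uncompressed
COMPRESSED_MB_PER_PAGE = 0.1    # about 100 Kbytes after compression
SCANNER_PAGES_PER_MIN = 40
OPERATOR_PAGES_PER_HOUR = 400

# Channel rates in MBytes/minute, as stated in the report.
CHANNEL_MB_PER_MIN = {
    "SCSI-2 disk": 240.0,
    "DAT tape": 1.0,
    "Ethernet": 0.5,
}

# Assumption (not in the report): an 8-hour scanning shift.
SHIFT_HOURS = 8


def sustainable_pages_per_min(channel_mb_per_min, mb_per_page):
    """Pages per minute a channel can absorb at a given page size."""
    return channel_mb_per_min / mb_per_page


def shift_volume_mb(mb_per_page, pages_per_hour=OPERATOR_PAGES_PER_HOUR,
                    hours=SHIFT_HOURS):
    """Data produced in one shift at the operator-limited rate."""
    return pages_per_hour * hours * mb_per_page


def transfer_hours(volume_mb, channel_mb_per_min):
    """Hours needed to move volume_mb over a channel."""
    return volume_mb / channel_mb_per_min / 60.0


if __name__ == "__main__":
    for name, rate in CHANNEL_MB_PER_MIN.items():
        raw = sustainable_pages_per_min(rate, RAW_MB_PER_PAGE)
        packed = sustainable_pages_per_min(rate, COMPRESSED_MB_PER_PAGE)
        print(f"{name}: {raw:.2f} raw pages/min, "
              f"{packed:.0f} compressed pages/min")
    day = shift_volume_mb(COMPRESSED_MB_PER_PAGE)
    for name, rate in CHANNEL_MB_PER_MIN.items():
        print(f"{name}: {transfer_hours(day, rate):.1f} h "
              f"for one compressed shift")
```

Under these figures, one operator-limited shift produces about 320
MBytes of compressed data, which would take roughly 5.3 hours to move
at 1 MByte/minute and roughly 10.7 hours at 0.5 MByte/minute.  That is
consistent with buffering on disk and transferring overnight.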

2.  We began implementation of an X-Mosaic service gateway to the Library
2000 search engine.  We began considering various mix-and-match
combinations of the L2000 image browser, the L2000 search interface, the
X-Mosaic image browser, and the X-Mosaic search interface.

3.  We developed a first-draft specification for the interface semantics of
a storage service.  A full description will appear in our annual report.

4.  We have participated in on-line discussions via the CS-TR list on the
subjects of image management, readability, screen display capabilities,
and obtaining text from images.  (Primarily with Robert Wilensky.)

5.  Mitchell Charity has been exploring the image retrieval architecture.
This topic includes:
 * overall architecture
     What division of labor between publishers and third-party services?
     Who provides bibliographic information, awareness service, index
     service, and a variety of image, fulltext, and document description
     information?
     How much functionality can be pushed down to the publisher, and how
     much should third parties be expected to provide, especially in the
     short term where economic incentives are lacking?
     Where are standards needed for a viable system?  Can third-party
     services fill the gaps where standards are unclear?  To what extent
     are function specifications (i.e., "it must be possible, indeed
     straightforward, to do X") sufficient?
 * retrieval information
     Image resolution, preprocessing and administrative characteristics,
     page->image mapping, extended URLs.
 * image transport
     Can/should existing protocols be used?
 * image formats and document structure information
     Use of multiple resolutions.  Retrieval strategy (clumping, caching,
     prefetch).

To this end, he has begun:
 - shadowing the Stanford bibliographic records
 - correcting their format errors
 - combining them with retrieval information,
     for both Stanford's archival images and local preprocessed copies,
     over various protocols (FTP, HTTP (World Wide Web), WAIS)
 - serving them as a WAIS database
 - caching images, with format conversion, for a few of their TRs
     (a variety of formats have been, or are being, experimented with)
 - using these to experiment with a variety of clients,
     including standard, widely available ones.
 This collection of activities illustrates: shadowing a publisher who lacks
an awareness service, value-added processing of bibliographic information
(shades of UMI discussion), third-party maintenance and service of retrieval
information in a locally specified format, third-party index service
via an arbitrary protocol, and third-party image storage service in a variant
format.
 The next step is to push some of these functions back onto the publishers.
This can be done where there is a clearly reasonable standard,
and also where there are merely plausible standards but the function is
too expensive for one to expect multiple third-party solutions to exist.


C.  Talks and papers given during the quarter:

Saltzer, J. H.  "Storage, the Unnoticed Revolution."  Open session
of the Computer Science & Telecommunications Board, National Research
Council, Washington, D. C., May 24, 1993.

Anderson, T. Gregory.  "Electronic Publishing."  Harvard Library
Automation Planning Committee, Harvard University, Cambridge,
Massachusetts, June 4, 1993.