From:        MIT component of CS-TR project
             (Jerry Saltzer and Greg Anderson)

To:          Bob Kahn, CNRI

Subject:     Quarterly Progress report, January 1--March 31, 1994

A.  Administrative
B.  Technical/research
C.  Talks and papers
D.  Meetings

------------------------------

A.  Mostly "administrative" things, relating to startup of on-line
technical report services:

The previously reported difficulties in administration of our hardware
grant from IBM have been resolved, and all three of our RS/6000
servers have now been upgraded so that each contains 0.5 Gigabytes of
RAM, to allow us to experiment with large RAM indexes and other ways
of exploiting large RAM.  The full-text index for the entire M.I.T.
library catalog is now resident in one of these servers, and another
is being used for an M.Eng. thesis on alternative hashing techniques,
started this quarter and reported below.

We completed acquisition of the scanning system described in the
previous quarterly report, installed it in the Document Services
facility of the M.I.T. Library, and began to work out the kinks in
the scanning workflow.  By the close of the quarter we had established
a schedule of regular scanning of Technical Reports, and seven LCS
TR's, totaling about 1000 pages, had been scanned.  Although our scanner is
capable of 400 dpi, the software package that provides appropriate
workflow control believes it is working with an older, 300 dpi
scanner, and sets it accordingly.  This incompatibility will be
resolved as soon as we acquire our production scanner, for which we
have obtained quotes and expect soon to request purchase approval.

Our overall scanning strategy is to extract the maximum amount of
information possible with the available hardware.  At the current
time, this means scanning at 400 dpi, 8 bits (grey-scale) per pixel.
This decision leads to very large images, about 16 Mbytes per page (a
letter-size page at 400 dpi and one byte per pixel is roughly 3400 x
4400 pixels), and a need to organize storage and workflow carefully;
at the current rate of scanning, we are acquiring about 1 Gigabyte of
data per day.
Our approach is to compute a checksum signature of the raw data
immediately, archive the raw data to tape, and then compress the data
(reversibly) and transfer it out of the scanning station to a server site for
additional processing before storing it for distribution.  Geoff Lee
Seyon, a new undergraduate member of our group, identified software
suitable for integrity-checking and transmission compression of
scanned images.  His primary criteria were that the software be
off-the-shelf, be capable of doing many files with a single command in
the Macintosh environment, and have implementations both on the
Macintosh and UNIX.
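
For illustration, the following sketch shows the shape of this step in
Python, with MD5 checksums and gzip compression standing in for
whatever off-the-shelf tools are finally chosen; the names and file
layout are hypothetical.

    import gzip
    import hashlib
    import shutil
    from pathlib import Path

    def checksum_and_compress(raw_path: Path, out_dir: Path) -> str:
        """Checksum a raw scanned image, then write a reversibly
        compressed copy for transfer off the scanning station.
        MD5 and gzip are illustrative stand-ins, not the tools
        actually selected."""
        # Compute the signature of the raw bits first, so corruption
        # anywhere downstream can be detected against it.
        digest = hashlib.md5(raw_path.read_bytes()).hexdigest()

        # Reversible (lossless) compression: the original bits can be
        # reconstructed exactly at the server site.
        compressed = out_dir / (raw_path.name + ".gz")
        with open(raw_path, "rb") as src, gzip.open(compressed, "wb") as dst:
            shutil.copyfileobj(src, dst)

        # Keep the signature next to the compressed file.
        (out_dir / (raw_path.name + ".md5")).write_text(digest + "\n")
        return digest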

We are also exploring related issues such as how to determine optimum
brightness and contrast settings, how best to do quality control, and
how often test targets should be run through the system to maintain
quality.  In connection with the need for test targets, we are
obtaining a set of such targets, along with technical and
standards-related materials for scanning, from the Association for
Information and Image Management (AIIM).

Technical work on design of the scanning station has been done for the
most part by Jack Eisan, of Document Services, and Jeremy Hylton, an
undergraduate member of the Library 2000 research group.  To
coordinate the operational aspects of this activity, the Document
Services staff and its management now hold a regular series of weekly
meetings to review progress and operational problems.

During the quarter, we acquired about 14 Gigabytes of magnetic disk
memory.  Those disks have been installed on an RS/6000 storage server
acquired to serve both the CS-TR project and the TULIP project,
another image-delivery experiment in which we are participating.  That
server is now in operation.

Mitchell Charity brought our storage service FTP site into
production, providing bibliographic records for all LCS/AI technical
reports, authoritative PostScript files for some of the most recent of
those reports, and a place from which to distribute the scanned
images that are beginning to flow from the Document Services scanning
system.  This storage service also provides a mirror site for the
bibliographic records of the other four universities working on the
CS-TR project.  One feature of the mirror site is a picky syntax
checker, which makes a list of syntax errors found in bibliographic
records from other sites.  (It is, of course, not necessary to run the
picky checker on our own records.)
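
To give the flavor of such a checker, the sketch below assumes a
simple line-oriented "TAG:: value" record format and a hypothetical
set of required fields; the real records and checker follow the
project's own bibliographic record format, which has more fields and
rules than shown here.

    import re
    import sys

    # Hypothetical required tags; the project's record format defines
    # its own (larger) set of fields and rules.
    REQUIRED_TAGS = {"ID", "ENTRY", "TITLE", "AUTHOR", "DATE"}
    TAG_LINE = re.compile(r"^([A-Z][A-Z-]*)::\s?(.*)$")

    def check_record(lines):
        """Return a list of complaints about one bibliographic record."""
        complaints = []
        seen = set()
        for number, line in enumerate(lines, start=1):
            if not line.strip():
                continue
            if line[0] in " \t":
                continue        # continuation of the previous field
            match = TAG_LINE.match(line)
            if match is None:
                complaints.append("line %d: not of the form TAG:: value" % number)
                continue
            seen.add(match.group(1))
        for tag in sorted(REQUIRED_TAGS - seen):
            complaints.append("required field %s:: is missing" % tag)
        return complaints

    if __name__ == "__main__":
        with open(sys.argv[1]) as f:
            for complaint in check_record(f.readlines()):
                print(complaint)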

Jeremy Hylton created a World-Wide-Web home page and organized the
rest of the research group to create, under that home page, a
collection of materials that provides a showplace and distribution
center for our research work.


B.  Technical/Research

This quarter, several students, ranging from first-year through
Master's level, joined the group and launched quite a range of
projects.

B.1.  Data integrity checking.  Kom Tukonovit began an Advanced
Undergraduate Project (under the latest revision of the MIT EECS
department curriculum, the Advanced Undergraduate Project replaces the
old undergraduate thesis) to develop techniques for integrity checking
to help ensure long-term survival of massive quantities of on-line
data.  The basic survival concept we are exploring is to create
geographically separated complete replicas of the data.  Most research
on replication concentrates on atomic, reliable update rather than on
long-term survival of rarely updated data.  The most common operations
on library data are to read and to add new data, and the rate of
intentional modification is probably less than the rate of accidental
bit decay, so a different emphasis is required.

The first component of the survival system is a data reader,
associated with each replica, that continually compares that replica's
current data with a record of yesterday's data.  The second component
is a cross-checker that compares the replicas.  The third component is
a repair-list generator.  The objective of the project is to create a
design for these three components that is so straightforwardly simple
that one can make a credible claim that it will work correctly.
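
A minimal sketch of the first two components, using MD5 checksums and
a JSON file as "yesterday's record" (both are stand-ins chosen for
illustration, not the project's design):

    import hashlib
    import json
    from pathlib import Path

    def snapshot(replica_root: Path) -> dict:
        """Checksum every file in a replica; the mapping from relative
        path to digest is that day's record of the replica's contents."""
        record = {}
        for path in sorted(replica_root.rglob("*")):
            if path.is_file():
                relative = str(path.relative_to(replica_root))
                record[relative] = hashlib.md5(path.read_bytes()).hexdigest()
        return record

    def compare(old: dict, new: dict) -> list:
        """Report anything that changed or disappeared.  New files are
        expected (the library mostly adds data), so additions are not
        treated as damage."""
        problems = []
        for name, digest in old.items():
            if name not in new:
                problems.append("missing: " + name)
            elif new[name] != digest:
                problems.append("changed: " + name)
        return problems

    # The data reader applies compare() to yesterday's and today's
    # records of one replica; the cross-checker applies the same
    # compare() to the records of two separated replicas, and its
    # output feeds the repair-list generator.
    if __name__ == "__main__":
        record_file = Path("yesterday.json")       # hypothetical name
        today = snapshot(Path("/replica"))         # hypothetical mount point
        if record_file.exists():
            for problem in compare(json.loads(record_file.read_text()), today):
                print(problem)
        record_file.write_text(json.dumps(today, indent=1))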

B.2.  Minimal perfect hashing.  Yuk Ho began a Master of Engineering
thesis exploring the idea of using minimal perfect hashing to speed up
indexing in large RAM indexes.  With standard hash-indexing, several
different keys may hash to the same index position, and some mechanism
(buckets, open-addressing, etc.) is needed to separate them.  Perfect
hashing is a technique for precomputing a hash-function that, for a
given set of keys, produces a unique index for each different key.
With minimal perfect hashing, the index set is, in addition, compact.
This is a topic that has been explored theoretically, and some toy
implementations have been done, but the idea has not, as far as we can
tell, been tried in practice.  During the quarter, Yuk located the
literature on the topic, identified a good implementation, integrated
a copy of it into the Library 2000 search engine, and began
performance analysis.  A report on the result will appear in his
M.Eng. thesis.
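
The following sketch illustrates the idea (though not the construction
Yuk is evaluating): for a small fixed key set it searches by brute
force for a salt under which an ordinary hash is collision-free modulo
the number of keys, so every key maps to its own slot in a table of
exactly that size.

    import hashlib

    def _slot(salt: int, key: str, n: int) -> int:
        digest = hashlib.md5(("%d:%s" % (salt, key)).encode()).hexdigest()
        return int(digest, 16) % n

    def build_minimal_perfect_hash(keys):
        """Brute-force search for a salt under which the n keys occupy
        n distinct slots, i.e. a minimal perfect hash for this set.
        Practical only for small sets; the implementations surveyed in
        the thesis scale to much larger key sets."""
        n = len(keys)
        for salt in range(1_000_000):
            if len({_slot(salt, k, n) for k in keys}) == n:
                return salt
        raise ValueError("no salt found for this key set")

    # Usage: index terms become table offsets directly, with no
    # buckets or probing at lookup time.
    terms = ["library", "catalog", "hashing", "index", "scan"]
    salt = build_minimal_perfect_hash(terms)
    table = [None] * len(terms)
    for term in terms:
        table[_slot(salt, term, len(terms))] = "postings for " + term
    print(table[_slot(salt, "hashing", len(terms))])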

B.3.  URN to URL via DNS.  Ali Alavi continued to work on a Master of
Engineering thesis exploring the idea that the Internet Domain Name
System (DNS) might be a useful tool for relating permanent document
identifiers (in discussions on the Internet, these identifiers are
sometimes called Universal Resource Names, or URN's) with location
information (termed Universal Resource Locators, or URL's).  His
thesis will include embedding a practical implementation in the
Library 2000 testbed system.
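
The general flavor of the idea, though not the thesis's actual design,
can be sketched as follows: a URN is rewritten into a DNS name in a
zone set aside for document identifiers, and the corresponding URLs
are stored under that name as TXT records.  The naming convention, the
zone name, and the choice of record type below are all illustrative
assumptions.

    # The naming convention, zone name, and use of TXT records below
    # are illustrative assumptions, not the design in the thesis.
    import dns.resolver      # third-party "dnspython" package

    def urn_to_dns_name(urn: str) -> str:
        """Map a hypothetical URN such as 'urn:lcs-tr:MIT-LCS-TR-123'
        onto a DNS name under a zone set aside for document names."""
        scheme, authority, identifier = urn.split(":", 2)
        if scheme != "urn":
            raise ValueError("not a URN: " + urn)
        return "%s.%s.urn.example.edu" % (identifier.lower(), authority)

    def resolve_urn(urn: str) -> list:
        """Return the candidate URLs recorded for this URN."""
        answer = dns.resolver.resolve(urn_to_dns_name(urn), "TXT")
        urls = []
        for rdata in answer:
            urls.extend(part.decode() for part in rdata.strings)
        return urls

    # print(resolve_urn("urn:lcs-tr:MIT-LCS-TR-123"))  # needs a populated zone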

B.4.  Text-Image maps.  Jeremy Hylton began a project to relate
scanned images to the corresponding ASCII text.  The goal here is to
allow two symmetric features: in one direction the user selects a
region of interest in a scanned image that contains, for example, a
reference or a page number.  The image-text map allows the system to
identify the ASCII text that corresponds to the selected region, and
thence look up the reference or turn to the indicated page.  In the
other direction, a user initiates a catalog search and finds a
document; the text-image map allows the system to locate the
appropriate page of the document and highlight the region of the image
that contains the words or phrases in the original search.  During
this quarter, Jeremy worked out methods of generating bounding box
information and did an initial demonstration.  He also looked into
exploiting a feature of some OCR software, such as Xerox ScanWorX,
that returns not just ASCII text but also the coordinates of the
bounding boxes in which it found the text.
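
A sketch of the kind of data structure involved follows; the field
names are hypothetical, and coordinates are taken to be pixel
positions on the page image.

    from dataclasses import dataclass
    from typing import List, Optional

    @dataclass
    class WordBox:
        """One entry of a text-image map: a word of the ASCII text and
        its bounding box on the page image, in pixels."""
        word: str
        page: int
        x0: int
        y0: int
        x1: int
        y1: int

    def word_at(boxes: List[WordBox], page: int, x: int, y: int) -> Optional[WordBox]:
        """Image-to-text direction: which word did the user select?"""
        for box in boxes:
            if (box.page == page and box.x0 <= x <= box.x1
                    and box.y0 <= y <= box.y1):
                return box
        return None

    def boxes_for(boxes: List[WordBox], term: str) -> List[WordBox]:
        """Text-to-image direction: which regions should be highlighted
        for a term from the user's catalog search?"""
        return [box for box in boxes if box.word.lower() == term.lower()]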

B.5.  Automatic citation lookup.  Undergraduate Eytan Adar began a
project to take citations as found in typical documents and identify
the corresponding documents by performing a catalog search.  This
problem, which seems easy at first, is severely complicated by the
existence of varying citation styles, multiple forms for author and
publisher names, incorrect and incomplete citations, and other,
similar, real-world variations.  As a result, heuristics seem to be an
essential component of any automated lookup process.  During this
quarter, Eytan put together and tested an initial set of heuristics,
and he is now engaged in refinement to improve their precision and
recall.
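
To give a sense of the flavor of such heuristics (these particular
patterns are illustrative, not Eytan's), a few regular expressions can
pull out the fields most useful as catalog search keys:

    import re

    # Illustrative heuristics only; a real citation parser needs many
    # more patterns plus a scoring step to rank candidate matches.
    YEAR = re.compile(r"\b(19|20)\d\d\b")
    QUOTED_TITLE = re.compile(r'["\u201c](.+?)["\u201d]')
    LEADING_NAME = re.compile(r"^(.+?)[.,]\s")

    def parse_citation(citation: str) -> dict:
        """Extract rough author, title, and year fields for a search."""
        fields = {}
        years = [m.group(0) for m in YEAR.finditer(citation)]
        if years:
            fields["year"] = years[-1]   # the year usually comes last
        title = QUOTED_TITLE.search(citation)
        if title:
            fields["title"] = title.group(1).strip(" ,.")
        name = LEADING_NAME.match(citation)
        if name:
            fields["author"] = name.group(1)
        return fields

    # A sample citation string used only as test input.
    print(parse_citation(
        'Saltzer, J., "Technology, Networks, and the Library of '
        'the Year 2000," 1992.'))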

B.6.  Document browsing.  Undergraduate Andrew Kass started design of
a modular widget to allow document image browsing to be embedded
easily in applications that use the X Window System.  In addition,
Bill Cattey,
of MIT Information Systems, has taken over development of a portable
high-performance document image browser and has integrated it with the
University of Washington's Willow Z39.50 search client.

B.7.  Image flow and handling.  Jeremy Hylton wrote a utility to
convert 300 dpi gray-scale images to 600 dpi monochrome form for
printing.  This utility is one of several needed at various processing
points for images.  The basic issue being explored here is that
different image transformations are needed for printing, display on
various output devices, and for long-term storage.  To speed up any
particular display situation, one can precompute the required form,
but holding the results until they are needed requires storage, and
the precomputation takes time that is wasted if the result is never
actually used.  The tradeoffs are complex and constantly changing
with technology.
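
A minimal sketch of such a conversion, assuming the Pillow imaging
library (the report does not describe how Jeremy's utility actually
performs the conversion):

    # A sketch assuming the Pillow imaging library; the actual utility
    # and its halftoning method are not described here.
    from PIL import Image

    def gray300_to_mono600(in_path: str, out_path: str) -> None:
        """Double the resolution of a 300 dpi grey-scale scan and reduce
        it to one bit per pixel (with dithering) for a 600 dpi printer."""
        gray = Image.open(in_path).convert("L")              # 8-bit grey
        doubled = gray.resize((gray.width * 2, gray.height * 2))
        mono = doubled.convert("1")                          # dithered 1-bit
        mono.save(out_path, dpi=(600, 600))

    # gray300_to_mono600("tr-001-page-01.tif", "tr-001-page-01-print.tif")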

B.8.  Image descriptors.  It is apparent that when scanning a document,
one should capture more than just the bits of the image, but it is not
yet clear exactly how much more.  Information such as the resolution
of the original scanner is needed to display the images correctly.
Information such as that image 5 is a color scan and image 6 is a
monochrome scan of the same page is needed to interpret the set of
images properly.  Information about pagination and position of blank
pages is needed for a browser to display the images in the proper
order when a user says "next page".  Scans of standard test targets
perhaps should also be added to the set of images of a document to
allow calibration of output devices.  During the quarter we began to
design a scanned image descriptor, whose purpose is to capture these
various categories of information in a standard, machine-readable
form.
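
One possible shape for such a descriptor is sketched below; the field
names and the use of a modern structured-text encoding are
illustrative only, since the design is still in progress.

    import json

    # Field names and structure are hypothetical; the descriptor design
    # is still being worked out.
    descriptor = {
        "document-id": "MIT-LCS-TR-000",          # placeholder identifier
        "scanner": {"resolution-dpi": 400, "bits-per-pixel": 8},
        "images": [
            {"file": "image-005.tif", "page": 12, "kind": "color"},
            {"file": "image-006.tif", "page": 12, "kind": "monochrome"},
            {"file": "image-007.tif", "page": 13, "kind": "blank"},
            {"file": "target-aiim.tif", "page": None, "kind": "test-target"},
        ],
        # What the browser should follow when the user says "next page".
        "page-order": [12, 13],
    }

    print(json.dumps(descriptor, indent=2))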

C.  Talks and Papers

None applicable this quarter.


D.  Meetings

Greg Anderson and Mitchell Charity attended a 1.5-day meeting of the
CS-TR project participants at Stanford University in Palo Alto,
California, on February 10 and 11.  At this meeting, Greg presented a
list of issues from the perspective of the librarian, and thereby
launched a series of discussions on librarianship aspects of the CS-TR
project.

Mitchell Charity, Greg Anderson, and Jerry Saltzer participated in a
two-hour PictureTel conference with the other CS-TR project
participants on January 6.
