To: Bob Kahn, CNRI
Subject: Quarterly Progress Report, April 1--June 30, 1994
Delivery of the hardware under a grant from the IBM Corporation was completed, with the arrival of four RS/6000 220 desktop workstations, to be used by students and staff for program development.
Document Services took delivery of a second scanner, a Fujitsu 3096G. This device is nominally capable of much faster scanning than the previously acquired Hewlett-Packard scanner. However, as of the end of the quarter a number of hardware and software compatibility problems had emerged and were not yet resolved. The two most important are that the Fujitsu's white-level calibration changes when switching from the flatbed to the automatic document feeder, and that transmission of scanned data to the controlling computer is substantially slower than the scanning rate. We are in contact with the vendors of both the hardware and the software to resolve these problems, and are also investigating alternate suppliers of scanning software.
The controlling computer, a Quadra 840av, was enhanced by adding a second SCSI channel, an FWB Jackhammer card. The large-capacity disk drives and the DAT drive use this high-performance channel, while the two scanners use the native SCSI channel. This change, together with software improvements from Apple, makes overlapped I/O operations possible among the disks and between disks and scanners.
The disk SCSI chain was shortened by using special "stackits" designed by APS Technologies for use with their stacked drive enclosures, thus minimizing the potential for noise interference. Software upgrades and enhancements were made by Jerry Saltzer and Andrew Kass to correct memory usage problems and provide other system improvements.
For the production scanning operation, the group established several scanning targets (supplementary objects to be scanned in addition to the pages of a document). These include technical calibration and quality control targets, blank page replacements, and funding acknowledgement and scanning agent identification pages. A text file describing the scanning conditions will also be created for each scanned document. Jerry Saltzer proposed criteria for the content of this file for review by the group.
Testing of workflow and of the effectiveness of current software tools was performed by Document Services personnel in conjunction with the staff at LCS. Methods for file compression, checksumming for file integrity, and file transmission were tested. Image files were also backed up to DAT tape, both to test the procedures and tools to be employed for archiving and to familiarize Document Services personnel with them.
By the end of the quarter, a total of about 40 technical reports, totalling 3900 pages and nearly 40 Gigabytes of raw data, had been scanned at 300 dpi with 8-bit greyscale, starting with second-generation originals. As experience is gained and the compatibility problems are resolved, a step up in scanning resolution to 400 dpi, 8-bit, and to use of first-generation originals should occur some time early in the next quarter.
In a related but slightly different area, following suggestions of the M.I.T. intellectual property counsel and in consultation with the publications office of the Laboratory for Computer Science, we developed a one-page alerting notice and permission-granting form for authors of technical reports. This notice points out the implications of electronic distribution, reminds authors of their rights and responsibilities concerning intellectual property, and secures their permission to publish and distribute their technical report both on paper and electronically. The publications office of LCS began using this notice for newly-published technical reports in May, 1994; there is currently no intent to track down authors of previously-published technical reports and ask them to read and sign the form.
B. Technical/Research

This quarter, three new students joined the group, and three previous
student members completed academic theses. Several other projects got to
the point of demonstration.
B.1. Data integrity checking. Komkit Tukovinit completed an Advanced
Undergraduate Project (under a recent revision of the M.I.T. EECS
department curriculum, the Advanced Undergraduate Project replaces the
undergraduate thesis) to begin developing techniques for integrity checking
to help ensure long-term survival of massive quantities of on-line data.
The basic survival concept we are exploring is to create geographically
separated complete replicas of the data. Most research on replication
concentrates on atomic, reliable update rather than on long-term survival
of rarely updated data. The most common operations on library data are to
read and to add new data; the rate of intentional modification is probably
less than the rate of accidental bit decay, so a different emphasis in
design seems to be required.
Our current thinking on the design of this survival system is that it
should be organized in independent components. A first component is a data
reader, associated with each replica, that continually compares that
replica's current data with a record of yesterday's data. A second
component is a set of cross-checkers that compare the replicas. A third
component performs any needed repairs to the data, based on the discoveries
of the first two. The objective of the project is to create a design for
these components that is so straightforwardly simple that one can make a
credible claim that it will work correctly. Tukovinit documented this
multi-component model, and Yoav Yerushalmi and Mitchell Charity are now
refining and prototyping the model. A major point of departure (and
simplification) of the model they are currently working with is that a
storage server should be immutable except for addition of new files of
data.
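As an illustration of the reader component, the sketch below (in Python)
compares a replica's current files against a saved manifest of the previous
day's digests and reports anything that has changed or disappeared. The file
layout, manifest format, and choice of digest are assumptions made for the
example, not part of the actual design.

    # Illustrative sketch of the per-replica "reader": compare today's data
    # against a record of yesterday's digests.  Layout and digest choice are
    # assumptions for the example only.
    import hashlib, json, os

    def digest(path):
        """Return a hex digest of one file's contents."""
        h = hashlib.sha256()
        with open(path, "rb") as f:
            for block in iter(lambda: f.read(1 << 16), b""):
                h.update(block)
        return h.hexdigest()

    def scan(root):
        """Map every file under root to its digest."""
        result = {}
        for dirpath, _, names in os.walk(root):
            for name in names:
                path = os.path.join(dirpath, name)
                result[os.path.relpath(path, root)] = digest(path)
        return result

    def check_replica(root, manifest_path):
        """Report files that decayed or vanished since the last scan."""
        today = scan(root)
        with open(manifest_path) as f:
            yesterday = json.load(f)
        for name, old in yesterday.items():
            if name not in today:
                print("missing:", name)
            elif today[name] != old:
                print("changed:", name)
        for name in today:
            if name not in yesterday:
                print("added:", name)   # additions are expected; noted for the cross-checker
        with open(manifest_path, "w") as f:
            json.dump(today, f)         # today's scan becomes tomorrow's baseline

In this picture a cross-checker would run the same scan on two replicas and
compare their manifests, and the repair component would act on whatever
discrepancies the first two components report.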
B.2. Minimal Perfect Hashing. Yuk Ho completed a Master of Engineering
thesis exploring the idea of using minimal perfect hashing to speed up
lookups in large main-memory (RAM) indexes. With standard hash indexing, several
different keys may hash to the same index position, and some mechanism
(buckets, open-addressing, etc.) that involves multiple probes of the index
is needed to separate them. Perfect hashing is a technique for
precomputing a hash function that, for a given set of keys, produces a
unique index for each different key. With minimal perfect hashing, the
index set is, in addition, compact. The advantage of using minimal perfect
hashing, then, is that one need probe the hash table only once for any
lookup. This idea has been explored theoretically, and some experimental
implementations have been done, but it has not, as far as we can tell, been
tried in practice. Ho located one of these experimental implementations,
upgraded it to production quality, installed it in the search engine of the
Library 2000 testbed system, and did a series of performance measurements.
His conclusion was disappointing: the time required to invoke the
precomputed hash function was comparable to or greater than the time
required to perform multiple probes in an ordinary hash table, so there was
essentially no performance advantage to perfect hashing, at least within
the range of hardware configurations that we anticipate will be useful in a
library. A full report on this result appears in his M.Eng. thesis.
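For readers unfamiliar with the technique, the toy sketch below shows the
property being exploited: a brute-force search for a salt under which a
small, fixed key set hashes one-to-one onto slots 0..n-1, after which any
lookup needs exactly one probe. It is only an illustration of the idea, not
Ho's implementation, which uses a genuinely scalable construction.

    # Toy illustration of minimal perfect hashing for a small, fixed key set.
    # The brute-force salt search below only demonstrates the property being
    # exploited (one probe per lookup, a table of exactly n slots); it is not
    # a scalable construction.
    import hashlib

    def _h(salt, key, n):
        data = ("%d:%s" % (salt, key)).encode()
        return int.from_bytes(hashlib.md5(data).digest()[:8], "big") % n

    def build(keys):
        """Find a salt that maps the keys onto 0..n-1 with no collisions."""
        n = len(keys)
        for salt in range(1000000):
            if len({_h(salt, k, n) for k in keys}) == n:   # bijection found
                table = [None] * n
                for k in keys:
                    table[_h(salt, k, n)] = k
                return salt, table
        raise RuntimeError("no salt found; use a real MPH construction")

    def lookup(salt, table, key):
        """Exactly one probe; a mismatch means the key is not in the set."""
        slot = _h(salt, key, len(table))
        return slot if table[slot] == key else None

    salt, table = build(["alavi", "ho", "tukovinit", "hylton", "adar"])
    print(lookup(salt, table, "ho"), lookup(salt, table, "saltzer"))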
B.3. URN to URL via DNS. Ali Alavi completed a Master of Engineering
thesis that explored the following idea: rather than adding a RETRIEVAL
field to the bibliographic record, consider the ID field to be a universal
name for the desired document, and present that name to a network name
service, which would return the temporally fragile information needed for
retrieval. With help from Mitchell Charity, Alavi designed and implemented
this system using the Internet Domain Name Service (DNS) to associate each
ID as defined in RFC-1357 with an Internet Uniform Resource Locator (URL),
a standard form for retrieval information that was developed elsewhere for
the World-Wide Web. He primed a DNS server with a set of ID-URL pairs for
several thousand currently on-line technical reports of the five
universities in the CS-TR group.
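The general pattern, though not Alavi's exact naming scheme or record
types, is sketched below: the document ID is encoded as a DNS name and the
name service is asked for a record carrying the URL. The sketch assumes the
third-party dnspython package and a hypothetical zone name.

    # Hedged sketch of the ID-to-URL lookup pattern, not Alavi's actual scheme.
    # Assumes the third-party dnspython package and a hypothetical zone
    # (cstr-ids.example.org) whose TXT records carry retrieval URLs.
    import dns.resolver

    def id_to_dns_name(doc_id, zone="cstr-ids.example.org"):
        """Encode an RFC-1357 ID such as 'MIT/LCS/TR-123' as a DNS name."""
        labels = [part.lower().replace(" ", "-")
                  for part in reversed(doc_id.split("/"))]
        return ".".join(labels) + "." + zone

    def lookup_url(doc_id):
        """Return the first URL registered for the document, or None."""
        try:
            answers = dns.resolver.resolve(id_to_dns_name(doc_id), "TXT")
        except (dns.resolver.NXDOMAIN, dns.resolver.NoAnswer):
            return None
        for rdata in answers:
            text = b"".join(rdata.strings).decode()
            if text.startswith(("http://", "ftp://")):
                return text
        return None

    print(lookup_url("MIT/LCS/TR-123"))   # a URL if registered, otherwise None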
To demonstrate the usefulness of this approach, Alavi's thesis went one
step further, modifying the user interface of the Library 2000 catalog
system so that whenever a search result includes a technical report, the
system submits the ID found in the catalog to the DNS ID-URL server. If the
server returns a URL, the modified user interface pops up an extra active
button on the user's screen, giving the user the ability to retrieve the
on-line version of the report.
This demonstration was successful, and it is now the basis for discussions
with the CS-TR group on the relation between IDs and "handles," a document
registration system proposed by CNRI. We are also monitoring the work of
the Internet Engineering Task Force on proposals for Uniform Resource Names
and Uniform Resource Characteristics, two closely-related ideas.
B.4. Citation lookup. Eytan Adar developed a program to take technical
report citations as found in typical published documents and precisely
identify the corresponding documents by performing a catalog search. This
problem, which seems easy at first, is severely complicated by the
existence of varying citation styles, multiple forms for author and
publisher names, incorrect and incomplete citations, and other similar
real-world variations. As a result, heuristics seem to be an essential
component of any automated lookup process. Adar's program selects several
independently chosen random sets of words from the original citation, performs a
catalog lookup for each such set, and compares the results. It presents to
the user the document or documents that matched several of the independent
sets. The system has proven to be surprisingly effective for such a simple
heuristic. When applied to citations of technical reports known to be in
the reading room catalog, for about 98% of the citations it presents a list
of candidates in which the correct report is the top choice. In half the
remaining citations, the correct report is in the list, but not at the top.
This leaves a residue of about 1% of citations in which the algorithm
fails to correctly discover the catalog entry.
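A minimal sketch of this word-sampling heuristic follows; catalog_search is
a stand-in for the real catalog query interface and is assumed for the
example.

    # Sketch of the word-sampling heuristic described above.  catalog_search()
    # is a stand-in for the real catalog query interface (it takes a list of
    # words and returns matching record identifiers) and is assumed, not real.
    import random
    from collections import Counter

    def candidate_documents(citation, catalog_search,
                            samples=5, words_per_sample=4):
        """Rank catalog records by how many independent word samples matched."""
        words = [w.strip(".,()\"'") for w in citation.split() if len(w) > 3]
        votes = Counter()
        for _ in range(samples):
            subset = random.sample(words, min(words_per_sample, len(words)))
            for record_id in catalog_search(subset):
                votes[record_id] += 1
        # Present the document(s) that matched several of the independent sets.
        return [doc for doc, count in votes.most_common() if count >= 2]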
B.5. Text-Image maps.
Adar's citation lookup system is especially interesting when coupled with
the text-image map demonstrated this quarter by Jeremy Hylton. This map
relates scanned images to the corresponding ASCII text. The goal here is
to allow two symmetric features. In one direction the user selects a region
of interest in a scanned image that contains, for example, a citation to
another document or a page number. The image-text map allows the system to
identify the ASCII text that corresponds to the selected region. In the
case of a citation, the system can then invoke Adar's citation lookup
program to locate the cited document. Or, if the selected region contains a
page number, the system can, using the scanned document description record
mentioned earlier, identify which page image corresponds to that page
number and display it.
In the other direction, when a user initiates a catalog search using
particular words or phrases and asks to display a document that contains
them, the text-image map allows the system to locate the appropriate page
of that document and highlight the region of the image that contains the
words or phrases of the original search.
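The two directions can be summarized with a small sketch; the map entries
(word, page, bounding box) are an assumed representation chosen for
illustration, not Hylton's actual encoding.

    # Illustrative text-image map: each entry ties a word of the ASCII text to
    # the page and bounding box where it appears in the scanned image.  The
    # record layout is assumed for illustration, not Hylton's representation.
    from dataclasses import dataclass
    from typing import List, Optional, Tuple

    Box = Tuple[int, int, int, int]        # left, top, right, bottom (pixels)

    @dataclass
    class Entry:
        word: str
        page: int
        box: Box

    def text_in_region(entries: List[Entry], page: int, region: Box) -> str:
        """Image -> text: recover the words whose boxes fall inside a selection."""
        left, top, right, bottom = region
        hits = [e for e in entries
                if e.page == page
                and e.box[0] >= left and e.box[1] >= top
                and e.box[2] <= right and e.box[3] <= bottom]
        return " ".join(e.word for e in hits)

    def region_for_word(entries: List[Entry], word: str) -> Optional[Tuple[int, Box]]:
        """Text -> image: find the page and box of a search word, for highlighting."""
        for e in entries:
            if e.word.lower() == word.lower():
                return e.page, e.box
        return None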
Hylton worked out methods of representing text-to-bounding box mapping
information and did initial demonstrations of both page number lookup and
citation search. He also looked into exploiting a feature of some OCR
software, such as Xerox ScanWorX, that returns not just ASCII text but also
the coordinates of the bounding boxes in which it found the text. To
develop the demonstration, he attempted to use the IMAGEMAP extension of
HTML (one of the building blocks of the World-Wide Web) and in doing so
identified limitations and developed a proposal for improved mouse-handling,
which became the subject of a position paper for the World-Wide Web
conference in Geneva in June.
B.6. Fax Service. Using the Internet Remote-printing project, which
allows free faxing over the Internet, Eytan Adar wrote a Mosaic interface
to a script that would allow us to fax technical reports. However,
availability of this interface was placed on hold because it became apparent
that the remote-printing project is not yet ready to handle documents as
large as the ones we produce.
B.7. WAIS Service. Eytan Adar set up a WAIS server that contains a
database driven from RFC-1357 bibliographic records and that converts the
records into something that "looks good" in World-Wide Web viewers. He set
up the server to update itself automatically every day from the accessible
bibliographic records of the five members of the CS-TR group. (A related
program was also written that can convert RFC-1357 bibliographic records
into BibTeX and Bibrefer formats.)
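The bibliographic-record conversion mentioned parenthetically can be
sketched as follows; the field names follow the "TAG:: value" layout of
RFC-1357, but the mapping into BibTeX is illustrative rather than a
description of the actual converter.

    # Sketch of turning an RFC-1357 bibliographic record into BibTeX.  Field
    # selection and the BibTeX entry type are illustrative; the actual
    # converter may differ in detail.
    def parse_rfc1357(text):
        """Parse 'TAG:: value' lines into a dict; repeated AUTHOR tags accumulate."""
        record = {}
        for line in text.splitlines():
            if "::" not in line:
                continue
            tag, _, value = line.partition("::")
            tag, value = tag.strip(), value.strip()
            if tag == "AUTHOR":
                record.setdefault("AUTHOR", []).append(value)
            else:
                record[tag] = value
        return record

    def to_bibtex(record):
        authors = " and ".join(record.get("AUTHOR", []))
        key = record.get("ID", "unknown").replace("/", ":")
        return ("@techreport{%s,\n"
                "  author = {%s},\n"
                "  title = {%s},\n"
                "  institution = {%s},\n"
                "  year = {%s}\n"
                "}" % (key, authors,
                       record.get("TITLE", ""),
                       record.get("ORGANIZATION", ""),
                       record.get("DATE", "")[-4:]))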
B.8. World-Wide Web Demonstration area. Jeremy Hylton created a
World-Wide-Web home page, organized older materials of the research group,
and inspired the other members of the group to create a collection under
that home page that provides a showplace and distribution center for our
research work. The Library 2000 home page can be found by following the
path M.I.T. --> Laboratory for Computer Science --> Library 2000 group.
C. Papers, Theses, and Talks

Papers

Charity, Mitchell N. "Multiple Standards? No Problem." Proceedings of
Digital Libraries '94 (June 19-21, 1994), College Station, Texas, pages
203-204.

Hylton, Jeremy. "Text-Image Maps for Document Delivery on the Web." First
International Conference on the World-Wide Web (May 25-27, 1994), CERN,
Geneva.

Theses

Ali Alavi. The Use of the Domain Name System for Dynamic References in an
Online Library. Master of Engineering thesis, May, 1994.

Yuk Ho. Application of Minimal Perfect Hashing in Main Memory Indexing.
Master of Engineering thesis, June, 1994.

Komkit Tukovinit. Fault Detection in L2000 Storage-Replication Service.
Advanced Undergraduate Project, May, 1994.

Talks

Anderson, T. Gregory. "The Computer Science Technical Report Project."
Coalition for Networked Information meeting, Washington, D.C., April 5,
1994.

D. Meetings

Keith Glavash, Mitchell Charity, and Jack Eisan visited Cornell University
on May 3 and 4, 1994, to meet with the participants there in the CS-TR
project, as well as other staff involved with various scanning projects on
the Cornell campus, and to gather comments on their experience with the
work process and tools employed.

Greg Anderson, Jerry Saltzer, and Mitchell Charity attended a one-day
meeting of the CS-TR project participants at Carnegie-Mellon University in
Pittsburgh on May 19, 1994.