To: Bob Kahn, CNRI
Subject: Quarterly Progress Report, April 1--June 30, 1994
Delivery of the hardware under a grant from the IBM Corporation was completed, with the arrival of four RS/6000 220 desktop workstations, to be used by students and staff for program development.
Document Services took delivery of a second scanner, a Fujitsu 3096G. This device is nominally capable of much faster scanning than the previously acquired Hewlett-Packard scanner. However, as of the end of the quarter a number of hardware and software compatibility problems had emerged and were not yet resolved. The two most important are that the Fujitsu's white-level calibration changes when switching from the flatbed to the automatic document feeder, and that transmission of scanned data to the controlling computer is substantially slower than the scanning rate. We are in contact with the vendors of both the hardware and the software to resolve these problems, and are also investigating alternate suppliers of scanning software.
The controlling computer, a Quadra 840av, was enhanced by adding a second SCSI channel, an FWB Jackhammer card. The large-capacity disk drives and the DAT drive use this high-performance channel, while the two scanners use the native SCSI channel. This change, together with software improvements from Apple, makes overlapped I/O operations possible among the disks and between disks and scanners.
The disk SCSI chain was shortened by using special "stackits" designed by APS Technologies for use with their stacked drive enclosures, thus minimizing the potential for noise interference. Software upgrades and enhancements were made by Jerry Saltzer and Andrew Kass to correct memory usage problems and provide other system improvements.
For the production scanning operation, the group established several scanning targets (supplementary objects to be scanned in addition to the pages of a document). These include technical calibration and quality control targets, blank page replacements, and funding acknowledgement and scanning agent identification pages. A text file describing the scanning conditions will also be created for each scanned document. Jerry Saltzer proposed criteria for the content of this file for review by the group.
Testing of workflow and of the effectiveness of current software tools was performed by Document Services personnel in conjunction with the staff at LCS. Methods for file compression, checksumming for file integrity, and file transmission were tested. Image files were also backed up to DAT tape, both to test the procedures and tools to be employed for archiving and to familiarize Document Services personnel with them.
By the end of the quarter, a total of about 40 technical reports, totalling 3900 pages and nearly 40 Gigabytes of raw data, had been scanned at 300 dpi with 8-bit greyscale, starting with second-generation originals. As experience is gained and the compatibility problems are resolved, a step up in scanning resolution to 400 dpi, 8-bit, and to use of first-generation originals should occur some time early in the next quarter.
In a related but slightly different area, following suggestions of the M.I.T. intellectual property counsel and in consultation with the publications office of the Laboratory for Computer Science, we developed a one-page alerting notice and permission-granting form for authors of technical reports. This notice points out the implications of electronic distribution, reminds authors of their rights and responsibilities concerning intellectual property, and secures their permission to publish and distribute their technical report both on paper and electronically. The publications office of LCS began using this notice for newly-published technical reports in May, 1994; there is currently no intent to track down authors of previously-published technical reports and ask them to read and sign the form.
B. Technical/Research

This quarter, three new students joined the group, and three previous
student members completed academic theses. Several other projects got to
the point of demonstration.
B.1. Data integrity checking. Komkit Tukovinit completed an Advanced
Undergraduate Project (under a recent revision of the M.I.T. EECS
department curriculum, the Advanced Undergraduate Project replaces the
undergraduate thesis) to begin developing techniques for integrity checking
to help ensure long-term survival of massive quantities of on-line data.
The basic survival concept we are exploring is to create geographically
separated complete replicas of the data. Most research on replication
concentrates on atomic, reliable update rather than on long-term survival
of rarely updated data. The most common operations on library data are to
read and to add new data; the rate of intentional modification is probably
less than the rate of accidental bit decay, so a different emphasis in
design seems to be required.
Our current thinking on the design of this survival system is that it
should be organized in independent components. A first component is a data
reader, associated with each replica, that continually compares that
replica's current data with a record of yesterday's data. A second
component is a set of cross-checkers that compare the replicas. A third
component performs any needed repairs to the data, based on the discoveries
of the first two. The objective of the project is to create a design for
these components that is so straightforwardly simple that one can make a
credible claim that it will work correctly. Tukovinit documented this
multi-component model, and Yoav Yerushalmi and Mitchell Charity are now
refining and prototyping the model. A major point of departure (and
simplification) of the model they are currently working with is that a
storage server should be immutable except for addition of new files of
data.
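As an illustration of the reader component, the sketch below (in Python)
compares a replica's current files against a saved manifest of the previous
day's digests and reports anything that has changed or disappeared. The file
layout, manifest format, and choice of digest are assumptions made for the
example, not part of the actual design.

    # Illustrative sketch of the per-replica "reader": compare today's data
    # against a record of yesterday's digests.  Layout and digest choice are
    # assumptions for the example only.
    import hashlib, json, os

    def digest(path):
        """Return a hex digest of one file's contents."""
        h = hashlib.sha256()
        with open(path, "rb") as f:
            for block in iter(lambda: f.read(1 << 16), b""):
                h.update(block)
        return h.hexdigest()

    def scan(root):
        """Map every file under root to its digest."""
        result = {}
        for dirpath, _, names in os.walk(root):
            for name in names:
                path = os.path.join(dirpath, name)
                result[os.path.relpath(path, root)] = digest(path)
        return result

    def check_replica(root, manifest_path):
        """Report files that decayed or vanished since the last scan."""
        today = scan(root)
        with open(manifest_path) as f:
            yesterday = json.load(f)
        for name, old in yesterday.items():
            if name not in today:
                print("missing:", name)
            elif today[name] != old:
                print("changed:", name)
        for name in today:
            if name not in yesterday:
                print("added:", name)   # additions are expected; noted for the cross-checker
        with open(manifest_path, "w") as f:
            json.dump(today, f)         # today's scan becomes tomorrow's baseline

In this picture a cross-checker would run the same scan on two replicas and
compare their manifests, and the repair component would act on whatever
discrepancies the first two components report.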
B.2. Minimal Perfect Hashing. Yuk Ho completed a Master of Engineering
thesis exploring the idea of using minimal perfect hashing to speed up
lookups in large main-memory (RAM) indexes. With standard hash indexing, several
different keys may hash to the same index position, and some mechanism
(buckets, open-addressing, etc.) that involves multiple probes of the index
is needed to separate them. Perfect hashing is a technique for
precomputing a hash function that, for a given set of keys, produces a
unique index for each different key. With minimal perfect hashing, the
index set is, in addition, compact. The advantage of using minimal perfect
hashing, then, is that one need probe the hash table only once for any
lookup. This idea has been explored theoretically, and some experimental
implementations have been done, but it has not, as far as we can tell, been
tried in practice. Ho located one of these experimental implementations,
upgraded it to production quality, installed it in the search engine of the
Library 2000 testbed system, and did a series of performance measurements.
His conclusion was disappointing: the time required to invoke the
precomputed hash function was comparable to or greater than the time
required to perform multiple probes in an ordinary hash table, so there was
essentially no performance advantage to perfect hashing, at least within
the range of hardware configurations that we anticipate will be useful in a
library. A full report on this result appears in his M.Eng. thesis.
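For readers unfamiliar with the technique, the toy sketch below shows the
property being exploited: a brute-force search for a salt under which a
small, fixed key set hashes one-to-one onto slots 0..n-1, after which any
lookup needs exactly one probe. It is only an illustration of the idea, not
Ho's implementation, which uses a genuinely scalable construction.

    # Toy illustration of minimal perfect hashing for a small, fixed key set.
    # The brute-force salt search below only demonstrates the property being
    # exploited (one probe per lookup, a table of exactly n slots); it is not
    # a scalable construction.
    import hashlib

    def _h(salt, key, n):
        data = ("%d:%s" % (salt, key)).encode()
        return int.from_bytes(hashlib.md5(data).digest()[:8], "big") % n

    def build(keys):
        """Find a salt that maps the keys onto 0..n-1 with no collisions."""
        n = len(keys)
        for salt in range(1000000):
            if len({_h(salt, k, n) for k in keys}) == n:   # bijection found
                table = [None] * n
                for k in keys:
                    table[_h(salt, k, n)] = k
                return salt, table
        raise RuntimeError("no salt found; use a real MPH construction")

    def lookup(salt, table, key):
        """Exactly one probe; a mismatch means the key is not in the set."""
        slot = _h(salt, key, len(table))
        return slot if table[slot] == key else None

    salt, table = build(["alavi", "ho", "tukovinit", "hylton", "adar"])
    print(lookup(salt, table, "ho"), lookup(salt, table, "saltzer"))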
B.3. URN to URL via DNS. Ali Alavi completed a Master of Engineering
thesis that explored the following idea: rather than adding a RETRIEVAL
field to the bibliographic record, consider the ID field to be a universal
name for the desired document, and present that name to a network name
service, which would return the temporally fragile information needed for
retrieval. With help from Mitchell Charity, Alavi designed and implemented
this system using the Internet Domain Name Service (DNS) to associate each
ID as defined in RFC-1357 with an Internet Uniform Resource Locator (URL),
a standard form for retrieval information that was developed elsewhere for
the World-Wide Web. He primed a DNS server with a set of ID-URL pairs for
several thousand currently on-line technical reports of the five
universities in the CS-TR group.
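The general pattern, though not Alavi's exact naming scheme or record
types, is sketched below: the document ID is encoded as a DNS name and the
name service is asked for a record carrying the URL. The sketch assumes the
third-party dnspython package and a hypothetical zone name.

    # Hedged sketch of the ID-to-URL lookup pattern, not Alavi's actual scheme.
    # Assumes the third-party dnspython package and a hypothetical zone
    # (cstr-ids.example.org) whose TXT records carry retrieval URLs.
    import dns.resolver

    def id_to_dns_name(doc_id, zone="cstr-ids.example.org"):
        """Encode an RFC-1357 ID such as 'MIT/LCS/TR-123' as a DNS name."""
        labels = [part.lower().replace(" ", "-")
                  for part in reversed(doc_id.split("/"))]
        return ".".join(labels) + "." + zone

    def lookup_url(doc_id):
        """Return the first URL registered for the document, or None."""
        try:
            answers = dns.resolver.resolve(id_to_dns_name(doc_id), "TXT")
        except (dns.resolver.NXDOMAIN, dns.resolver.NoAnswer):
            return None
        for rdata in answers:
            text = b"".join(rdata.strings).decode()
            if text.startswith(("http://", "ftp://")):
                return text
        return None

    print(lookup_url("MIT/LCS/TR-123"))   # a URL if registered, otherwise None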
To demonstrate the usefulness of this approach, Alavi's thesis went one
step further, modifying the user interface of the Library 2000 catalog
system so that whenever a search result includes a technical report, the
system submits the ID found in the catalog to the DNS ID-URL server. If the
server returns a URL, the modified user interface pops up an extra active
button on the user's screen, giving the user the ability to retrieve the
on-line version of the report.
This demonstration was successful, and it is now the basis for discussions
with the CS-TR group on the relation between IDs and "handles," a document
registration system proposed by CNRI. We are also monitoring the work of
the Internet Engineering Task Force on proposals for Uniform Resource Names
and Uniform Resource Characteristics, two closely-related ideas.
B.4. Citation lookup. Eytan Adar developed a program to take technical
report citations as found in typical published documents and precisely
identify the corresponding documents by performing a catalog search. This
problem, which seems easy at first, is severely complicated by the
existence of varying citation styles, multiple forms for author and
publisher names, incorrect and incomplete citations, and other similar
real-world variations. As a result, heuristics seem to be an essential
component of any automated lookup process. Adar's program selects several
independently chosen random sets of words from the original citation, performs a
catalog lookup for each such set, and compares the results. It presents to
the user the document or documents that matched several of the independent
sets. The system has proven to be surprisingly effective for such a simple
heuristic. When applied to citations of technical reports known to be in
the reading room catalog, for about 98% of the citations it presents a list
of candidates in which the correct report is the top choice. In half the
remaining citations, the correct report is in the list, but not at the top.
This leaves a residue of about 1% of citations in which the algorithm
fails to correctly discover the catalog entry.
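A minimal sketch of this word-sampling heuristic follows; catalog_search is
a stand-in for the real catalog query interface and is assumed for the
example.

    # Sketch of the word-sampling heuristic described above.  catalog_search()
    # is a stand-in for the real catalog query interface (it takes a list of
    # words and returns matching record identifiers) and is assumed, not real.
    import random
    from collections import Counter

    def candidate_documents(citation, catalog_search,
                            samples=5, words_per_sample=4):
        """Rank catalog records by how many independent word samples matched."""
        words = [w.strip(".,()\"'") for w in citation.split() if len(w) > 3]
        votes = Counter()
        for _ in range(samples):
            subset = random.sample(words, min(words_per_sample, len(words)))
            for record_id in catalog_search(subset):
                votes[record_id] += 1
        # Present the document(s) that matched several of the independent sets.
        return [doc for doc, count in votes.most_common() if count >= 2]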
B.5. Text-Image maps.
Adar's citation lookup system is especially interesting when coupled with
the text-image map demonstrated this quarter by Jeremy Hylton. This map
relates scanned images to the corresponding ASCII text. The goal here is
to allow two symmetric features. In one direction the user selects a region
of interest in a scanned image that contains, for example, a citation to
another document or a page number. The image-text map allows the system to
identify the ASCII text that corresponds to the selected region. In the
case of a citation, the system can then invoke Adar's citation lookup
program to locate the cited document. Or, if the selected region contains a
page number, the system can, using the scanned document description record
mentioned earlier, identify which page image corresponds to that page
number and display it.
In the other direction, when a user initiates a catalog search using
particular words or phrases and asks to display a document that contains
them, the text-image map allows the system to locate the appropriate page
of that document and highlight the region of the image that contains the
words or phrases of the original search.
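The two directions can be summarized with a small sketch; the map entries
(word, page, bounding box) are an assumed representation chosen for
illustration, not Hylton's actual encoding.

    # Illustrative text-image map: each entry ties a word of the ASCII text to
    # the page and bounding box where it appears in the scanned image.  The
    # record layout is assumed for illustration, not Hylton's representation.
    from dataclasses import dataclass
    from typing import List, Optional, Tuple

    Box = Tuple[int, int, int, int]        # left, top, right, bottom (pixels)

    @dataclass
    class Entry:
        word: str
        page: int
        box: Box

    def text_in_region(entries: List[Entry], page: int, region: Box) -> str:
        """Image -> text: recover the words whose boxes fall inside a selection."""
        left, top, right, bottom = region
        hits = [e for e in entries
                if e.page == page
                and e.box[0] >= left and e.box[1] >= top
                and e.box[2] <= right and e.box[3] <= bottom]
        return " ".join(e.word for e in hits)

    def region_for_word(entries: List[Entry], word: str) -> Optional[Tuple[int, Box]]:
        """Text -> image: find the page and box of a search word, for highlighting."""
        for e in entries:
            if e.word.lower() == word.lower():
                return e.page, e.box
        return None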
Hylton worked out methods of representing text-to-bounding box mapping
information and did initial demonstrations of both page number lookup and
citation search. He also looked into exploiting a feature of some OCR
software, such as Xerox ScanWorX, that returns not just ASCII text but also
the coordinates of the bounding boxes in which it found the text. To
develop the demonstration, he attempted to use the IMAGEMAP extension of
HTML (one of the building blocks of the World-Wide Web) and in doing so
identified limitations and developed a proposal for improved mouse-handling,
which became the subject of a position paper for the World-Wide Web
conference in Geneva in June.
B.6. Fax Service. Using the Internet Remote-printing project, which
allows free faxing over the Internet, Eytan Adar wrote a Mosaic interface
to a script that would allow us to fax technical reports. However,
availability of this interface was placed on hold because it became apparent
that the remote-printing project is not yet ready to handle documents as
large as the ones we produce.
B.7. WAIS Service. Eytan Adar set up a WAIS server that contains a
database driven from RFC-1357 bibliographic records and that converts the
records into something that "looks good" in World-Wide Web viewers. He set
up the server to update itself automatically every day from the accessible
bibliographic records of the five members of the CS-TR group. (A related
program was also written that can convert RFC-1357 bibliographic records
into BibTeX and Bibrefer formats.)
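The bibliographic-record conversion mentioned parenthetically can be
sketched as follows; the field names follow the "TAG:: value" layout of
RFC-1357, but the mapping into BibTeX is illustrative rather than a
description of the actual converter.

    # Sketch of turning an RFC-1357 bibliographic record into BibTeX.  Field
    # selection and the BibTeX entry type are illustrative; the actual
    # converter may differ in detail.
    def parse_rfc1357(text):
        """Parse 'TAG:: value' lines into a dict; repeated AUTHOR tags accumulate."""
        record = {}
        for line in text.splitlines():
            if "::" not in line:
                continue
            tag, _, value = line.partition("::")
            tag, value = tag.strip(), value.strip()
            if tag == "AUTHOR":
                record.setdefault("AUTHOR", []).append(value)
            else:
                record[tag] = value
        return record

    def to_bibtex(record):
        authors = " and ".join(record.get("AUTHOR", []))
        key = record.get("ID", "unknown").replace("/", ":")
        return ("@techreport{%s,\n"
                "  author = {%s},\n"
                "  title = {%s},\n"
                "  institution = {%s},\n"
                "  year = {%s}\n"
                "}" % (key, authors,
                       record.get("TITLE", ""),
                       record.get("ORGANIZATION", ""),
                       record.get("DATE", "")[-4:]))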
B.8. World-Wide Web Demonstration area. Jeremy Hylton created a
World-Wide-Web home page, organized older materials of the research group,
and inspired the other members of the group to create a collection under
that home page that provides a showplace and distribution center for our
research work. The Library 2000 home page can be found by following the
path M.I.T. --> Laboratory for Computer Science --> Library 2000 group.
C. Papers, Theses, and Talks

Papers

Charity, Mitchell N. "Multiple Standards? No Problem." Proceedings of
Digital Libraries '94 (June 19-21, 1994), College Station, Texas, pages
203-204.

Hylton, Jeremy. "Text-Image Maps for Document Delivery on the Web." First
International Conference on the World-Wide Web (May 25-27, 1994), CERN,
Geneva.

Theses

Ali Alavi. The Use of the Domain Name System for Dynamic References in an
Online Library. Master of Engineering thesis, May, 1994.

Yuk Ho. Application of Minimal Perfect Hashing in Main Memory Indexing.
Master of Engineering thesis, June, 1994.

Komkit Tukovinit. Fault Detection in L2000 Storage-Replication Service.
Advanced Undergraduate Project, May, 1994.

Talks

Anderson, T. Gregory. "The Computer Science Technical Report Project."
Coalition for Networked Information meeting, Washington, D.C., April 5,
1994.

D. Meetings

Keith Glavash, Mitchell Charity, and Jack Eisan visited Cornell University
on May 3 and 4, 1994, to meet with the participants there in the CS-TR
project, as well as other staff involved with various scanning projects on
the Cornell campus, and to gather comments on their experience with the
work process and tools employed.

Greg Anderson, Jerry Saltzer, and Mitchell Charity attended a one-day
meeting of the CS-TR project participants at Carnegie-Mellon University in
Pittsburgh on May 19, 1994.