Projects and Demonstrations

The technology of on-line storage, display, and communications will, by the year 2000, make it economically possible to place the entire contents of a library on-line, accessible from computer workstations located anywhere. The goal of Library 2000 is to understand and try out the system engineering required to fabricate such a future library system from the underlying technologies.
--From the Library 2000 Prospectus, October 31, 1991.


Reading room catalog. The first demonstrable project of the Library 2000 Group was to develop an on-line catalog with a Telnet interface (and later a World-Wide Web interface) for the LCS/AI reading room. This catalog has two interesting features:

Stateless protocol. The user interface and the search engine are a client and a server, separated by a network connection with a defined protocol which, in contrast with Z39.50, is stateless. The separation allows the search engine and the user interface to be replaced independently. Being stateless simplifies error recovery and replication.
RAM-based search engine. The search engine is designed to exploit modern large, cheap, random access memory. The initial implementation operates on a machine with 0.5 GBytes of RAM, and the search engine is designed to scale up at least an order of magnitude above that.

The catalog is now a production system.
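The stateless, memory-resident design described above can be illustrated with a minimal sketch. It is not the reading room catalog's actual protocol or engine: the `Request` fields, the `handle` function, and the three sample records are all assumptions made for illustration. The point is only that each request carries everything the server needs (including the paging position), so any replica holding the same in-memory index can answer it.

```python
from dataclasses import dataclass

# Hypothetical sample records; a real catalog would load these into RAM at startup.
RECORDS = {
    1: "stateless protocols for library catalogs",
    2: "memory resident search engines",
    3: "library search engines and protocols",
}


def build_index(records):
    """Build an in-memory inverted index: word -> set of record ids."""
    index = {}
    for rec_id, text in records.items():
        for word in text.split():
            index.setdefault(word, set()).add(rec_id)
    return index


INDEX = build_index(RECORDS)


@dataclass(frozen=True)
class Request:
    """A self-contained query (assumed shape): no session id, no cursor held by the server."""
    terms: tuple
    offset: int = 0
    limit: int = 10


def handle(request, index=INDEX):
    """Answer a request from the request and the index alone."""
    if not request.terms:
        return []
    sets = [index.get(term, set()) for term in request.terms]
    hits = sorted(set.intersection(*sets))
    # Paging state travels in the request, so a retry or a different replica
    # produces the same answer.
    return hits[request.offset:request.offset + request.limit]
```

Because `handle` keeps nothing between calls, a client that loses its connection can simply resend the same `Request`, possibly to a different server.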

MIT Library catalog (snapshot). To demonstrate the power of the search engine, the group obtained a snapshot of the M.I.T. Library catalog. Even though that snapshot is now years out of date, the search is so fast and flexible that it remains available as one of the optional collections of the reading room catalog described in the previous item.

Scanning of Computer Science Technical Reports. The next major activity of the group was to initiate high-resolution scanning of the accumulated collection of Technical Reports and Technical Memoranda of M.I.T. Project MAC, the Laboratory for Computer Science, and the Artificial Intelligence Laboratory. This work has several aspects:

Document organization. We developed a standard document scanning record to capture information needed by a browser, and an archiving specification to preserve the scanned images on CD-ROM.
1.2 TByte archive. Over a period of three years we scanned nearly 1,000 reports, about half the total that have accumulated since 1963. These images, totaling about 1.2 Terabytes of data, have been preserved on recordable CDs.
FTP service. We processed the scanned images down to two forms suitable for on-line distribution, and made them available on-line via FTP.
Cooperative CS-TR project. As participants in the Computer Science Technical Report (CS-TR) project we helped develop a simple bibliographic record format, created bibliographic records in that format for all existing Project MAC, LCS, and AI reports, and made both the records and the scanned images available via the National Computer Science Technical Report Library (NCSTRL) using the Dienst protocol.
MIME-based document standard. Andrew Kass developed An Interchange Standard and System for Browsing On-Line Documents using MIME as a starting point, as his Master of Engineering thesis.

Heuristic nearest-match search. An important goal of Library 2000 is that a person reading a document on-line be able to circle a citation, click on it, and expect the cited document to appear in an adjacent window. One key problem in implementing this idea is how to identify a document from another author's citation. Eytan Adar developed and demonstrated a remarkably powerful algorithm for doing nearest-match searches using a Boolean (exact match) search engine.
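One general way to get approximate matching out of an exact-match engine is query relaxation: ask for all of the citation's words, and if nothing comes back, retry with progressively fewer words. The sketch below shows that idea only. It is not Adar's algorithm, whose details are not given here; `boolean_and`, `nearest_match`, the `max_dropped` parameter, and the tiny catalog are hypothetical.

```python
from itertools import combinations

# Hypothetical catalog: document id -> title words.
CATALOG = {
    "tr-1": "a stateless protocol for library catalogs",
    "tr-2": "text image maps for scanned documents",
    "tr-3": "identifying and merging related bibliographic records",
}


def boolean_and(words, catalog=CATALOG):
    """Stand-in for an exact-match engine: ids of records containing every word."""
    return {doc_id for doc_id, title in catalog.items()
            if all(word in title.split() for word in words)}


def nearest_match(citation, max_dropped=2):
    """Relax an AND query by dropping up to max_dropped words until something matches."""
    words = sorted(set(citation.lower().split()))
    for dropped in range(max_dropped + 1):
        keep = len(words) - dropped
        if keep < 1:
            break
        hits = set()
        # Try every way of leaving out `dropped` words from the query.
        for subset in combinations(words, keep):
            hits |= boolean_and(subset)
        if hits:
            return sorted(hits)
    return []
```

For example, `nearest_match("Text Image Maps for Scaned Documents")` finds nothing with all six words because of the misspelling, then returns `['tr-2']` once one word is dropped. The number of subsets grows quickly, so a practical system would need a smarter way to choose which words to drop.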

Text-image maps. A key problem in following citations found in scanned documents is identifying the text that corresponds to the area on the page that the reader has circled. Jeremy Hylton developed a text-image mapping scheme that solves this problem. That work led to a paper on text-image maps for the WWW'94 conference, and Jeremy also wrote a report on that trip.
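To make the problem concrete, here is a minimal sketch of one possible text-image map: a list of words, each paired with its bounding box on the page image, queried with the rectangle the reader circled. This is an assumed data structure for illustration, not Hylton's scheme; the `Box` type, `text_in_region`, and the coordinates are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Box:
    """Axis-aligned rectangle in page-image pixel coordinates."""
    left: int
    top: int
    right: int
    bottom: int

    def intersects(self, other):
        return (self.left < other.right and other.left < self.right
                and self.top < other.bottom and other.top < self.bottom)


# Hypothetical map for part of one page: each recognized word with its box.
PAGE_MAP = [
    ("[12]", Box(100, 500, 140, 520)),
    ("Saltzer,", Box(150, 500, 230, 520)),
    ("1991.", Box(240, 500, 290, 520)),
    ("[13]", Box(100, 530, 140, 550)),
]


def text_in_region(region, page_map=PAGE_MAP):
    """Return the words whose boxes overlap the circled region, in reading order."""
    return " ".join(word for word, box in page_map if box.intersects(region))
```

Circling the first citation line, `text_in_region(Box(90, 495, 300, 525))`, yields `"[12] Saltzer, 1991."`, which could then be handed to a citation search.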

Citation de-duplication. Another problem in searching for documents is that a search sometimes returns a large number of non-obvious duplicate citations. Jeremy Hylton developed a two-tiered deduplication scheme using Adar's heuristic algorithm at one tier and vector matching on the second tier. He applied this scheme to a collection of 250,000 computer science citations and found that he could reliably not only collect together multiple citations to the same work, but also identify closely related records, such as work that has appeared in a workshop, a technical report, and a published paper.

The Digital Index for Works in Computer Science (DIFWICS) is an experimental demonstration of this work, which is described in detail in Hylton's thesis Identifying and Merging Related Bibliographic Records (M.Eng., February, 1996).
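The two-tier structure can be sketched as a cheap candidate filter followed by a more expensive vector comparison. The code below is an illustration under stated assumptions, not the scheme in Hylton's thesis: tier 1 here is a simple shared-word count standing in for the heuristic search, tier 2 is cosine similarity over word counts, and `min_shared`, `threshold`, and the sample strings are hypothetical.

```python
import math
from collections import Counter
from itertools import combinations


def tokens(citation):
    """Lower-cased words with surrounding punctuation removed."""
    stripped = (word.strip(".,:;()").lower() for word in citation.split())
    return [word for word in stripped if word]


def cosine(a, b):
    """Cosine similarity between the word-count vectors of two citations."""
    va, vb = Counter(tokens(a)), Counter(tokens(b))
    dot = sum(va[word] * vb[word] for word in va)
    norm = (math.sqrt(sum(c * c for c in va.values()))
            * math.sqrt(sum(c * c for c in vb.values())))
    return dot / norm if norm else 0.0


def cluster(citations, min_shared=2, threshold=0.6):
    """Group citation indices that pass both tiers, using union-find."""
    parent = list(range(len(citations)))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    for i, j in combinations(range(len(citations)), 2):
        shared = set(tokens(citations[i])) & set(tokens(citations[j]))
        if len(shared) < min_shared:  # tier 1: cheap filter (assumed stand-in)
            continue
        if cosine(citations[i], citations[j]) >= threshold:  # tier 2: vector match
            parent[find(i)] = find(j)

    groups = {}
    for i in range(len(citations)):
        groups.setdefault(find(i), []).append(i)
    return sorted(groups.values())
```

Comparing all pairs is quadratic, which is why a collection of 250,000 citations needs a first tier that an index can answer quickly; the all-pairs loop here is only for brevity.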

Document navigation (URI's). A library catalog needs a way of linking to on-line documents that does not capture short-lived information such as host, file, and directory names. To this end, we proposed a scheme to identify and locate technical reports using the Internet Domain Name System. Ali Alavi wrote a thesis in which he implemented the scheme, and we export a navigation service that can convert a Technical Report Identifier into a list of available representations or a URL that leads to its scanned images.
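The interface of such a navigation service can be sketched as two lookups on a location-independent identifier. Everything concrete below is hypothetical: the identifier `EXAMPLE-TR-1`, the `example.org` URLs, and the in-memory table that stands in for a name service. The sketch does not show how the proposed scheme actually encodes identifiers in the Domain Name System.

```python
# Hypothetical stand-in for the name service: identifier -> representations.
# Only this table changes when files move; catalog records keep the identifier.
NAME_SERVICE = {
    "EXAMPLE-TR-1": [
        {"format": "image/tiff",
         "url": "http://reports.example.org/tr-1/tiff/"},
        {"format": "application/postscript",
         "url": "http://reports.example.org/tr-1/tr-1.ps"},
    ],
}


def representations(identifier, service=NAME_SERVICE):
    """List the formats currently available for an identifier."""
    return [rep["format"] for rep in service.get(identifier, [])]


def locate(identifier, preferred="image/tiff", service=NAME_SERVICE):
    """Return a URL for the preferred format, or None if it is not available."""
    for rep in service.get(identifier, []):
        if rep["format"] == preferred:
            return rep["url"]
    return None
```

The catalog stores only `EXAMPLE-TR-1`; host, directory, and file names live in the service and can change without touching any catalog record.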

Integrity of long-persistence data. The group has given considerable thought to the problem of assuring that geographically dispersed replicas of databases the size of a digital library can be kept consistent over a long time in the face of inevitable media decay and possible site disaster. Because the requirements are quite different from those of a locally-replicated database, different algorithms and continual testing are required. Although a summary paper has not been written, several working papers and a thesis exploring some novel ideas are available.
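One ingredient of continual testing is a periodic audit that compares replicas and flags any copy that disagrees with the others. The sketch below shows a simple digest-and-majority-vote audit. It is a generic technique, not the group's algorithms, which are described only in the working papers; the `audit` function, site names, and object ids are hypothetical.

```python
import hashlib
from collections import Counter


def digest(data: bytes) -> str:
    """SHA-256 digest, so sites can compare objects without shipping them."""
    return hashlib.sha256(data).hexdigest()


def audit(replicas):
    """Compare replicas object by object.

    replicas maps site name -> {object id -> bytes}.
    Returns {object id -> sorted list of sites that differ from the majority}.
    A missing object counts as a disagreement. Ties are not resolved here.
    """
    suspects = {}
    object_ids = set().union(*(store.keys() for store in replicas.values()))
    for obj in sorted(object_ids):
        digests = {site: digest(store[obj]) if obj in store else None
                   for site, store in replicas.items()}
        majority, _ = Counter(digests.values()).most_common(1)[0]
        bad = sorted(site for site, d in digests.items() if d != majority)
        if bad:
            suspects[obj] = bad
    return suspects
```

With three sites where one copy has decayed, `audit` reports that site for the affected object, and the copy can be repaired from either of the others. A majority vote assumes failures are independent, which is one reason geographic dispersal matters.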

Digital Library Architecture. The group explored many architectural ideas and reached general agreement on the main architectural points (somewhat different from the conclusions of other projects), but did not complete a single coherent document. Instead we have a series of notes on the Architecture of the Digital Library.

Last update: November 23, 1997, by jhs
