The basic hypothesis of the project is that the technology of on-line storage, display, and communications will, by the year 2000, make it economically possible to place the entire contents of a library on-line and accessible from computer workstations located anywhere. The goal of Library 2000 is to understand and try out the system engineering required to fabricate such a future library system. The project's vision is that one will be able to browse any book, journal, paper, thesis, or report in the library using a standard office desktop computer, and follow citations by pointing--the thing selected should pop up immediately in an adjacent window.
The key technology required to realize this vision is low-cost disk storage. In the last decade, mass-produced disk storage has fallen in cost/bit by a factor of 1000. Projections of current research and development activity suggest that disks in volume production by the year 2000 will fall in cost/bit by at least another factor of 100. This change of five orders of magnitude calls for a complete rethinking both of what is economically feasible, and also of engineering for effective use. A back-of-the-envelope estimate suggests that the cost of the magnetic disk storage needed to hold scanned images of all the books and serials in a large library will, in the year 2000, be about equal to the annual budget for that library. When floor space is considered, the cost of holding images on magnetic disk will be substantially less than the cost of holding the paper form.
Three other technologies are also advancing at a pace that is likely to provide effective support to an on-line electronic library system:
Our research problem is not to invent or develop any of these four technologies, but rather to work out the system engineering required to harness them in a usable form. The engineering and deployment of large-scale systems is always accompanied by traps and surprises beyond those apparent in the component technologies. Finding things, linking things together (especially across the network), keeping the whole system reliable, and making the system last for a period of decades will all require new ideas.
This architecture is appealing because it allows modular, independent, competitive design and replacement of the user interface, the index services, and the storage services. It also permits a uniform user interface to a wide variety of different library collections as well as to personal files, mail, or other non-library databases. The architecture is simple and the functional boundaries match both physical and administrative boundaries. Finally, traditional library back office operations such as cataloguing, circulation, acquisition, journal control, etc., fit in gracefully as additional clients and servers.
Two technology observations interact to produce an interesting architectural consequence. First, magnetic disk memory has for many years cost between 10 and 100 times less per bit than random-access memory (RAM). Second, (compressed) scanned images require between 10 and 100 times more space than the corresponding ASCII text. The architectural consequence is that if one spends about the same number of dollars on each, there will be space in RAM for a complete index of the words stored in page-image form on the disk. This observation leads to an interest in index-preparation and searching algorithms that are optimized to operate directly in large random-access memory, even for very large databases.
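The arithmetic behind this consequence can be made explicit with a small sketch. The function names and all dollar figures below are hypothetical, chosen only for illustration; the sketch also assumes that a complete word index occupies about as much space as the ASCII text it covers (the index_overhead parameter), which the report does not state.

```python
def capacities(budget, ram_cost_per_mb, price_ratio, size_ratio, index_overhead=1.0):
    """Return (image_mb, index_mb, ram_mb) when the same budget is spent on disk and on RAM.

    price_ratio    -- how many times cheaper disk is than RAM, per megabyte
    size_ratio     -- how many times larger page images are than their ASCII text
    index_overhead -- assumed index size relative to the ASCII text (hypothetical)
    """
    disk_cost_per_mb = ram_cost_per_mb / price_ratio
    image_mb = budget / disk_cost_per_mb      # page images that fit on the disk
    text_mb = image_mb / size_ratio           # the same pages as ASCII text
    index_mb = text_mb * index_overhead       # assumed size of the word index
    ram_mb = budget / ram_cost_per_mb         # RAM bought with an equal budget
    return image_mb, index_mb, ram_mb


def index_fits_in_ram(*args, **kwargs):
    """True when the assumed index fits in the RAM bought with the equal budget."""
    _, index_mb, ram_mb = capacities(*args, **kwargs)
    return index_mb <= ram_mb


if __name__ == "__main__":
    # Hypothetical figures: $1000 each for disk and RAM, RAM at $100 per megabyte,
    # disk 50 times cheaper, images 50 times larger than text.
    print(capacities(1000, 100, 50, 50))
    print(index_fits_in_ram(1000, 100, 50, 50))
```

Cancelling the budget and the RAM price shows that the index fits whenever size_ratio is at least price_ratio times index_overhead. Since both ratios lie in the same range of 10 to 100, the condition holds when the two are roughly equal.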
The network protocol that connects the presentation client with the index and storage servers is stateless, to achieve robustness in the face of network and server failures. It uses unique identifiers, to allow separation of the index and the storage services. Storage servers are replicated with wide geographical diversity, to counter threats to persistence over time periods measured in decades.
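A minimal sketch of how these three properties fit together follows. It is not the project's protocol: the class names, the request fields, and the identifiers such as "doc-1" are all hypothetical, and the servers are simulated in-process rather than reached over a network. Each request carries everything the server needs, the index returns only identifiers, and the client retries the same self-contained request against another replica when one fails.

```python
class IndexServer:
    """Maps words to unique identifiers; knows nothing about where objects are stored."""

    def __init__(self, postings):
        self.postings = postings  # word -> set of identifiers

    def handle(self, request):
        # The reply depends only on the request, never on earlier requests.
        return sorted(self.postings.get(request["word"], ()))


class StorageReplica:
    """Maps unique identifiers to stored objects; one of several identical copies."""

    def __init__(self, objects, up=True):
        self.objects = objects    # identifier -> stored bytes
        self.up = up              # simulates a failed or unreachable server

    def handle(self, request):
        if not self.up:
            raise ConnectionError("replica unreachable")
        return self.objects[request["id"]]


def fetch(object_id, replicas):
    """Try each replica in turn; no session state has to be rebuilt after a failure."""
    request = {"op": "fetch", "id": object_id}
    for replica in replicas:
        try:
            return replica.handle(request)
        except (ConnectionError, KeyError):
            continue
    raise LookupError(f"no replica could supply {object_id!r}")


if __name__ == "__main__":
    index = IndexServer({"library": {"doc-1", "doc-2"}})
    pages = {"doc-1": b"page one", "doc-2": b"page two"}
    replicas = [StorageReplica(pages, up=False), StorageReplica(pages)]
    for object_id in index.handle({"op": "search", "word": "library"}):
        print(object_id, fetch(object_id, replicas))
```

Because the identifier is the only thing shared between the two services, either one could be replaced or relocated without changing the other.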
This online catalog has now become the primary catalog of the LCS/AI reading room; the paper catalog was closed in January 1991. Several support systems, including a simple emacs-based cataloguing system, were also developed, so that operations of the reading room could be transferred entirely to the new online catalog. A shelf-reading project is underway, and as of the end of the reporting year all proceedings, all books received since 1980, and all journals are fully catalogued and checked. All technical reports received since January 1991 are also catalogued; about half of those received before that date are also represented in the catalog. The staff of the reading room reports that circulation of technical reports has increased markedly since the online catalog became available.
During the fall, the entire system migrated from a borrowed VAX server to a set of three replicated IBM RS-6000 servers, with a view towards scaling up to much larger quantities of data. The index was recoded from PERL (an interpreted language) to C, and it is now in the midst of a complete design revision. By taking advantage of the large RAM available on the RS-6000s, this revision is intended to provide the high performance needed to index the 700,000-record M.I.T. library catalog. During the spring, this new index server was tested on a 25% sample of the M.I.T. catalog with good results; the complete index system should be ready to deploy some time this summer. Mitchell Charity did most of the work on the prototype design and implementation.
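To illustrate what an index held entirely in RAM can look like, here is a minimal sketch of an in-memory inverted index. It is not the project's design, which the report does not describe: the class name, the tokenizing rule, and the record identifiers are hypothetical, and a production index for 700,000 records would need a far more compact representation than Python sets.

```python
import re
from collections import defaultdict


def tokenize(text):
    """Split text into lowercase alphanumeric words (an assumed, simplistic rule)."""
    return re.findall(r"[a-z0-9]+", text.lower())


class MemoryIndex:
    """Inverted index kept in RAM: each word maps to the set of records containing it."""

    def __init__(self):
        self.postings = defaultdict(set)

    def add(self, record_id, text):
        for word in tokenize(text):
            self.postings[word].add(record_id)

    def search(self, query):
        """Return the records that contain every word of the query."""
        words = tokenize(query)
        if not words:
            return set()
        # .get avoids creating empty entries for words that were never indexed.
        sets = sorted((self.postings.get(w, set()) for w in words), key=len)
        result = set(sets[0])          # start from the rarest word
        for other in sets[1:]:
            result &= other
        return result


if __name__ == "__main__":
    index = MemoryIndex()
    index.add("rec-1", "Operating Systems for the 90's and Beyond")
    index.add("rec-2", "File System Indexing, and Backup")
    print(index.search("system backup"))
```

Starting the intersection from the shortest posting set keeps the intermediate result small, which matters more as the number of indexed records grows.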
Three related projects were completed this year. First was a parser for library standard (MARC) records, by undergraduate Art Min. Second was a preliminary prototype of an X-based user interface to the online catalog, initiated by Mitchell Charity with further development by Jeremy Hylton. The third was a completely different user interface based on the Organization Engine, an information organizing system in use at the Digital Equipment Corporation Cambridge Research Laboratory. This interface was developed as an undergraduate thesis at Brandeis University by Ron Weiss.
Currently, Manish Muzumdar is developing a general client interface toolkit, and Robert Miller is working on the indexer overhaul.
The Distributed Library Initiative is a new, joint activity of the M.I.T. Libraries and M.I.T. Information Systems. Its purpose is to develop and experiment with new technologies for the M.I.T. Library system. The DLI is installing prototypes of possible future library electronic delivery systems, and is working with publishers who are undertaking experiments with electronic distribution systems. Library 2000 is working with DLI at several levels to help develop the DLI plans, to look for opportunities to avoid unnecessary mismatches of data formats, protocols, and programming interfaces, and to make joint use of software where feasible. The acquisition by Library 2000 of the M.I.T. library catalog and its update stream is one tangible result of this collaboration.
The technical report collaboration represents an extension of the prototype system in another interesting direction, that of linking indexed text with page images. The plan of the collaborators is that each university (initially Stanford, the University of California at Berkeley, Cornell, Carnegie-Mellon, and M.I.T.) would place page-images of its computer science technical reports in a local server. Each would also prepare electronic bibliographic records and distribute them to the other participants. Each participant would then take this set of bibliographic records, place it in some local index/access system, and work out ways of providing links from those records to the local and distant page image servers. This plan has progressed to the point that the Corporation for National Research Initiatives, acting as coordinator, has made a proposal that DARPA fund the project. In anticipation of funding, an online discussion group has already developed a proposed format for exchange of the bibliographic records. The M.I.T. portion of the proposal is actually a joint activity of Library 2000 and the M.I.T. Library System, with the library system planning to do scanning and paper delivery, and a potentially expanded version of the Library 2000 prototype as the underlying search, storage, and presentation system.
The addition of page images actually leads to extensions in several different areas, each exposing a variety of new system design problems:
Two other, related projects are also contemplated:
These plans are actually too ambitious when compared with the intended size of the research group; some selection will occur.
Saltzer, J. H., "File System Indexing, and Backup," in Operating Systems for the 90's and Beyond, Lecture Notes in Computer Science 563, edited by Arthur Karshmer and Juergen Nehmer, Springer-Verlag, New York, 1991, pp. 13-19.