The basic hypothesis of the project is that the technology of on-line storage, display, and communications will, by the year 2000, make it economically possible to place the entire contents of a library on-line and accessible from computer workstations located anywhere. The goal of Library 2000 is to understand the system engineering required to fabricate such a future library system. The project's vision is that one will be able to browse any book, journal, paper, thesis, or report using a standard office desktop computer, and follow citations by pointing--the thing selected should pop up immediately in an adjacent window.
Our research problem is not to invent or develop any of the three evolving technologies, but rather to work through the system engineering required to harness them in an effective form. The engineering and deployment of large-scale systems is always accompanied by traps and surprises beyond those apparent in the component technologies. Discovering relevant documents, linking archival materials together (especially across the network), and maintaining integrity of an archive for a period of decades all require new ideas.
This is the third reporting year of the project. The first two annual reports laid out the vision and described the overall architecture of a testbed system. This year we report the first completed academic theses, the start of production scanning, and significant progress in designing and implementing the architecture. We have also met several times with the other members of an informal consortium, known as the CSTR group, which by agreeing to exchange on-line materials is developing both a useful library of computer science technical reports and interoperability experience. The other members of the CSTR group are Stanford University, the University of California at Berkeley, Carnegie-Mellon University, Cornell University, and the Corporation for National Research Initiatives.
The scanning station currently consists of a network-attached MacIntosh Quadra 840AV equipped with 64 Mbytes of RAM, 4 Gbytes of disk, one DAT, two SCSI channels, and two SCSI-attached scanners: an Hewlett-Packard IICx and a Fujitsu 3096G. Software was initially a package named Optix, from Blue Ridge Technologies, Inc.; we recently switched to a package named Cirrus, from Canto Technologies, together with a workflow manager written in AppleScript. As we gain production experience and conquer work flow bottlenecks, we expect this configuration to increase in disk storage and DAT capacity.
Our overall scanning strategy is to locate the earliest-generation originals that are available, and extract the maximum amount of information possible with production-capable hardware. The two scanners are capable of a resolution of 400 pixels per inch, with eight bits of grey-scale per pixel. (The HP scanner can also provide 24-bit color, but we are using that ability only for individual pages that intentionally make use of color.) This resolution produces image files that are quite large, about 16 Mbytes per scanned page, thus requiring careful organization of storage and workflow. At the current rate of scanning, we are acquiring about 1 Gigabyte of raw data per day. Our approach is immediately to compute a checksum signature of the raw data, archive that raw data to tape, and then to (reversibly) compress and transfer it out of the scanning station to a server site for additional processing before storing it for distribution. Currently, this additional processing consists in reducing the data to an agreed-upon standard interchange form: 600 pixels per inch, one bit per pixel, using CCITT Group IV FAX compression in TIFF format. In interchange form, a typical page occupies about 100 Kilobytes of storage, about 0.5% of the size of the raw data. Only about 7% of the originally scanned information is retained in this reduction; the remaining factor of ten comes from compression. The archived raw data remains available for experiments and for future uses when management techniques for larger volumes of data are more refined.
Because the volumes of data are so large, bottlenecks of various consequence turn up at every step that precedes reduction to interchange form (scanning time, disk writing time, DAT writing time, checksum calculation time, compression time, network file transfer time, and reduction itself), and the development of an effective workflow has involved quite a bit of trial, backtracking, and alternative-searching. Initial production work therefore used second-generation originals, and because of complications in matching software and hardware capabilities, was carried out at 300 dpi, 8-bit grey-scale. About 40 technical reports, totalling 3900 pages and about 40 Gbytes of raw data have been scanned as of June 30, 1994. Software is now in place to allow scanning at 400 dpi, and enough experience has been gained that a cutover to first-generation originals is expected sometime later in the summer of 1994.
It is apparent that when scanning a document, one should capture more than just the bits of the images, but it is not yet clear exactly how much more. Information about the resolution of the original scanner is needed to display the images correctly. Information that image 59 is a color scan and image 60 is a monochrome scan of the same page is needed to interpret the set of images properly. Information about pagination and position of blank pages is needed for a browser to display the images in the proper order when a user says "next page" or "go to page 43". Scans of standard test targets may also be added to the set of images of a document to allow calibration of output devices. We have begun to design a standard scanned document description record, whose purpose is to capture these various categories of information in a uniform, machine-readable form.
We are also exploring various issues that arise when scanning in gray-scale, such as scanner calibration and quality control in the absence of output devices that have information density as high as that of the scanners. We are experimenting with test targets from the Association for Information and Image Management and the Institute of Electrical and Electronic Engineers to see whether or not scanner calibration can be checked by computer analysis of the scanned images of those test targets, rather than visually. With this approach, we hope that visual quality control can be simplified to looking for obvious errors such as missing, askew, or upside-down pages.
In a closely-related task, Jeremy Hylton and Yoav Yerushalmi wrote utilities to convert 300 dpi and 400 dpi gray-scale images to 600 dpi monochrome form for printing. These utilities are two of several needed at various processing points for images. The basic issue being explored here is that different image transformations are needed for printing, display on various output devices, and for long-term storage. To speed up any particular display situation, one can do precomputation of the required form, but holding the results until they are needed requires storage and precomputation takes time that is wasted if the result is not actually used. The tradeoffs are complex and constantly changing with technology.
Detailed design, configuration, and debugging of the scanning station and its internal workflow have been done largely by Jack Eisan, Jeremy Hylton, and Andrew Kass, with software contributions by Eytan Adar. Overall image processing workflow has been addressed by the entire group, with additional advice from Mary Ann Ladd and Sally Richter, who manage the publications of the Laboratory for Computer Science and the Artificial Intelligence Laboratory, respectively.
The CSTR group has agreed to use RFC-1357 as a standard bibliographic exchange record. That standard contains an open-ended (which really means not-yet-defined) "RETRIEVAL" field, intended to provide the information needed to allow automated location of the rest of the document, such as the name of an FTP site and the name of the file at that site to transfer. It also defines an "ID" field, a standard two-component document identifier, such as STAN//CSL-TR-91-004. This identifier consists of, before the double slash, the identifier of the distributor of the document and, after the double slash, the identifier that this distributor uses for the document.
In discussions about use of RFC-1357 it became apparent that the concept of embedding retrieval information in a RETRIEVAL field is flawed. All of the other information in a bibliographic record is relatively permanent, because it describes the document itself, which doesn't change very often. But the to-be-defined RETRIEVAL field presumably describes the network site, the method of obtaining the document, and the available formats, all three of which we can expect to change (perhaps frequently) with technology and with development of new ideas. In addition, this kind of retrieval information is properly the province of technologists, not of librarians; it is the only field in the bibliographic record that is not determined by bibliographic considerations. One can thus argue that including this kind of retrieval information explicitly in the bibliographic record violates some kind of packaging or abstraction boundary. The value in maintaining a set of information items about a document that is purely bibliographic is that the resulting bibliographic record can be freely circulated and there need be no inhibition about storing it for an indefinite time and forwarding copies to others.
Ali Alavi undertook a Master of Engineering thesis to explore the following idea: rather than adding a RETRIEVAL field to the bibliographic record, consider the ID field to be a universal name for the desired document, and present that name to a network name service, which would return the temporally fragile information needed for retrieval. With help from Mitchell Charity, Alavi designed and implemented this system using the Internet Domain Name Service (DNS) to associate each ID as defined in RFC-1357 with an Internet Uniform Resource Locator (URL), a standard form for retrieval information that was developed elsewhere for the World-Wide Web. He primed a DNS server with a set of ID-URL pairs for several thousand currently on-line technical reports of the five universities in the CS-TR group.
To demonstrate the usefulness of this approach, Alavi's thesis went one step further, modifying the user interface of the Library 2000 catalog system so that whenever a search result includes a technical report,the system submits the ID found in the catalog to the DNS ID-URL server. If the server returns a URL the modified user interface pops up an extra active button on the user's screen, giving the user the ability to retrieve the on-line version of the report.
This demonstration was successful, and it is now the basis for discussions with the CSTR group on the relation between ID's and "handles," a document registration system proposed by CNRI. We are also monitoring the work of the Internet Engineering Task Force on proposals for Uniform Resource Names and Uniform Resource Characteristics, two closely-related ideas.
At a higher level of linking, Eytan Adar developed a program to take technical report citations as found in typical published documents and precisely identify the corresponding documents by performing a catalog search. This problem, which seems easy at first, is severely complicated by the existence of varying citation styles, multiple forms for author and publisher names, incorrect and incomplete citations, and other similar real-world variations. As a result, heuristics seem to be an essential component of any automated lookup process. Adar's program selects several independently random sets of words from the original citation, performs a catalog lookup for each such set, and compares the results. It presents to the user the document or documents that matched several of the independent sets. The system has proven to be surprisingly effective for such a simple heuristic.
This citation lookup system is especially interesting when coupled with the text-image map developed by Jeremy Hylton. This map relates scanned images to the corresponding ASCII text. The goal here is to allow two symmetric features. In one direction the user selects a region of interest in a scanned image that contains, for example, a citation to another document or a page number. The image-text map allows the system to identify the ASCII text that corresponds to the selected region. In the case of a citation, the system can then invoke Adar's citation lookup program to locate the cited document. Or, if the selected region contains a page number, the system can, using the scanned document description record mentioned earlier, identify which page image corresponds to that page number and display it.
In the other direction, when a user initiates a catalog search using particular words or phrases and asks to display a document that contains them, the text-image map allows the system to locate the appropriate page of that document and highlight the region of the image that contains the words or phrases of the original search.
Hylton worked out methods of representing text-to-bounding box mapping information and did initial demonstrations of both page number lookup and citation search. He also looked into exploiting a feature of some OCR software, such as Xerox ScanWorX, that returns not just ASCII text but also the coordinates of the bounding boxes in which it found the text.
As another linking and interoperability demonstration, group members created a tool to print technical reports from images. This package, when given the name and page numbers of a technical report at a different CSTR member site, will locate that technical report at the other site's storage service, grab the requested page images, and send them to a nearby printer.
Our current thinking on the design of this survival system is that it should be organized in independent components. A first component is a data reader, associated with each replica, that continually compares that replica's current data with a record of yesterday's data. A second component is a set of cross-checkers that compare the replicas. A third component performs any needed repairs to the data, based on the discoveries of the first two. The objective of the project is to create a design for these components that is so straightforwardly simple that one can make a credible claim that it will work correctly. Tukovinit documented this multi-component model, and Yoav Yerushalmi and Mitchell Charity are now refining and prototyping the model. A major point of departure (and simplification) of the model they are currently working with is that a storage server should be immutable except for addition of new files of data.
To this end, Library staff are working on the organizational issues of processing data: the bibliographic record, the database, and scanning and image management. Service issues will also be explored: the delivery and presentation of this information to users, and the integration of this service into the Libraries' array of services. Finally, the Libraries will serve as an evaluation testbed for the service as it is made available widely to the M.I.T. community.
Within the five-member CSTR group, library representatives have also engaged these discussions. The CSTR group is planning for the day when the research project will end, at which time the CSTR libraries must be prepared to continue the services developed and to support an operational system that is sustainable, maintainable, and scalable to include other academic materials beyond computer science technical reports. Greg Anderson is coordinating an effort by the CSTR library representatives to identify, understand, resolve, and build the structures necessary for an operational system. At the CSTR research group meeting in Palo Alto in Februrary 1994, he presented an outline of those issues identified to date. There are three areas of focus: processing, services, and evaluation.
Specific issues related to processing of technical reports include agreement on continued use of the bibliographic record format (RFC 1357) and that format's relationship with library catalogs and MARC (Machine Readable Cataloging) records. The library group is participating in the discussion of "handles" and the linking conventions between index records and the information content object and is trying to understand how to create and manage handles in a scalable, production workflow. Each campus will have ongoing responsibilities for the database of its own cstr images, and the library must be prepared to support, grow, and maintain the database. Libraries must be able to continue to convert existing paper-based technical reports for the foreseeable future and therefore must understand the issues related to operational scanning operations and image file management. Over time, however, some technical reports may exist only in electronic form, and libraries must be prepared to accomodate them in their normal workflow.
The CSTR group libraries have many issues related to the delivery of this information to their campus communities. The libraries have begun to address the presentation of technical reports online, their indexing and retrieval, and their integration into the library's existing array of networked, electronic services. The library group will try to list and analyze the various clients being used on the consortium campuses and will work to integrate the CSTR services coherently into the libraries electronic offerings.
The library group believes that the CSTR project should begin planning for the evaluation of the project as it nears completion and to be prepared to feed those evaluation results into the ongoing development of the operational system.
The CSTR library group will continue to delineate this set of issues for the consortium and will intensify its planning effort in the next year.
Mitchell Charity spent a fair amount of time exploring the current market in magnetic disks and disk packaging. That exploration led to two conclusions:
The discounts available for purchasing magnetic disks in large quantities (on the order of 10%) are small relative to the rate at which market prices for disks are dropping with time (5% per month) so there is no point in buying ahead of need. A just-in-time ordering system (perhaps having the computer system note that it is filling up the last disk and automatically printing a purchase order for the next one) seems to be the right strategy.
Packaging at the level of placing 3.5-inch magnetic disk drives in arrays has not kept pace with the disk drives themselves. The best available disk packaging systems are very space-inefficent, wasting 75% of the box volume. We did find one very promising example; Digital Equipment has developed a nicely space-efficient system of disk packaging, but DEC hasn't yet figured out that it should market this packaging system as a separate product.
On this basis, we acquired an interim packaging system to allow us to store the images coming from the scanning system. We are hopeful that future orders of packaging systems may be able to take advantage of the developments at DEC. As of the end of June, 1994, the storage servers were equipped with 22 Gbytes of storage, mostly in the form of 2.0 Gbyte 3.5-inch disks.
To obtain an additional storage server for replication experiments we worked out an arrangement with M.I.T. Information Services in which they provide an otherwise disused RS/6000 520, we equip it with magnetic disks, and we share the resulting server between the CSTR project and the TULIP project, another image-delivery experiment in which we are jointly participating. In addition to the opportunity to share hardware resources, we are getting some additional benefit by mixing and matching software packages, some of which we are developing and some of which they are developing.
Eytan Adar established a Wide Area Information Service (WAIS) source consisting of the entire set of RFC 1357 technical report bibliographic records provided to us by the five CSTR institutions, updated daily.
Jeremy Hylton created a World-Wide-Web home page, organized older materials of the research group, and inspired the other members of the group to create a collection under that home page that provides a showplace and distribution center for our research work. The Library 2000 home page can be found by following the path M.I.T. --> Laboratory for Computer Science --> Library 2000 group.
Charity, Mitchell N. "Multiple Standards? No Problem." Proceedings of Digital Libraries '94 (June 19-21, 1994) College Station, Texas, pages 203-204.
Yuk Ho. Application of Minimal Perfect Hashing in Main Memory Indexing. Master of Engineering thesis, June, 1994.
Kom Tukonovit. Fault Detection in L2000 Storage-Replication Service. Advanced Undergraduate Project, May, 1994.
Anderson, T. Gregory. "The Computer Science Technical Report Project." Coalition for Networked Information meeting, Washington, D.C., April 5, 1994.