The basic hypothesis of the project is that the technology of on-line storage, display, and communications will, by the year 2000, make it economically possible to place the entire contents of a library on-line and accessible from computer workstations located anywhere. The goal of Library 2000 is to understand the system engineering required to fabricate such a future library system. The project's vision is that one will be able to browse any book, journal, paper, thesis, or report using a standard office desktop computer, and follow citations by pointing--the thing selected should pop up immediately in an adjacent window.
Our research problem is not to invent or develop any of the three evolving technologies, but rather to work through the system engineering required to harness them in an effective form. The engineering and deployment of large-scale systems is always accompanied by traps and surprises beyond those apparent in the component technologies. Discovering relevant documents, linking archival materials together (especially across the network), and maintaining integrity of an archive for a period of decades all require new ideas.
This is the fourth reporting year of the project. The first three annual reports laid out the vision, described an overall architecture, and reported the development of a production scanning system. This year we report on improvements in the production scanning operation, including addition of quality control. We also have made progress on representation of scanned documents, automatic linking mechanisms, and archival replication. In conjunction with the other members of an informal consortium, known as the CSTR group, we have made our scanned images available to the public using the Dienst server software developed by Jim Davis and Carl Lagoze of Cornell University. The other members of the CSTR group are Stanford University, the University of California at Berkeley, Carnegie-Mellon University, Cornell University, and the Corporation for National Research Initiatives.
The scanning station currently consists of a network-attached Macintosh Quadra 840AV equipped with 64 Megabytes of RAM, 12 Gigabytes of disk, four DAT drives, two SCSI channels, and two SCSI-attached scanners: a Hewlett-Packard IICx and a Fujitsu 3096G. The software used with the Fujitsu scanner is a package named Optix, from Blue Ridge Technologies, Inc.; with the HP scanner we use a package named Cirrus, from Canto Technologies, together with a workflow manager written in AppleScript. Production work is done with the Fujitsu scanner, and the HP scanner is intended for exceptional items such as color photographs. Adjacent to the scanning station is an IBM RS/6000 workstation configured as a quality control display system, and a nearby Hewlett-Packard model 4SiMx, 600 dpi printer also supports quality control.
Our overall scanning strategy is to locate the earliest-generation originals available and to extract the maximum amount of information possible with production-capable hardware. The two scanners can acquire 400 pixels per inch with 8 bits of grey scale; this resolution produces image files that are quite large, about 16 Mbytes per scanned page, thus requiring careful organization of storage and workflow. At the current rate of scanning, we are acquiring about 4 Gigabytes of raw data each day. Following each day's work, an unattended overnight job archives the raw data to Digital Audio Tape and transfers it out of the scanning station to a server site for additional processing before storing it for distribution. Currently, this additional processing consists of reducing the data to two forms: an agreed-upon standard interchange form of 600 pixels per inch, one bit per pixel, using CCITT Group IV FAX compression in TIFF format; and a 100 pixel per inch, 5-bit form for fast display and quality control. These two forms remain on-line; the archive DAT retains the high-resolution data for future use.
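The storage arithmetic behind these figures can be checked with a short sketch. The page dimensions (8.5 x 11 inch US-letter originals) and the pages-per-day figure are assumptions for illustration; the report states only the resolution, bit depth, and daily data volume.

```python
# Estimate raw image size for one scanned page at 400 dpi, 8-bit grey scale.
DPI = 400
BITS_PER_PIXEL = 8
WIDTH_IN, HEIGHT_IN = 8.5, 11.0    # assumed US-letter original

pixels = (WIDTH_IN * DPI) * (HEIGHT_IN * DPI)      # roughly 15 million pixels
bytes_per_page = pixels * BITS_PER_PIXEL / 8       # roughly 15 Mbytes per page

# At an assumed 300 scanned pages per day, the raw data volume is:
pages_per_day = 300
gbytes_per_day = bytes_per_page * pages_per_day / 2**30

print(round(bytes_per_page / 2**20, 1))   # Mbytes per raw page
print(round(gbytes_per_day, 1))           # Gbytes of raw data per day
```

The result lands in the neighborhood of the "about 16 Mbytes per scanned page" and "about 4 Gigabytes of raw data each day" figures quoted above.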
Scanning production reports

                                     Gbytes      pages    pages/week
                                   archived
  Cumulative total, June 30, 1994        40       3900          40
  July 1 - September 30, 1994            62       2161         165
  October 1 - December 30, 1994          86       2800         215
  January 1 - March 31, 1995            146       8440         650
  April 1 - June 30, 1995               165      12086         930
  Cumulative total, June 30, 1995       499      29430         400
Improvements in scanning rate have involved several steps, many of them seemingly mundane or straightforward, but together they illustrate the intricacy and complexity of a large-scale scanning activity. The major ones are described below.
Our next target is 2000 pages per week, but a difficult bottleneck that has recently emerged is the frequency of paper feed jams. In many cases the originals of the TR's have warped slightly during storage, or for some reason stick together, leading to jams in the automatic paper feeder of the scanner. We have seen this problem coming for some time, and have tried a number of strategies to deal with it, so far without success.
During the year, we began placing the scanned technical reports on-line for public access. In addition to images obtained from the production scanning operation, we are now receiving authoritative PostScript for many new technical reports and converting it directly to image form. The numbers currently available are:
                           reports      pages
  post-scan processing         157       9300
  postscript conversion         58       4600
  total                        215      13900
These numbers indicate that not all scanned reports have been placed on-line. Images are scanned during the day; that night, during unattended operation, we make archive copies on Digital Audio Tape (DAT) and transfer the image files by FTP to be processed for on-line placement. However, if for some reason the unattended FTP is not successful within 24 to 48 hours, we have chosen to delete the scanned images from disk to make room for the next day's scanning--typically we generate four to six gigabytes of raw data per day--rather than hold up scanning until post-processing is complete. This choice allowed us to attack scanning bottlenecks and post-processing bottlenecks separately. The strategy seems to have been successful, as shown by the rapidly increasing production rate. By year's end, individual reports of up to four gigabytes (250 pages) were being FTP'ed and processed. The miss rate has been declining, and missed reports will eventually be restored retrospectively from DAT and processed.
As scanning volume increased, it was necessary to install larger disk drives, so two 4-Gigabyte disk drives were acquired for the scanning station. To use these larger drives properly required installation of Mac Operating System 7.5, which in turn raised a number of compatibility problems with other software. Resolution of these problems took careful analysis and interfered with production for about two months, but the overall impact of the change was a major increase in the production scanning rate. In addition, larger technical reports, up to 250 pages in size, can now be accommodated without the special handling that was required when they had to be split up across several disks.
The larger disk space has allowed us to abandon compression on scanned images before archiving them to DAT. Instead, the raw, unprocessed bit-map images (15 Mbytes per page) are now being preserved in the off-line medium. The use of compression in archiving has been a point of considerable discussion. In this case, the time spent compressing the images was larger than the time saved in archiving, so the overall production rate increased when we abandoned compression. In addition, we have always been concerned about increased fragility that comes with removing redundancy from archived materials. The discovery that some previously archived and compressed images, when retrieved and decompressed, were unopenable as images convinced us that our concern for fragility was well-placed.
During the year we had three failures of 4-Gbyte drives. All were replaced under warranty; as best we can tell from inspection of the replacements, on the first two failures the vendor increased the air flow through the drive cabinets to reduce their internal temperature; the third failure was of a power supply. Although we have been careful to install only disk drives that have had several months of field experience elsewhere, that is not enough to ensure that all design problems have been found. It would be better to use drives that have 18 months or more of field experience, but the disk drive business currently is changing too rapidly for that approach to be economic. We believe that the six-month experience accumulation that we currently require represents the best current trade-off between up-to-dateness and seasoning, though a production activity that did not have the same level of local technical expertise might find it more effective to insist on a longer accumulation of field experience.
Finally, near the end of the year, we instituted a quality control stage as part of the scanning operation. Several group members formed a task force to develop quality control criteria and workflow, and at the end of the year a first draft description of the quality control activity was under review. Currently, images, file transmission, and processing activity are being monitored. Mechanisms for reporting problems, recording quality control completion, and archiving the result are being investigated.
  - the document scanning record
  - a text specification of the document scanning record
  - a set of image files
  - a file containing sizes and checksums of all the other files
The checksum file is intended to be merged into the document scanning record, but current software tools aren't up to the job, so for the moment it is a separate file.
The document scanning record has gone through a number of versions, and is likely to continue to be refined as more experience is gained in its use and in its ability to support archiving and document browsing. The current version is identified as CSTR.1.3, and a specification of its contents can be found in
Abstract: With the advent of fast global digital communication networks, information will increasingly be delivered in electronic form. In addition, as libraries become increasingly computerized, not just card catalogs but entire books will be stored on-line. When delivering digital documents from on-line libraries, two problems must be addressed: delivering a document that may be available in multiple formats, and preserving relationships among the different formats. This thesis describes a system for delivering digital documents, called digiments, which can be in multiple arbitrary formats, while preserving the relationships between the parts and meta-data about the document itself.
The Need for a Digital Document: For immediate transfer of data, a plethora of methods, formats, and protocols may be used, both for the transfer and for the content itself. However, for transfer between different systems, or for transfer of a document whose content may be in arbitrary or multiple formats, a standard way of representing a document is needed. In addition, documents carry information that is not necessarily reflected in their content, such as copyrights, publisher, and ISBN. Furthermore, this information, even if present, needs to be available in an architected way for display. This "meta-information" needs to travel with the content of the document, regardless of the format the content is in.
By defining a digital document and creating a standard for its transmission, it is possible to solve these problems. By allowing the content to be in any format, but having the digital document describe the format, it is possible to use any format, present or future, for the content itself. By creating a standard way to describe the document's meta-information and content format, it is possible to transmit the document without having to worry about what format it is stored in. Finally, documents can be stored and delivered in multiple formats, making it possible to view both the scanned images and ASCII text of a document.
Digital Documents: In order to create on-line documents, we need to define what a digital document is. In this model, a digital document is data in arbitrary format accompanied by meta-information, which is part of a document, but not necessarily contained in the document itself. In order to easily transfer digital documents, I am creating a set of MIME types, which describe the format of a document and contain the meta-information, while still allowing arbitrary data formats.
What exactly do we mean by a document? A document can be a book. It can be a magazine or periodical. It can be a bibliography of a technical field. A technical report is a document. Instruction manuals, TV listings, newspapers, rulebooks, and catalogs can all be considered documents. What do we mean by a digital document, or a document stored on a computer? A digital document can be composed of a collection of images in GIF, JPEG, TIFF or other format. It can be a PostScript, LaTeX or ASCII text file or set of files. It can be a set of images, the ASCII text, and a text-to-image map that relates the two. It can include sound. In short, there is no easy definition for a digital document. Nor do digital documents correspond well to what we call documents in the physical world.
For these reasons, we need to create a new type of object, called a digiment (for "digital document"). Defining a digiment will allow us to achieve some of the goals expressed above. It will provide a standard way to pass around digital documents, regardless of the format in which the content itself is stored. It will provide a standard for archiving and retrieving digital documents. It will provide for the use of multiple representations of a single "document" that can be linked together, as with a text-to-image map. It is intended to accommodate the use of future data formats transparently. And it will provide meta-information about a document, such as author, publisher, and copyrights. For comparison, in a book, the text on the pages is the content, while the meta-information, such as the ISBN, copyright, and publisher, is found on the inside of the cover page.
The digiment standard is not a specific data format for storing or transmitting the content of digital documents. Rather, it is a container for transmitting or storing documents in arbitrary formats. A digiment consists of the data itself, which can be in any form, along with associated meta-information and structural information. This additional information is what differentiates a digiment from a simple set of images or data files. The structural information specifies the format that the data itself is stored in, whether it is image, text, PostScript, etc. It also specifies how the different data parts of the digiment are related to each other. The meta-information includes bibliographic information about the digiment, such as the author, publisher, copyrights, distribution rights, and relationships between different formats. By specifying the digiment in this way, it is possible to use any type of format for storing the actual data, including formats which may be created in the future.
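As a rough illustration of the container idea, a digiment can be modeled as meta-information plus a list of typed content parts. The class and field names below are hypothetical, not the thesis's actual MIME type definitions:

```python
from dataclasses import dataclass, field

@dataclass
class Part:
    """One representation of the document's content, in an arbitrary format."""
    fmt: str      # structural information, e.g. "image/tiff" or "text/plain"
    data: bytes

@dataclass
class Digiment:
    """Container: content in arbitrary formats plus architected meta-information."""
    meta: dict                                     # author, publisher, copyrights, ...
    parts: list = field(default_factory=list)      # the content itself, any formats
    relations: list = field(default_factory=list)  # e.g. a text-to-image map

    def add_part(self, fmt, data):
        self.parts.append(Part(fmt, data))

    def formats(self):
        """List the formats in which this document is available."""
        return [p.fmt for p in self.parts]

# A report held both as scanned images and as ASCII text of the same content:
d = Digiment(meta={"author": "A. Author", "publisher": "MIT/LCS"})
d.add_part("image/tiff", b"...scanned page images...")
d.add_part("text/plain", b"...ASCII text of the same report...")
print(d.formats())   # ['image/tiff', 'text/plain']
```

The point of the sketch is that the container never interprets the part data; future formats can be added without changing the container itself.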
The completed thesis has been deposited in the M. I. T. library, and is currently available on the World-Wide Web at the URL
The explosive growth of the World-Wide Web over the last two years underscores the realization that the technology exists -- or will soon exist -- to build the digital libraries envisioned since the earliest days of computing. The experience of using the Web also makes it clear that linking and resource discovery are complicated problems, at least in part because of the sheer quantity of data available.
Resource discovery is difficult not only because there is so much data to sift through, but also because so many varying copies of resources exist. When performing a search with Archie, an index of anonymous ftp sites, a query for a particular application will often discover dozens of copies of the same resource and return a long search result set that contains little information.
One technique that can be used to manage the discovery and location of large quantities of data is deduplication. Exactly as the name suggests, deduplication is the process that narrows a list of resources by eliminating any duplicate references to a resource.
Deduplication is an interesting problem because it is difficult to decide when two resources are duplicates. Consider two possible uses of a digital library: A student searches for a copy of Hamlet in the library; the digital library actually has many copies of Hamlet -- some by different editors, some in different collections -- but the student only wants to know that a copy of Hamlet is to be had. For the student, deduplication of the list to one or two entries provides a valuable simplification. On the other hand, a Shakespeare scholar may want to find a copy of one of the early printed editions of Hamlet. For that scholar, deduplication needs to work within a narrow sense of equality. The scholar does not need to know that there are three libraries with copies of the first quarto, but does need to know that copies of the first and second quartos exist.
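The broad-versus-narrow notion of equality can be made concrete with a key function: deduplication keeps one entry per distinct key, and the choice of key decides how strict "duplicate" is. A minimal sketch, with invented record fields for illustration:

```python
def dedup(records, key):
    """Keep the first record seen for each distinct key; drop the rest."""
    seen, result = set(), []
    for r in records:
        k = key(r)
        if k not in seen:
            seen.add(k)
            result.append(r)
    return result

records = [
    {"title": "Hamlet", "edition": "First Quarto",  "library": "A"},
    {"title": "Hamlet", "edition": "First Quarto",  "library": "B"},
    {"title": "Hamlet", "edition": "Second Quarto", "library": "A"},
]

# The student's broad equality: any copy of Hamlet is the same Hamlet.
print(len(dedup(records, key=lambda r: r["title"])))                  # 1

# The scholar's narrow equality: editions differ, holding libraries do not.
print(len(dedup(records, key=lambda r: (r["title"], r["edition"]))))  # 2
```

Under the broad key the scholar's distinction is lost, and under the narrow key the student's list grows; the interesting research question is choosing the key to suit the searcher.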
The complete proposal may be found at
As part of this thesis, Jeremy generated a new set of Technical Report bibliographic records from the publications databases. The new records are an improvement over the previous batch because the publications databases themselves have been improved. For example, the AI Laboratory records now include authors' full names instead of last name and initial; we also may have information about grants and retrieval. This work was undertaken to provide a body of material on which to perform deduplication experiments. Jeremy also did some preliminary work on a tool for finding duplicate bibliographic records in the MIT card catalog and in the CSTR database. The tool manipulates MARC records in OCLC format and RFC 1357 records, and produces merged RFC 1357 records as output. The thesis is expected to be completed during the fall of 1995.
The complete demonstration, usable from any modern World-Wide Web browser, can be found at
1. ID.url.ltt-ns.lcs.mit.edu returns a URL to the place that images of the specified document can be found, if images exist.
2. ID.formats.ltt-ns.lcs.mit.edu returns a list of formats in which the document is available.
The navigation management system now polls the five CS-TR sites for new documents once a week. If a particular site is down at the time of the poll, the system keeps the previous week's data for that site and skips updating for that week. The old data are also retained if a server suddenly "loses" many documents; that is, if the new list of documents is significantly smaller than the previous list.
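The retention rule can be sketched as a small decision function. The 50% shrink threshold below is an assumption for illustration; the system as described says only that the new list must be "significantly smaller" to be rejected:

```python
def updated_list(previous, fetched, shrink_threshold=0.5):
    """Decide which document list to keep after a weekly poll of one site.

    Keep last week's list if the site was unreachable (fetched is None),
    or if the new list is suspiciously smaller than the old one, which
    suggests the server has temporarily "lost" documents.
    """
    if fetched is None:                  # site was down at poll time
        return previous
    if previous and len(fetched) < shrink_threshold * len(previous):
        return previous                  # server suddenly "lost" documents
    return fetched

old = ["tr-1", "tr-2", "tr-3", "tr-4"]
print(updated_list(old, None))           # site down: keep old list
print(updated_list(old, ["tr-1"]))       # 75% shrink: keep old list
print(updated_list(old, old + ["tr-5"])) # normal growth: accept new list
```

The design choice is conservative: a transient failure at one site degrades freshness for a week but never silently empties the navigation database.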
Andrew also developed a World-Wide Web demonstration of the navigation service. This demonstration includes both an overview and a technical description of how the navigation service works and what it is used for. The demo allows one to look up any Technical Report that is on-line using its RFC 1357 ID. It then returns the URL of the Technical Report, as well as the formats available. In addition, it shows the actual DNS queries used to obtain the information. There is also a second demonstration which shows the practical use of the navigation service to find out information, as opposed to querying the server for the Technical Report each time; the former is more than an order of magnitude faster. This demonstration can be found at
A current hypothesis of our replication research is that the underlying stored objects should be thought of as immutable. Immutability gives the opportunity to use a long checksum of the data itself as the unique name of the object, for example as the opaque string that appears in its URN. Mitchell developed an experimental system that names objects in this way. In addition, this experimental system provides a testbed in which we can launch various bit-rotting demons and watch how well the detection and repair algorithms cope.
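A minimal sketch of the naming idea follows. SHA-256 stands in here for whatever checksum the experimental system actually uses, which this report does not specify. The useful property is that the same checksum that names an immutable object also detects a bit-rotted copy on retrieval:

```python
import hashlib

class ImmutableStore:
    """Store immutable objects under the hex digest of their own contents."""

    def __init__(self):
        self._objects = {}

    def put(self, data: bytes) -> str:
        # The digest serves as the object's unique name, e.g. the
        # opaque string appearing in its URN.
        name = hashlib.sha256(data).hexdigest()
        self._objects[name] = data
        return name

    def get(self, name: str) -> bytes:
        data = self._objects[name]
        # Integrity check: a copy whose digest no longer matches its
        # name has suffered bit rot and must be repaired from a replica.
        if hashlib.sha256(data).hexdigest() != name:
            raise ValueError("bit rot detected in object " + name)
        return data

store = ImmutableStore()
name = store.put(b"scanned page image bytes")
assert store.get(name) == b"scanned page image bytes"

# Simulate a bit-rotting demon corrupting the stored copy in place:
store._objects[name] = b"Scanned page image bytes"
try:
    store.get(name)
except ValueError:
    print("corruption detected")
```

Note that immutability is essential to the scheme: if objects could change, their names would have to change with them, and a URN could not remain stable for decades.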
Eytan Adar and Jeremy Hylton. On-the-fly Hyperlink Creation for Page Images. Proceedings of the Second International Conference on the Theory and Practice of Digital Libraries, College Station, Tex., June 11-13, 1995.