Annual Progress Report

Library 2000
July 1, 1995--June 30, 1996

Academic Staff

Jerome H. Saltzer

Research Staff

Mitchell N. Charity

Graduate Students

Jeremy A. Hylton

Undergraduate Students

Eytan Adar

Gillian D. Elcock

Geoff M. Lee Seyon

M.I.T. Library Staff

T. Gregory Anderson

Michael W. Cook

Lindsay J. Eisan, Jr.

Keith Glavash

Support Staff

Rachel Bredemeier

1. Introduction

Library 2000 is a computer systems research project that is exploring the implications of large-scale on-line storage using the future electronic library as an example. The project is pragmatic, developing a prototype using the technology and system configurations expected to be economically feasible in the year 2000. Support for Library 2000 has come from grants from the Digital Equipment Corporation; the IBM Corporation; DARPA, via the Corporation for National Research Initiatives; and uncommitted funds of the Laboratory for Computer Science.

The basic hypothesis of the project is that the technology of on-line storage, display, and communications will, by the year 2000, make it economically possible to place the entire contents of a library on-line and accessible from computer workstations located anywhere. The goal of Library 2000 is to understand the system engineering required to fabricate such a future library system. The project's vision is that one will be able to browse any book, journal, paper, thesis, or report using a standard office desktop computer, and follow citations by pointing--the thing selected should pop up immediately in an adjacent window.

Our research problem is not to invent or develop any of the three evolving technologies, but rather to work through the system engineering required to harness them in an effective form. The engineering and deployment of large-scale systems is always accompanied by traps and surprises beyond those apparent in the component technologies. Discovering relevant documents, linking archival materials together (especially across the network), and maintaining integrity of an archive for a period of decades all require new ideas.

This is the fifth and final reporting year of the project. The first four annual reports laid out the vision, described an overall architecture, and reported the development of a production scanning system. This year we report on the wrap-up of the project, including plans for preservation of the acquired data and transition of the technology, services, and acquired data to more permanent institutions.

2. High Resolution Production Scanning

One of our cornerstone hypotheses is that scanned images of documents will be an important component of many future electronic libraries. Last year, our production scanning system moved from experimental scanning to real production, and worked its way past several bottlenecks. This year, the production system continued to gradually improve its capacity, and a long-term preservation component was added.

Our overall scanning strategy is to locate the earliest-generation originals that are available, and extract the maximum amount of information possible with production-capable hardware. Two scanners can acquire 400 pixels per inch with 8 bits of grey scale; this resolution produces image files that are quite large, about 15 Megabytes per scanned page, thus requiring careful organization of storage and workflow. The scanning system acquires about 4 Gigabytes of raw data each day. Following each day's work, unattended overnight jobs archive the raw data to Digital Audio Tape and transfer it out of the scanning station to a server site for additional processing before storing it for distribution. This post-scan processing consists in reducing the data to an agreed-upon standard interchange form: 600 pixels per inch, one bit per pixel, using CCITT Group IV FAX compression in TIFF format, and also to a 100 pixel per inch, 5-bit form for fast display and quality control. These two forms remain on-line; the archive DAT retains the high-resolution data for future use. Near the end of the year, we began transferring the raw data from archive DAT to recordable CD, a more permanent storage medium.
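The image size quoted above can be checked with a little arithmetic. The sketch below assumes a US letter page (8.5 by 11 inches), which this report does not state; the function name is ours and is for illustration only.

```python
def raw_image_bytes(width_in, height_in, pixels_per_inch, bits_per_pixel):
    """Size in bytes of an uncompressed scanned page image."""
    width_px = round(width_in * pixels_per_inch)
    height_px = round(height_in * pixels_per_inch)
    return width_px * height_px * bits_per_pixel // 8


# Assumed page size: US letter, 8.5 x 11 inches (not stated in the report).
grey_scan = raw_image_bytes(8.5, 11, 400, 8)  # raw scan: 400 ppi, 8-bit grey scale
bitonal = raw_image_bytes(8.5, 11, 600, 1)    # 600 ppi, 1 bit, before Group IV compression

print(f"raw grey-scale page: {grey_scan / 1e6:.1f} Megabytes")
print(f"uncompressed 600 ppi bitonal page: {bitonal / 1e6:.1f} Megabytes")
print(f"pages per day at 4 Gigabytes/day: {4e9 / grey_scan:.0f}")
```

Under that page-size assumption the raw scan comes to just under 15 Megabytes per page, in line with the figure given above.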

2.1 Production Rate

The average scanning rate at the beginning of the year was about 1,200 pages per week; it peaked at 1,600 pages per week and then declined as we began handling 30-year-old materials, which had more complex paper handling and readability problems.

Scanning production                 reports   pages   pages/    Gbytes
                                                       week   archived
  Cumulative total, June 30, 1995     499    29,430              400

        July 1-September 30, 1995     106    13,987   1,166
     October 1- December 30, 1995     246    19,538   1,628
        January 1- March 31, 1996     146    16,980   1,415
             April 1-May 30, 1996      53     8,246   1,030

   Cumulative total, May 30, 1996   1,050    88,181            1,320

Note that twice as many pages were scanned this year as in the two preceding years combined. This increase confirms that the major production rate increases achieved during the 1995 reporting year were sustained this year. The number of reports scanned did not increase quite as rapidly, because the removal of production bottlenecks allowed us to scan longer reports, which we had previously been avoiding.

Once scanning operations had reached full stride, near mid-year, we decided to turn our attention to a problem we had deliberately deferred: scanning the oldest reports in the LCS and AI archives, dating from the mid-1960's. These reports are challenging for several reasons. First, they have been in storage longest and have endured more traumatic experiences than have recent materials. Second, they are generally not printed on acid-free paper, so they are beginning to turn yellow. (Fortunately, deterioration is not so advanced that they crumble when handled.) Finally, they are printed with older technologies (hectograph, mimeograph, and early offset) so there is a wide variation in appearance. As expected, the older materials slowed down production significantly. In addition, some of the poorly-printed materials required hand adjustment or post-scan touchup to be readable. After verifying that hand touch-up was feasible, we decided to defer further work on those materials, because they were absorbing too much personnel time.

The next step of the operation is post-scan processing, reducing and converting the image form and placing the processed form on-line for public access. In addition to images obtained from the production scanning operation, we receive authoritative PostScript for most new technical reports, and create image forms for them by direct conversion.

Reports placed on-line:
                                       reports       pages

  Cumulative total, June 30, 1995         219       13,690
        July 1-September 30, 1995          66        7,200
     October 1- December 30, 1995         145        9,000
        January 1- March 31, 1996         139       15,900
             April 1-May 30, 1996         101       15,733
  Cumulative total, June 30, 1996         570       61,523

The difference between the cumulative total of on-line reports and the cumulative total of scanned reports (from the previous table) is that the scanned images for about 500 reports currently reside only on archive DAT; these are being systematically reloaded and run through the on-line processing system.

Quality control reviews of scanned materials are part of the standard workflow; the review process is finding a small number of problems with the scanned images, but for the most part it appears that the procedures in use produce materials that pass review.

3. Preservation Copying

As indicated by the numbers reported above, we have accumulated off-line some 1.3 Terabytes of raw, high-resolution, grey-scale image data. (Because of our use of 400 pixel per inch grey-scale scanning, this is probably a substantially larger repository than the other CS-TR sites accumulated.) One of our concerns is to assure that this data is preserved for future use.

In our standard scanning workflow, the high-resolution image data is processed down to a lower-resolution form for on-line delivery, display, and printing with current technology, and the original high-resolution image data is copied to Digital Audio Tape (DAT) for future use when on-line storage capacities and processing speeds increase. Since DAT has an expected media lifetime of only a few years, and a probable technology lifetime of no more than a decade, longer-term preservation of the data requires transfer to a more stable medium.

Near the end of the year, we concluded that recordable CD (CD-R) has a suitable balance of properties for preservation: the increasingly popular use of CD-ROM's for computer applications ensures that reading technology will be available for a long time, and studies suggest that the medium itself has a possible lifetime of 70 to 100 years. The lifetime requirement at hand is probably only a decade or two, but no other off-line medium provides an intermediate capability, CD-R costs are relatively low, and the projected lifetime provides a margin of safety. We therefore began transferring the data from DAT to CD-R, using a Yamaha/APS 4X CD recorder attached to the Macintosh PowerPC 8500. For recording software we are using Astarte "Toast CD-ROM Pro", version 3.0.

3.1 Blank media

Three recent white papers have explored issues of media compatibility and longevity. A paper from Los Alamos Scientific Laboratory examined error rates of disk blanks from different vendors, of different types, and using different recorders. The primary conclusion from that study is that disks manufactured using cyanine dye (sometimes called gold/green disks) have lower error rates when written at slow recording speeds, and that disks manufactured using phthalocyanine dye (sometimes called gold/gold disks) have fewer problems when written at higher recording speeds.

A second white paper, from TDK, describes heat-accelerated life tests on cyanine-dye disks, and the third white paper, from Kodak, describes heat-accelerated life tests on phthalocyanine-dye disks. These studies suggest that cyanine-dye disks may have a lifetime of 70 years or more if they are stored in suitably dark conditions, and that phthalocyanine-dye disks may have a lifetime of 100 years or more and are less sensitive to bright light.

Anecdotal evidence from on-line news groups suggests that some vendors of cyanine-dye blanks have had runs of bad media. Taking all these considerations into account, we have chosen to adopt phthalocyanine-dye disks as a standard. This choice currently restricts us to two original equipment vendors, Mitsui and Kodak. Since Kodak disks are manufactured with an extra layer of protective lacquer and with a machine-readable bar-code containing the disk sequence number, we have specified the use of Kodak blank media.

3.2 Data format

We have chosen to record data on the CD in the ISO 9660 level 1 format, an interoperable format that can be mounted as a file system by most current computer systems equipped with a CD reader. We are recording image files in TIFF, a self-describing format that is in widespread use. Text files are each recorded three times using three different end-of-line conventions: one for the Macintosh (carriage-return characters separate lines), one for UNIX platforms (line-feed characters separate lines), and one for DOS/Windows platforms (carriage-return and line-feed characters separate lines). In addition to the original scan record for each document and the scan record specification, a text file describing the background of the project, the workflow and software that we used, and the format and naming conventions used on the CD is written on every CD as a file named "README".
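As an illustration of the three end-of-line conventions, the following sketch produces all three variants of a text. It is not the script used in the project; the function name and dictionary keys are hypothetical.

```python
def eol_variants(text):
    """Return the same text under the three end-of-line conventions."""
    # Normalise whatever convention the input uses to bare line feeds first.
    unix = text.replace("\r\n", "\n").replace("\r", "\n")
    return {
        "macintosh": unix.replace("\n", "\r"),  # carriage return
        "unix": unix,                           # line feed
        "dos": unix.replace("\n", "\r\n"),      # carriage return + line feed
    }


variants = eol_variants("Library 2000\nScan record\n")
for platform, body in variants.items():
    print(platform, repr(body))
```

When writing such variants to disk from Python, opening the file with newline="" keeps the runtime from translating the line endings a second time.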

The goal of these details is that each CD be completely self-describing, and we have in addition attempted to provide enough information that if an error in workflow or a bug in the software has muddled the data, a future reader has a reasonable chance of figuring out what happened and possibly compensating. We have verified that the CD's are readable, the file systems are mountable, the text files can be read with any standard editor, and the image files can be opened and displayed properly on a Macintosh, an AIX/UNIX system, and a DOS/Windows 3.1 system.

3.3 Workflow

The overall workflow involves restoring the contents of one or more Digital Audio Tapes to the hard disk of the Macintosh, then rearranging and renaming the files in accordance with ISO 9660 standards. Mitchell Charity developed a script that does this file rearrangement and renaming, taking into account the problem that most technical reports occupy more than one CD. One CD can hold 43 of our high-resolution images (we chose not to compress the images, to avoid the requirement that a future reader figure out how to decompress them.)
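The number of CD's a report needs follows directly from the 43-images-per-CD figure. The sketch below assumes one image per page and that each report starts on a fresh CD; the report does not spell out either point, and the page counts are made up for illustration.

```python
from math import ceil

IMAGES_PER_CD = 43  # figure given above for uncompressed high-resolution images


def cds_for_report(page_count, images_per_cd=IMAGES_PER_CD):
    """CD's needed for one report, assuming one image per page."""
    return ceil(page_count / images_per_cd)


def cds_for_collection(page_counts):
    """Total CD's, assuming every report starts on a fresh CD."""
    return sum(cds_for_report(pages) for pages in page_counts)


example_page_counts = [30, 43, 44, 200]  # hypothetical reports
print([cds_for_report(p) for p in example_page_counts])
print(cds_for_collection(example_page_counts))
```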

Approximately 2,600 blank CD's will be required to complete the preservation project. Our plan is that the Document Services office will continue to perform restores from DAT and recording to CD as a background activity in the months after the contract with CNRI has expired. To this end, we attempted, in the last months of the project, to use our remaining funds to acquire blank media. As it happens, a world-wide shortage of blank CD-R media developed just before we began that acquisition, and it was necessary to do considerable shopping to locate supplies. We now have commitments from suppliers that appear to be sufficient to exhaust available funds. Approximately 1,600 CD blanks are either in hand or scheduled for delivery; we intend to ask the Laboratory for Computer Science for funds to acquire the remaining blanks.

3.4 CD labeling

Labeling of CD's used for preservation turned out to be a problem. There are two techniques commonly used in the industry, but both are questionable when it comes to long-term archiving. Stick-on labels are available, but they have not been developed with lifetimes comparable to that of the CD media, and there is risk that they (or their glue) will deteriorate. Writing directly on the surface of the CD with a felt-tip pen is the other commonly-used approach, but there are reports of damage to the CD (and consequent loss of data) from some ink solvents. In another white paper on the topic of data preservation, Kodak recommends that only pens specifically recommended for the purpose by the CD blank manufacturer be used. But when asked for a recommendation of a pen compatible with its own disk blanks, Kodak hesitated to identify any specific pen that it considers safe.

In response to these labeling problems, we chose to leave the CD's unlabeled. Instead, we use CD blanks that come with unique manufacturer-supplied serial numbers in both human-readable and bar-code forms, and create a paper label for the CD jewel box that identifies the technical report and connects it with the serial number on the CD. For this purpose, Mitchell Charity developed a web page form that accepts a report number, recording date, and CD serial number, and returns an appropriate PostScript label with matching human-readable and bar-code serial numbers for printout and insertion in the CD jewel box. These labels are printed on acid-free paper and because they do not touch the CD blank, we anticipate that they will not compromise its longevity.

4. Continued Dissemination

A second primary area of transition planning that we pursued during the year was arranging for continuing dissemination over the Internet, via Dienst and FTP, of about 50 Gigabytes of processed, display-ready and print-ready page images.

The server that provides public dissemination of the reports consists of an IBM RS-6000 computer acquired for this project; it contains 512 Megabytes of RAM and something over 50 Gigabytes of disk storage. We have been working on a plan under which this computer would be transferred to MIT Information Services (IS), and the library and IS would jointly take over its operation and care of the data so as to provide continued dissemination of the page images. This proposal is under consideration by both organizations.

In connection with both preservation and continued dissemination, we inquired about the plans of the other four CS-TR project participants. We have not heard from Carnegie-Mellon, but Stanford and Berkeley have both been working out arrangements to transfer these two jobs to their respective libraries. At Cornell, the Computer Science department is taking responsibility for both jobs following the end of the CS-TR funding.

5. Research Activities

5.1 Citation accumulation, deduplication, and automatic linking.

Jeremy Hylton completed his Master of Engineering Thesis in February. Here is the abstract of the finished thesis:

"Bibliographic records freely available on the Internet can be used to construct a high-quality, digital finding aid that provides the ability to discover paper and electronic documents. The key challenge to providing such a service is integrating mixed-quality bibliographic records, coming from multiple sources and in multiple formats. This thesis describes an algorithm that automatically identifies records that refer to the same work and clusters them together; the algorithm clusters records for which both author and title match. It tolerates errors and cataloging variations within the records by using a full-text search engine and an n-gram-based approximate string matching algorithm to build the clusters. The algorithm identifies more than 90 percent of the related records and includes incorrect records in less than 1 percent of the clusters. It has been used to construct a 250,000-record collection of the computer science literature. This thesis also presents preliminary work on automatic linking between bibliographic records and copies of documents available on the Internet."

Jeremy has made a demonstration of this system available on the World-Wide Web, at the URL

    http://cslib.lcs.mit.edu/cslib/

As indicated in the abstract, deduplication is done in two passes, a gross comparison into potential clusters using a full-text search engine, followed by a fine comparison of items within each potential cluster using a string adjacency algorithm. As the collection of citations grew in size to its current 100 Megabytes, it became necessary to switch the search engine used in gross record comparison from Glimpse to the heavier-duty Library 2000 search engine, which is designed to deal with bodies of text in the Gigabyte size range.

For the fine matching pass, simple string comparison was too intolerant of spelling and typographical errors, so Jeremy wrote a string comparison program that uses n-grams (with n set to 3, so it is actually using trigrams) to do approximate comparisons of strings. The program computes a vector magnitude for the difference between any two strings by making a list of all trigrams that appear in either string and deleting from that list all trigrams that appear in both strings. The residual list (which may include repeated trigrams) is treated as a vector with one dimension for each different trigram and length along that dimension equal to the number of times the trigram appears. The program calculates the length of this difference vector and compares it with a threshold that itself is calculated from the vector length of the original strings. If the difference magnitude is less than the threshold, the two strings are declared to be similar, otherwise not.
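The trigram comparison described above can be sketched as follows. This is not Jeremy's code: the lower-casing, the reading of "deleting trigrams that appear in both strings" as a per-trigram difference of counts, and the threshold rule (a fixed fraction of the average magnitude of the two strings) are all assumptions made for illustration.

```python
from collections import Counter
from math import sqrt


def trigrams(text, n=3):
    """Count the n-grams (trigrams by default) in a lower-cased string."""
    text = text.lower()
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))


def magnitude(counts):
    """Euclidean length of a vector with one dimension per distinct trigram."""
    return sqrt(sum(c * c for c in counts.values()))


def difference_magnitude(a, b):
    """Length of the residual vector left after cancelling shared trigrams."""
    ta, tb = trigrams(a), trigrams(b)
    residual = (ta - tb) + (tb - ta)  # per-trigram absolute difference of counts
    return magnitude(residual)


def similar(a, b, ratio=0.5):
    """Declare two strings similar if the difference is below the threshold.

    The threshold rule is an assumption: `ratio` times the average
    magnitude of the two original trigram vectors.
    """
    threshold = ratio * (magnitude(trigrams(a)) + magnitude(trigrams(b))) / 2
    return difference_magnitude(a, b) < threshold


title = "Identifying and Merging Related Bibliographic Records"
typo = "Identifying and Merging Related Bibliografic Records"
other = "Replication for long-term persistence"
print(similar(title, typo), similar(title, other))
```

A single misspelling disturbs only the few trigrams that overlap it, so the residual vector stays short relative to the strings themselves; unrelated strings share almost no trigrams and leave a long residual.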

The result of the two passes is an automatic grouping of the raw citations into about 160,000 clusters of closely related citations (the same paper cited several different ways, or published in different places). The result of a query to this system is a list of composite records, one for each cluster that contains things that satisfy the query. Each composite record merges information from the various underlying citations, with different merge strategies for different fields. In case the user decides the item described in the composite record is of interest, it also contains links to whichever repositories are likely to contain an on-line version (based on the types of documents in the underlying raw records) and to nearby library catalogs that may lead to physical copies. The system also provides the ability to look at the underlying raw records, if there is any question.

Jeremy also studied false matches made by the duplicate detection algorithm. False matches occur when the matching algorithms place two actually unrelated records in the same cluster. Since most uses of clusters would merge clustered records together, such false matches would effectively hide one or the other of the falsely matched records, or lead to creation of a misleading composite record. To mitigate this problem, he investigated various methods of identifying those sets of duplicate records that contain an unusually wide variation in the source records.

The thesis is available in PostScript, gzipped PostScript, and text (without the figures), and as LCS Technical Report MIT/LCS/TR-678. Links to each of these forms can be found under "publications" and then "theses" on the Library 2000 home page. The Digital Index for Works in Computer Science demonstrates the results of the thesis. It is an index of 255,000 bibliographic records, which have been processed to identify records that describe the same work. A link to that demonstration will be found in the same place.

We have also performed a significant act of technology transfer; after completing his thesis, Jeremy moved to the Baltimore/Washington area and is now on the technical staff at our sponsor, the Corporation for National Research Initiatives.

5.2 Document Tracking System

Over the last year Jeremy Hylton, with much input from the scanning operations team, has developed a Document Tracking System with a World-Wide Web interface. This system helps us manage the scanning operation by identifying major milestones in the scanning process, providing a shared database that records when those milestones have been reached, and giving World-Wide Web browsers access to the current state of the database. The two currently-used views of the database are a detailed log of action on a particular technical report and a table of all milestones achieved for each TR in a particular series.

The summary of milestones for an entire series is a particularly valuable way to track scanning progress and identify potential problems. Each milestone, e.g. scanning completed, original document returned, etc., is displayed horizontally across the table, along with the date the last milestone was reached. This view helps identify reports that have stalled somewhere in the scanning process -- for example, a report that was scanned but never processed for distribution. The table summary also makes it possible to identify which reports have been scanned -- and which reports have been skipped for one reason or another.

The tracking system has been in production use at Document Services since January, and we incorporated it into our processing programs this spring, but we have not described it in any detail until now.

When we first designed the system, we identified 7 milestones that each document would reach as it was moved through the scanning operation:

Original located

Original delivered to Document Services

Report scanned and taped

Original returned

FTP from Document Services and processed for quality control

Quality control #1: The scanned images were compared with a printed copy of the document to verify they were scanned correctly.

Quality control #2: The image processing routine correctly created the images for distribution and placed them in our on-line delivery site.

When each milestone is reached, a user (or user program) must check off that entry in the database. The database records who checked off a particular milestone and when they did so.

We also included some extra fields in the database:

an error checkoff that would flag the report as being a problem
a comment field to describe errors or other issues more fully
a rough page count (from the bibliographic citation) for use in generating reports
the id of the tape the report was stored on

In practice, we have not updated all of the checkoff fields. (It is not clear if we identified milestones that did not need to be checked off or if we have failed to follow our intended process completely.) One of the problems may be that each milestone is an item that must be marked as completed, but some steps may only be interesting as steps that cannot be completed: If we assume that all originals can be located, then perhaps the "Original located" milestone should actually be a field that is only marked when an original can *not* be located.

The database and the Web interface were implemented in Perl, and use both a DBM file for fast access and a transaction log for long-term reliability.

The system includes a library of Perl functions for performing queries across the database, updating entries, and displaying results. The library is intended to make it easier to develop custom reports based on the database contents -- for example a list of all the reports that have reached the "scanned and taped" milestone along with their page counts.

We did not use an off-the-shelf database program because we were not familiar with any affordable and simple systems that would allow multiple users to read from and write to the database concurrently. Our own implementation was relatively straightforward because of the nature of the database:

each milestone is checked off only once, and nothing is deleted (this database requires just append-only or monotonic-change semantics, which simplifies coordination)
writes to the database occur infrequently (about 100 per week)
it is unlikely that more than one user would try to check off a single milestone
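The properties listed above are what make a simple append-only design workable. The following is a minimal sketch of that idea, not the Perl implementation: it keeps a transaction log on disk and rebuilds an in-memory table from it (standing in for the DBM file). The class, milestone identifiers, log format, and report number are all hypothetical.

```python
import json
import os
import tempfile
import time

MILESTONES = [
    "original_located", "original_delivered", "scanned_and_taped",
    "original_returned", "ftp_and_processed", "qc1", "qc2",
]


class TrackingLog:
    """Append-only milestone log; each milestone is checked off at most once."""

    def __init__(self, path):
        self.path = path
        self.state = {}  # report id -> {milestone: (user, timestamp)}
        if os.path.exists(path):
            with open(path) as log:
                for line in log:  # replay the transaction log
                    self._apply(json.loads(line))

    def _apply(self, entry):
        done = self.state.setdefault(entry["report"], {})
        # Monotonic change: the first checkoff wins, nothing is overwritten.
        done.setdefault(entry["milestone"], (entry["user"], entry["time"]))

    def check_off(self, report, milestone, user):
        """Record a milestone; return False if it was already checked off."""
        if milestone not in MILESTONES:
            raise ValueError(f"unknown milestone: {milestone}")
        if milestone in self.state.get(report, {}):
            return False
        entry = {"report": report, "milestone": milestone,
                 "user": user, "time": time.time()}
        with open(self.path, "a") as log:  # append only, never rewrite
            log.write(json.dumps(entry) + "\n")
        self._apply(entry)
        return True

    def reached(self, milestone):
        """Reports that have reached the given milestone."""
        return sorted(r for r, done in self.state.items() if milestone in done)


path = os.path.join(tempfile.mkdtemp(), "tracking.log")
log = TrackingLog(path)
log.check_off("TR-000", "scanned_and_taped", "operator")
log.check_off("TR-000", "scanned_and_taped", "someone_else")  # ignored
print(TrackingLog(path).reached("scanned_and_taped"))
```

Because entries are only ever appended and a milestone never changes once set, replaying the log in any later session reproduces the same table. This sketch does not lock the file, so it relies on the low write rate and unlikely collisions noted above.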

5.3 Method and apparatus for retrieving information from a database.

Mitchell Charity and Eytan Adar filed a patent application with the above title on the technique that they jointly developed to do an approximation of nearest-match search using repeated invocations of a boolean search engine with randomly-chosen terms from the original search string. Because the patent is in the area of Information Retrieval, in which none of the group has special expertise, there was considerable skepticism that the idea could possibly be new. However, consultation with a few experts in that field suggested that it may actually be a novel idea, so after some agonizing, M. I. T. proceeded with a patent application.
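The general idea of approximating nearest-match search with repeated boolean queries can be illustrated with a toy example. This sketch is only our reading of the one-sentence description above and should not be taken as the method in the patent application: the tiny in-memory index, the number of terms per probe, the number of probes, and the vote-counting rule are all assumptions.

```python
import random
from collections import Counter


def build_index(docs):
    """Map each term to the set of document ids that contain it."""
    index = {}
    for doc_id, text in docs.items():
        for term in set(text.lower().split()):
            index.setdefault(term, set()).add(doc_id)
    return index


def boolean_and_search(index, terms):
    """Plain boolean AND: ids of documents containing every term."""
    sets = [index.get(term, set()) for term in terms]
    return set.intersection(*sets) if sets else set()


def approximate_nearest(index, query, terms_per_probe=2, probes=200, seed=0):
    """Rank documents by how often random sub-queries retrieve them."""
    rng = random.Random(seed)
    terms = sorted(set(query.lower().split()))
    k = min(terms_per_probe, len(terms))
    votes = Counter()
    for _ in range(probes):
        votes.update(boolean_and_search(index, rng.sample(terms, k)))
    return votes.most_common()


docs = {  # hypothetical documents
    "d1": "replication for long term persistence of digital libraries",
    "d2": "scanning technical reports at high resolution",
    "d3": "long term preservation of scanned images on recordable cd",
}
ranking = approximate_nearest(
    build_index(docs), "long term persistence replication digital archive")
print(ranking)
```

No single boolean query for all six terms would match any of these documents, but documents that share many terms with the search string are retrieved by many of the random probes and so rise to the top.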

5.4 Replication

Jerry Saltzer and Mitchell Charity continued to work on design of algorithms for failure detection and accurate repair of an append-only database stored as geographically-separated complete replicas. In talks at Stanford University, the Xerox Palo Alto Research Center, and the Digital Systems Research Center, Jerry Saltzer described a set of algorithms and the considerations in their design. This work is still in progress.

TALKS, PUBLICATIONS, THESES, AND PATENTS:

Talks:

Jerome H. Saltzer. Replication for long-term persistence. Talk given at Stanford University Digital Library group (October 25, 1995), Xerox Palo Alto Research Center (October 24, 1995), and Digital Equipment Corporation Systems Research Center, Palo Alto (October 26, 1995).

Gregory Anderson. The Library and the Network: Barriers, benefits, and outcomes. Seminar presented at the University of Massachusetts, Amherst, MA, December 6, 1995.

Publications:

Gregory Anderson, Rebecca Lasher, and Vicky Reich. The Computer Science Technical Report (CS-TR) Project: Considerations from the Library Perspective. Stanford University Technical Report CS-TR-95-1554, and M. I. T. Laboratory for Computer Science Technical Report 693. (A revised version of this report was also published in the on-line journal Public Access Computer Systems Review 7(2), 1996.)

Jeremy Hylton. Access and Discovery: Issues and Choices in Designing DIFWICS. D-Lib Magazine (March, 1996 issue), Corporation for National Research Initiatives.

Theses:

Jeremy A. Hylton. Identifying and Merging Related Bibliographic Records. Master of Engineering thesis, M. I. T. Department of EECS, June, 1996 (actually completed in February, 1996). To appear as LCS Technical Report MIT/LCS/TR-678.

Patents filed:

Mitchell Charity and Eytan Adar. Method and apparatus for retrieving information from a database. Filed June, 1996.

Last Updated: 17 August 1996