The basic hypothesis of the project is that the technology of on-line storage, display, and communications will, by the year 2000, make it economically possible to place the entire contents of a library on-line and accessible from computer workstations located anywhere. The goal of Library 2000 is to understand the system engineering required to fabricate such a future library system. The project's vision is that one will be able to browse any book, journal, paper, thesis, or report using a standard office desktop computer, and follow citations by pointing--the thing selected should pop up immediately in an adjacent window.
Our research problem is not to invent or develop any of the three evolving technologies, but rather to work through the system engineering required to harness them in an effective form. The engineering and deployment of large-scale systems is always accompanied by traps and surprises beyond those apparent in the component technologies. Discovering relevant documents, linking archival materials together (especially across the network), and maintaining integrity of an archive for a period of decades all require new ideas.
This is the fifth and final reporting year of the project. The first four annual reports laid out the vision, described an overall architecture, and reported the development of a production scanning system. This year we report on the wrap-up of the project, including plans for preservation of the acquired data and transition of the technology, services, and acquired data to more permanent institutions.
Our overall scanning strategy is to locate the earliest-generation originals that are available, and extract the maximum amount of information possible with production-capable hardware. Two scanners can acquire 400 pixels per inch with 8 bits of grey scale; this resolution produces image files that are quite large, about 15 Megabytes per scanned page, thus requiring careful organization of storage and workflow. The scanning system acquires about 4 Gigabytes of raw data each day. Following each day's work, unattended overnight jobs archive the raw data to Digital Audio Tape (DAT) and transfer it out of the scanning station to a server site for additional processing before storing it for distribution. This post-scan processing consists of reducing the data to an agreed-upon standard interchange form (600 pixels per inch, one bit per pixel, using CCITT Group IV FAX compression in TIFF format) and also to a 100 pixel per inch, 5-bit form for fast display and quality control. These two forms remain on-line; the archive DAT retains the high-resolution data for future use. Near the end of the year, we began transferring the raw data from archive DAT to recordable CD, a more permanent storage medium.
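The storage figures quoted above can be checked with a little arithmetic. The following is only a rough sketch: the report does not state a page size, so a US-letter page (8.5 by 11 inches) is assumed, sizes are uncompressed, and the function name is ours.

```python
# Back-of-the-envelope check of the per-page and per-day figures.
# Assumption (not stated in the report): US-letter pages, 8.5 x 11 inches.
PAGE_WIDTH_IN = 8.5
PAGE_HEIGHT_IN = 11.0


def uncompressed_page_bytes(pixels_per_inch, bits_per_pixel,
                            width_in=PAGE_WIDTH_IN, height_in=PAGE_HEIGHT_IN):
    """Return the uncompressed size in bytes of one scanned page."""
    width_px = round(pixels_per_inch * width_in)
    height_px = round(pixels_per_inch * height_in)
    return width_px * height_px * bits_per_pixel / 8


raw_bytes = uncompressed_page_bytes(400, 8)      # raw scan: 400 ppi, 8-bit grey
bitonal_bytes = uncompressed_page_bytes(600, 1)  # interchange form, before Group IV compression
preview_bytes = uncompressed_page_bytes(100, 5)  # fast-display form

# About 4 Gigabytes of raw data per day, taken here as 4e9 bytes.
pages_per_day = 4e9 / raw_bytes

print(f"raw page:      {raw_bytes / 1e6:.1f} MB")
print(f"bitonal page:  {bitonal_bytes / 1e6:.1f} MB uncompressed")
print(f"preview page:  {preview_bytes / 1e6:.2f} MB")
print(f"pages per day: {pages_per_day:.0f}")
```

Under these assumptions a raw page comes to just under 15 Megabytes, in line with the figure above, and 4 Gigabytes per day corresponds to somewhat under 300 pages per day.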
Scanning production

                                   reports    pages   pages/week   Gbytes archived
Cumulative total, June 30, 1995        499   29,430          400
July 1-September 30, 1995              106   13,987        1,166
October 1-December 30, 1995            246   19,538        1,628
January 1-March 31, 1996               146   16,980        1,415
April 1-May 30, 1996                    53    8,246        1,030
Cumulative total, May 30, 1996       1,050   88,181        1,320
Note that twice as many pages were scanned this year as in the two preceding years combined. This increase confirms that the major production rate increases achieved during the 1995 reporting year were sustained this year. The number of reports scanned did not increase quite as rapidly, because the removal of production bottlenecks allowed us to scan longer reports, which we had previously been avoiding.
Because scanning operations had reached full stride, as mentioned above, near mid-year we decided to turn our attention to a problem we had deliberately deferred: scanning the oldest reports in the LCS and AI archives, dating from the mid-1960s. These reports are challenging for several reasons. First, they have been in storage longest and have endured more traumatic experiences than have recent materials. Second, they are generally not printed on acid-free paper, so they are beginning to turn yellow. (Fortunately, deterioration is not so advanced that they crumble when handled.) Finally, they are printed with older technologies (hectograph, mimeograph, and early offset), so there is a wide variation in appearance. As expected, the older materials slowed down production significantly. In addition, some of the poorly printed materials required hand adjustment or post-scan touch-up to be readable. After verifying that hand touch-up was feasible, we decided to defer further work on those materials, because they were absorbing too much personnel time.
The next step of the operation is post-scan processing, reducing and converting the image form and placing the processed form on-line for public access. In addition to images obtained from the production scanning operation, we receive authoritative PostScript for most new technical reports, and create image forms for them by direct conversion.
Reports placed on-line:

                                   reports    pages
Cumulative total, June 30, 1995        219   13,690
July 1-September 30, 1995               66    7,200
October 1-December 30, 1995            145    9,000
January 1-March 31, 1996               139   15,900
April 1-May 30, 1996                   101   15,733
Cumulative total, June 30, 1996        570   61,523
The cumulative total of on-line reports is lower than the cumulative total of scanned reports (from the previous table) because the scanned images for about 500 reports currently reside only on archive DAT; these are being systematically reloaded and run through the on-line processing system.
Quality control reviews of scanned materials are part of the standard workflow; the review process is finding a small number of problems with the scanned images, but for the most part it appears that the procedures in use produce materials that pass review.
In our standard scanning workflow, the high-resolution image data is processed down to a lower-resolution form for on-line delivery, display, and printing with current technology, and the original high-resolution image data is copied to Digital Audio Tape (DAT) for future use when on-line storage capacities and processing speeds increase. Since DAT has an expected media lifetime of only a few years, and a probable technology lifetime of no more than a decade, longer-term preservation of the data requires transfer to a more stable medium.
Near the end of the year, we concluded that recordable CD (CD-R) has a suitable balance of properties for preservation: the increasingly popular use of CD-ROM's for computer applications ensures that reading technology will be available for a long time, and studies suggest that the media itself has a possible lifetime of 70 to 100 years. The lifetime requirement at hand is probably only a decade or two, but no other off-line medium provides an intermediate capability, CD-R costs are relatively low, and the projected lifetime provides a margin of safety. We therefore began transferring the data from DAT to CD-R, using a Yamaha/APS 4X CD recorder attached to the Macintosh PowerPC 8500. For recording software we are using Astarte "Toast CD-ROM Pro", version 3.0.
A second white paper, from TDK, describes heat-accelerated life tests on cyanine-dye disks, and the third white paper, from Kodak, describes heat-accelerated life tests on phthalocyanine-dye disks. These studies suggest that cyanine-dye disks may have a lifetime of 70 years or more if they are stored in suitably dark conditions, and that phthalocyanine-dye disks may have a lifetime of 100 years or more and are less sensitive to bright light.
Anecdotal evidence from on-line news groups suggests that some vendors of cyanine-dye blanks have had runs of bad media. Taking all these considerations into account, we have chosen to adopt phthalocyanine-dye disks as a standard. This choice currently restricts us to two original equipment vendors, Mitsui and Kodak. Since Kodak disks are manufactured with an extra layer of protective lacquer and with a machine-readable bar-code containing the disk sequence number, we have specified the use of Kodak blank media.
The goal of these details is that each CD be completely self-describing, and we have in addition attempted to provide enough information that if an error in workflow or a bug in the software has muddled the data, a future reader has a reasonable chance of figuring out what happened and possibly compensating. We have verified that the CD's are readable, the file systems are mountable, the text files can be read with any standard editor, and the image files can be opened and displayed properly on a Macintosh, an AIX/UNIX system, and a DOS/Windows 3.1 system.
Approximately 2,600 blank CD's will be required to complete the preservation project. Our plan is that the Document Services office will continue to perform restores from DAT and recording to CD as a background activity in the months after the contract with CNRI has expired. To this end, we attempted, in the last months of the project, to use our remaining funds to acquire blank media. As it happens, a world-wide shortage of blank CD-R media developed just before we began that acquisition, and it was necessary to do considerable shopping to locate supplies. We now have commitments from suppliers that appear to be sufficient to exhaust available funds. Approximately 1,600 CD blanks are either in hand or scheduled for delivery; we intend to ask the Laboratory for Computer Science for funds to acquire the remaining blanks.
In response to these labeling problems, we chose to leave the CD's unlabeled. Instead, we use CD blanks that come with unique manufacturer-supplied serial numbers in both human-readable and bar-code forms, and create a paper label for the CD jewel box that identifies the technical report and connects it with the serial number on the CD. For this purpose, Mitchell Charity developed a web page form that accepts a report number, recording date, and CD serial number, and returns an appropriate PostScript label with matching human-readable and bar-code serial numbers for printout and insertion in the CD jewel box. These labels are printed on acid-free paper, and because they do not touch the CD blank, we anticipate that they will not compromise its longevity.
The server that provides public dissemination of the reports is an IBM RS-6000 computer acquired for this project; it contains 512 Megabytes of RAM and somewhat more than 50 Gigabytes of disk storage. We have been working on a plan under which this computer would be transferred to MIT Information Services (IS), and the library and IS would jointly take over its operation and the care of the data so as to provide continued dissemination of the page images. This proposal is under consideration by both organizations.
In connection with both preservation and continued dissemination, we inquired about the plans of the other four CS-TR project participants. We have not heard from Carnegie-Mellon, but Stanford and Berkeley have both been working out arrangements to transfer these two jobs to their respective libraries. At Cornell, the Computer Science department is taking responsibility for both jobs following the end of the CS-TR funding.
"Bibliographic records freely available on the Internet can be used to construct a high-quality, digital finding aid that provides the ability to discover paper and electronic documents. The key challenge to providing such a service is integrating mixed-quality bibliographic records, coming from multiple sources and in multiple formats. This thesis describes an algorithm that automatically identifies records that refer to the same work and clusters them together; the algorithm clusters records for which both author and title match. It tolerates errors and cataloging variations within the records by using a full-text search engine and an n-gram-based approximate string matching algorithm to build the clusters. The algorithm identifies more than 90 percent of the related records and includes incorrect records in less than 1 percent of the clusters. It has been used to construct a 250,000-record collection of the computer science literature. This thesis also presents preliminary work on automatic linking between bibliographic records and copies of documents available on the Internet."
Jeremy has made a demonstration of this system available on the World-Wide Web, at the URL
http://cslib.lcs.mit.edu/cslib/
As indicated in the abstract, deduplication is done in two passes: a gross comparison into potential clusters using a full-text search engine, followed by a fine comparison of items within each potential cluster using a string adjacency algorithm. As the collection of citations grew in size to its current 100 Megabytes, it became necessary to switch the search engine used in gross record comparison from Glimpse to the heavier-duty Library 2000 search engine, which is designed to deal with bodies of text in the Gigabyte size range.
For the fine matching pass, simple string comparison was too intolerant of spelling and typographical errors, so Jeremy wrote a string comparison library that uses n-grams (with n set to 3, so it is actually using trigrams) to do approximate comparisons of strings. The library computes a vector magnitude for the difference between any two strings by making a list of all trigrams that appear in either string and deleting from that list all trigrams that appear in both strings. The residual list (which may include repeated trigrams) is treated as a vector with one dimension for each different trigram and length along that dimension equal to the number of times the trigram appears. The program calculates the length of this difference vector and compares it with a threshold that is itself calculated from the vector length of the original strings. If the difference magnitude is less than the threshold, the two strings are declared to be similar; otherwise they are not.
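The trigram comparison just described can be illustrated with a short Python sketch. This is not Jeremy's code. Two details are assumptions of ours: trigrams common to both strings are cancelled occurrence by occurrence (so a trigram that appears twice in one string and once in the other leaves one residual copy), and the threshold is taken to be a fixed fraction (the hypothetical parameter `ratio`) of the average vector length of the two original strings, since the actual threshold formula is not given here.

```python
import math
from collections import Counter


def trigram_counts(text, n=3):
    """Count every n-gram (trigram by default) in a lower-cased string."""
    text = text.lower()
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))


def vector_length(counts):
    """Euclidean length of a vector with one dimension per distinct trigram."""
    return math.sqrt(sum(c * c for c in counts.values()))


def difference_magnitude(a, b):
    """Length of the residual vector left after cancelling shared trigrams."""
    ca, cb = trigram_counts(a), trigram_counts(b)
    # Counter subtraction keeps only positive counts, so this sum is the
    # multiset of trigrams left over on either side.
    residual = (ca - cb) + (cb - ca)
    return vector_length(residual)


def similar(a, b, ratio=0.5):
    """Declare two strings similar if the difference is under the threshold.

    `ratio` is a hypothetical tuning parameter, not a value from the thesis.
    """
    average = (vector_length(trigram_counts(a)) +
               vector_length(trigram_counts(b))) / 2
    return difference_magnitude(a, b) < ratio * average


print(similar("Identifying and Merging Related Bibliographic Records",
              "Identifying and Mergng Related Bibliographic Records"))
print(similar("abcdef", "uvwxyz"))
```

A single dropped letter disturbs only the handful of trigrams that overlap it, so the residual vector stays short relative to the strings themselves; two unrelated strings share few or no trigrams and leave nearly everything in the residual.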
The result of the two passes is an automatic grouping of the raw citations into about 160,000 clusters of closely related citations (the same paper cited several different ways, or published in different places). The result of a query to this system is a list of composite records, one for each cluster that contains items satisfying the query. Each composite record merges information from the various underlying citations, with different merge strategies for different fields. In case the user decides that the item described is of interest, the composite record also contains links to whichever repositories are likely to contain an on-line version (based on the types of documents in the underlying raw records) and to nearby library catalogs that may lead to physical copies. The system also provides the ability to look at the underlying raw records, if there is any question.
Jeremy also studied false matches made by the duplicate detection algorithm. False matches occur when the matching algorithms place two actually unrelated records in the same cluster. Since most uses of clusters would merge clustered records together, such false matches would effectively hide one or the other of the falsely matched records, or lead to creation of a misleading composite record. To mitigate this problem, he investigated various methods of identifying those sets of duplicate records that contain an unusually wide variation in the source records.
The thesis is available in PostScript, gzipped PostScript, and text (without the figures) and as LCS Technical Report MIT/LCS/TR-678. Links to each of these forms can be found under "publications" and then "theses" on the Library 2000 home page.
We have also performed a significant act of technology transfer; after completing his thesis, Jeremy moved to the Baltimore/Washington area and is now on the technical staff at our sponsor, the Corporation for National Research Initiatives.
The summary of milestones for an entire series is a particularly valuable way to track scanning progress and identify potential problems. Each milestone, e.g. scanning completed, original document returned, etc., is displayed horizontally across the table, along with the date the last milestone was reached. This view helps identify reports that have stalled somewhere in the scanning process -- for example, a report that was scanned but never processed for distribution. The table summary also makes it possible to identify which reports have been scanned -- and which reports have been skipped for one reason or another.
The tracking system has been in production use at Document Services since January, and we incorporated it into our processing programs this spring, but we have not described it in any detail until now.
When we first designed the system, we identified 7 milestones that each document would reach as it was moved through the scanning operation:
We also included some extra fields in the database:
The database and the Web interface were implemented in Perl, and use both a DBM file for fast access and a transaction log for long-term reliability.
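The combination of a DBM file and a transaction log can be illustrated as follows. The production system is written in Perl; this Python sketch is only an analogy, and the class name, file names, record layout, report identifier, and dates are all hypothetical. Each update is appended to the log before the DBM file is changed, so the DBM file can be rebuilt from the log if it is ever damaged. Locking for concurrent writers is left out.

```python
import dbm
import json
import os
import tempfile


class TrackingStore:
    """Milestone tracker: a DBM file for fast lookup, a log for recovery."""

    def __init__(self, directory):
        self.db_path = os.path.join(directory, "tracking")
        self.log_path = os.path.join(directory, "tracking.log")

    def record(self, report_id, milestone, date):
        """Note that a report reached a milestone on the given date."""
        entry = {"report": report_id, "milestone": milestone, "date": date}
        # Write the log entry first and force it to disk.
        with open(self.log_path, "a", encoding="utf-8") as log:
            log.write(json.dumps(entry) + "\n")
            log.flush()
            os.fsync(log.fileno())
        with dbm.open(self.db_path, "c") as db:
            self._apply(db, entry)

    def milestones(self, report_id):
        """Return a {milestone: date} mapping for one report."""
        with dbm.open(self.db_path, "c") as db:
            raw = db.get(report_id.encode("utf-8"))
        return json.loads(raw) if raw is not None else {}

    def rebuild(self):
        """Recreate the DBM file from scratch by replaying the log."""
        with dbm.open(self.db_path, "n") as db:
            with open(self.log_path, encoding="utf-8") as log:
                for line in log:
                    self._apply(db, json.loads(line))

    @staticmethod
    def _apply(db, entry):
        key = entry["report"].encode("utf-8")
        raw = db.get(key)
        state = json.loads(raw) if raw is not None else {}
        state[entry["milestone"]] = entry["date"]
        db[key] = json.dumps(state).encode("utf-8")


# Usage with made-up data in a temporary directory.
store = TrackingStore(tempfile.mkdtemp())
store.record("TR-0001", "scanning completed", "1996-01-15")
store.record("TR-0001", "original document returned", "1996-01-22")
print(store.milestones("TR-0001"))
```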
The system includes a library of Perl functions for performing queries across the database, updating entries, and displaying results. The library is intended to make it easier to develop custom reports based on the database contents -- for example a list of all the reports that have reached the "scanned and taped" milestone along with their page counts.
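A custom report of the kind mentioned above might look like the following sketch. It is written in Python rather than Perl and does not reproduce the actual library's interface: the record layout, the function name, the report identifiers, and the page counts are invented for illustration.

```python
def reports_at_milestone(records, milestone):
    """List (report_id, pages) for every report that has reached a milestone.

    `records` maps a report identifier to a dictionary with a "pages" count
    and a "milestones" mapping of milestone name to date (hypothetical layout).
    """
    return sorted(
        (report_id, record["pages"])
        for report_id, record in records.items()
        if milestone in record["milestones"]
    )


# Made-up sample data.
sample_records = {
    "TR-0001": {"pages": 120,
                "milestones": {"scanned and taped": "1996-01-15"}},
    "TR-0002": {"pages": 45,
                "milestones": {}},
    "TR-0003": {"pages": 310,
                "milestones": {"scanned and taped": "1996-02-02"}},
}

taped = reports_at_milestone(sample_records, "scanned and taped")
for report_id, pages in taped:
    print(f"{report_id}  {pages:>5} pages")
print("total pages:", sum(pages for _, pages in taped))
```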
We did not use an off-the-shelf database program because we were not familiar with any affordable and simple systems that would allow multiple users to read from and write to the database concurrently. Our own implementation was relatively straightforward because of the nature of the database:
Gregory Anderson. The Library and the Network: Barriers, benefits, and outcomes. Seminar presented at the University of Massachusetts, Amherst, MA, December 6, 1995.
Jeremy Hylton. Access and Discovery: Issues and Choices in Designing DIFWICS. DLIB magazine (March, 1996 issue) Corporation for National Research Initiatives.
Theses:
Jeremy A. Hylton. Identifying and Merging Related Bibliographic Records. Master of Engineering thesis, M. I. T. Department of EECS, June, 1996 (actually completed in February, 1996). To appear as LCS Technical Report MIT/LCS/TR-678.
Patents filed:
Mitchell Charity and Eytan Adar. Method and apparatus for retrieving information from a database. Filed June, 1996.