Annual Progress Report

Library 2000
July 1, 1994--June 30, 1995

Academic Staff

  • Jerome H. Saltzer

Research Staff

  • Mitchell N. Charity

Graduate Students

  • Jeremy A. Hylton
  • Andrew J. Kass

Undergraduate Students

  • Eytan Adar
  • Gillian D. Elcock
  • Geoff M. Lee Seyon
  • Yoav O. Yerushalmi

Reading Room Staff

  • Paula Mickevich
  • Maria T. Sensale

M.I.T. Library Staff

  • T. Gregory Anderson
  • Michael W. Cook
  • Lindsay J. Eisan, Jr.
  • Keith Glavash
  • Justin Waite

Support Staff

  • Michael Malone
  • Christine M. Molnati

    Introduction

    Library 2000 is a computer systems research project that is exploring the implications of large-scale on-line storage using the future electronic library as an example. The project is pragmatic, developing a prototype using the technology and system configurations expected to be economically feasible in the year 2000. Support for Library 2000 has come from grants from the Digital Equipment Corporation; the IBM Corporation; ARPA, via the Corporation for National Research Initiatives; and uncommitted funds of the Laboratory for Computer Science.

    The basic hypothesis of the project is that the technology of on-line storage, display, and communications will, by the year 2000, make it economically possible to place the entire contents of a library on-line and accessible from computer workstations located anywhere. The goal of Library 2000 is to understand the system engineering required to fabricate such a future library system. The project's vision is that one will be able to browse any book, journal, paper, thesis, or report using a standard office desktop computer, and follow citations by pointing--the thing selected should pop up immediately in an adjacent window.

    Our research problem is not to invent or develop any of the three evolving technologies, but rather to work through the system engineering required to harness them in an effective form. The engineering and deployment of large-scale systems is always accompanied by traps and surprises beyond those apparent in the component technologies. Discovering relevant documents, linking archival materials together (especially across the network), and maintaining integrity of an archive for a period of decades all require new ideas.

    This is the fourth reporting year of the project. The first three annual reports laid out the vision, described an overall architecture, and reported the development of a production scanning system. This year we report on improvements in the production scanning operation, including addition of quality control. We also have made progress on representation of scanned documents, automatic linking mechanisms, and archival replication. In conjunction with the other members of an informal consortium, known as the CSTR group, we have made our scanned images available to the public using the Dienst server software developed by Jim Davis and Carl Lagoze of Cornell University. The other members of the CSTR group are Stanford University, the University of California at Berkeley, Carnegie-Mellon University, Cornell University, and the Corporation for National Research Initiatives.

    High Resolution Production Scanning

    One of our cornerstone hypotheses is that scanned images of documents will be an important component of many future electronic libraries. Last year, together with the Document Services group of the M. I. T. Library System, we developed and began to use a production scanning system. This year we moved from experimental scanning to real production, and worked our way past several bottlenecks.

    The scanning station currently consists of a network-attached Macintosh Quadra 840AV equipped with 64 Megabytes of RAM, 12 Gigabytes of disk, four DAT drives, two SCSI channels, and two SCSI-attached scanners: a Hewlett-Packard IICx and a Fujitsu 3096G. The software used with the Fujitsu scanner is a package named Optix, from Blue Ridge Technologies, Inc.; with the HP scanner we use a package named Cirrus, from Canto Technologies, together with a workflow manager written in AppleScript. Production work is done with the Fujitsu scanner, and the HP scanner is intended for exceptional items such as color photographs. Adjacent to the scanning station is an IBM RS/6000 workstation configured as a quality control display system, and a nearby Hewlett-Packard model 4SiMx, 600 dpi printer also supports quality control.

    Our overall scanning strategy is to locate the earliest-generation originals that are available, and extract the maximum amount of information possible with production-capable hardware. The two scanners can acquire 400 pixels per inch with 8 bits of grey scale; this resolution produces image files that are quite large, about 16 Mbytes per scanned page, thus requiring careful organization of storage and workflow. At the current rate of scanning, we are acquiring about 4 Gigabytes of raw data each day. Following each day's work, an unattended overnight job archives the raw data to Digital Audio Tape and transfers it out of the scanning station to a server site for additional processing before storing it for distribution. Currently, this additional processing consists of reducing the data to an agreed-upon standard interchange form--600 pixels per inch, one bit per pixel, using CCITT Group IV FAX compression in TIFF format--and also to a 100 pixels per inch, 5-bit form for fast display and quality control. These two forms remain on-line; the archive DAT retains the high-resolution data for future use.
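
    To make the reduction step concrete, the following sketch (in Python, using the Pillow imaging library) shows how a single 400 dpi, 8-bit greyscale page might be converted into the two derived forms described above. It is an illustration only, not the project's actual conversion software; the file names, the binarization threshold, and the grey-level quantization are assumptions.

      # Illustrative sketch only: convert one 400 dpi, 8-bit greyscale scan into
      # (a) the 600 dpi, 1-bit CCITT Group IV TIFF interchange form and (b) the
      # 100 dpi, 5-bit (32 grey level) form for fast display and quality control.
      # File names, the threshold, and the quantization step are assumptions.
      from PIL import Image

      def reduce_scan(raw_path: str) -> None:
          src = Image.open(raw_path).convert("L")        # 400 dpi, 8 bits/pixel
          w, h = src.size

          # 600 dpi, 1 bit/pixel interchange form.
          big = src.resize((w * 600 // 400, h * 600 // 400), Image.LANCZOS)
          bilevel = big.point(lambda v: 255 if v > 128 else 0).convert("1")
          bilevel.save(raw_path + ".g4.tif", format="TIFF",
                       compression="group4", dpi=(600, 600))

          # 100 dpi, 5 bits/pixel form (quantized to 32 grey levels).
          small = src.resize((w * 100 // 400, h * 100 // 400), Image.LANCZOS)
          small = small.point(lambda v: (v // 8) * 8)
          small.save(raw_path + ".qc.tif", format="TIFF", dpi=(100, 100))

      if __name__ == "__main__":
          reduce_scan("page0001.tif")                    # hypothetical file name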

    Production Rate

    A major emphasis this year has been to increase the scanning rate by removing bottlenecks from the scanning process. The average scanning rate at the beginning of the year was about 150 pages per week and at the end of the year about 1250 pages per week. Our progress can be measured in the following table:
    Scanning production                    reports      pages    pages/week    Gbytes archived

      Cumulative total, June 30, 1994          40        3900                        40

            July 1-September 30, 1994          62        2161        165
           October 1-December 31, 1994         86        2800        215
             January 1-March 31, 1995         146        8440        650
                April 1-June 30, 1995         165       12086        930

      Cumulative total, June 30, 1995         499       29430                       400

    Improvements in scanning rate have involved several steps, many of them seemingly mundane or straightforward, but together they illustrate the intricacy of a large-scale scanning activity. The major ones were:

  • Physical paper transport was formalized. Forms were developed for document and process control, and after some refinement they are now in use between the Laboratory for Computer Science, the Artificial Intelligence Laboratory, and Document Services. A regular schedule for document pick-up and transport was devised to keep things flowing.
  • Measurement and analysis of network packet flow to determine why file transfer was slower than expected. With the help of a packet trace and special versions of the file transfer program provided by its Australian author, we established that the network hardware and network drivers of the Macintosh operating system are actually very highly tuned, allowing it to drive a 10 Mbit/sec Ethernet at essentially full capacity. On the other hand, the shared file system is exceptionally clumsy and turns out to be the bottleneck on large-file transfers. A new Macintosh operating system update, called System Update 1, provided some major relief in this area, with the result that our file transfer performance increased by about a factor of four, to nearly 2 Mbit/sec. However, dividing 4 Gbytes/day by 2 Mbits/sec indicates that 16,000 seconds/day (about 5 hours) are still required for the overnight transfer of that much data--and then only if nothing goes wrong.
  • Development of unattended operation scripts to do archiving to tape and file transfer overnight.
  • Expansion from one to four tape drives, to allow overnight archiving without changing tapes.
  • Expansion of disk storage from 4 Gbytes to 12 Gbytes, to hold a full day's scanning and allow for occasional mishaps during unattended archiving and file transfer.
  • Automation of post-scan processing. In addition to initiating the file transfer operation, this processing includes converting the 400 dpi, 8-bits/pixel originals to a 600 dpi, 1-bit/pixel form suitable for laser printing and placement on the server, and also to a 100 dpi, 5-bits/pixel form for quality control. These operations, previously done manually, are now automatically triggered each night by the presence of completely scanned documents on the scanning station. The post-scan processing system is now fully automated, and is designed to permit intensive quality control.

    Our next target is 2000 pages per week, but a difficult bottleneck that has recently emerged is the frequency of paper feed jams. In many cases the originals of the TR's have warped slightly during storage, or for some reason stick together, leading to jams in the automatic paper feeder of the scanner. We have seen this problem coming for some time and have tried a number of strategies to deal with it, so far without success.

    During the year, we began placing the scanned technical reports on-line for public access. In addition to images obtained from the production scanning operation, we are now receiving authoritative PostScript for many new technical reports, and are converting those directly to image form. The numbers currently available are:

      Source                                 reports      pages
      Post-scan processing                       157       9300
      PostScript conversion                       58       4600
      Total                                      215      13900
    

    These numbers indicate that not all scanned reports have been placed on-line. Images are scanned during the day, and that night, during unattended operation, we make archive copies on Digital Audio Tape (DAT) and transfer the image files by FTP to be processed for on-line placement. However, if for some reason unattended FTP is not successful within 24 to 48 hours, we have chosen to delete the scanned images from disk to make room for the next day's scanning--typically we generate four to six gigabytes of raw data per day--rather than hold up scanning until post-processing is complete. This choice allowed us to attack scanning bottlenecks and post-processing bottlenecks separately. The strategy seems to have been successful, as shown by the rapidly increasing production rate. By year's end, individual reports of up to four gigabytes (250 pages) were being FTP'ed and processed. The miss rate has been declining, and missed reports will eventually be restored from DAT and processed retrospectively.

    Some lessons learned in scanning

    As expected, we have found that the oldest technical reports are also the most problematical to scan. First-generation originals are missing from the paper files, ink has faded, paper has yellowed, and other strange problems crop up with much higher frequency on thirty-year-old documents. Since the emphasis this year was on increasing the scanning production rate, we have modified our strategy to work backward from the most recent documents rather than forward from the oldest. Once we have achieved a satisfactory production rate we expect to introduce old documents into the production line to get more insight into the disruption they cause.

    As scanning volume increased, it was necessary to install larger disk drives, so two 4-Gigabyte disk drives were acquired for the scanning station. To use these larger drives properly required installation of Mac Operating System 7.5, which in turn raised a number of compatibility problems with other software. Resolution of these problems took careful analysis and interfered with production for about two months, but the overall impact of the change was a major increase in the production scanning rate. In addition, larger technical reports, up to 250 pages in size, can now be accommodated without the special handling that was required when they had to be split up across several disks.

    The larger disk space has allowed us to abandon compression on scanned images before archiving them to DAT. Instead, the raw, unprocessed bit-map images (15 Mbytes per page) are now being preserved in the off-line medium. The use of compression in archiving has been a point of considerable discussion. In this case, the time spent compressing the images was larger than the time saved in archiving, so the overall production rate increased when we abandoned compression. In addition, we have always been concerned about increased fragility that comes with removing redundancy from archived materials. The discovery that some previously archived and compressed images, when retrieved and decompressed, were unopenable as images convinced us that our concern for fragility was well-placed.

    During the year we had three failures of 4-Gbyte drives. All were replaced under warranty; as best we can tell from inspection of the replacements, on the first two failures the vendor increased the air flow through the drive cabinets to reduce their internal temperature; the third failure was of a power supply. Although we have been careful to install only disk drives that have had several months of field experience elsewhere, that is not enough to ensure that all design problems have been found. It would be better to use drives that have 18 months or more of field experience, but the disk drive business currently is changing too rapidly for that approach to be economic. We believe that the six-month experience accumulation that we currently require represents the best current trade-off between up-to-dateness and seasoning, though a production activity that did not have the same level of local technical expertise might find it more effective to insist on a longer accumulation of field experience.

    Finally, near the end of the year, we instituted a quality control stage as part of the scanning operation. Several group members formed a task force to develop quality control criteria and workflow, and at the end of the year a first draft description of the quality control activity was under review. Currently, images, file transmission, and processing activity are being monitored. Mechanisms for reporting problems, recording completion of quality control, and archiving the results are being investigated.

    Other Activities Related to Scanning

    Document Scanning Record

    Last year's annual report mentioned that work was underway on defining a document scanning record. This year, the production scanning operation began preparing a document scanning record for each document. A document scanning record contains, in a standard, computer- and human-readable form, the conditions under which the scanning was done--hardware, software, identity of the report, name of the operator, resolution and other hardware and software settings--and a map that lists each image file and a descriptor of its contents (e.g., "image file 14 contains report page number 7"). The scan record is prepared by the scanning operator while the scan is taking place, using an Excel spreadsheet, which turned out to provide both a good template/default form mechanism and a quick way of building the map. The form of a scanned document now consists of a folder containing:
         the document scanning record
         a text specification of the document scanning record
         a set of image files
         a file containing sizes and checksums of all the other files
    

    The checksum file is intended to be merged into the document scanning record, but current software tools aren't up to the job, so for the moment it is a separate file.
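
    As a rough illustration of the checksum file (the checksum algorithm and line layout shown here are assumptions rather than the project's specification), a script along the following lines could produce it for a scanned-document folder:

      # Minimal sketch: record the size and checksum of every other file in a
      # scanned-document folder. MD5 and the "name size checksum" line layout
      # are illustrative assumptions, not the project's actual format.
      import hashlib
      import os

      def write_checksum_file(folder: str, out_name: str = "CHECKSUMS") -> None:
          lines = []
          for name in sorted(os.listdir(folder)):
              path = os.path.join(folder, name)
              if name == out_name or not os.path.isfile(path):
                  continue
              with open(path, "rb") as f:
                  digest = hashlib.md5(f.read()).hexdigest()
              lines.append(f"{name} {os.path.getsize(path)} {digest}")
          with open(os.path.join(folder, out_name), "w") as out:
              out.write("\n".join(lines) + "\n")

      if __name__ == "__main__":
          write_checksum_file("MIT-LCS-TR-653")          # hypothetical folder name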

    The document scanning record has gone through a number of versions, and is likely to continue to be refined as more experience is gained in its use and its ability to support archiving and document browsing. The current version is identified as CSTR.1.3, and a specification of its contents can be found in

    http://ltt-www.lcs.mit.edu/ltt-www/Public/scanrec/scanrec.html

    The scanning record is one of the inputs to the document definition and browsing system described later in this report.

    Document tracking system

    As production scanning went into full gear it became apparent that we needed something better than a paper logging system to keep track of what has been done, what is in the pipeline, and what remains to be done. Jeremy Hylton developed a simple document tracking system using a World-Wide Web form page as the interface to drive a small database program. The database has been primed with a list of all known AI and LCS technical reports, and as they are located in the archives they begin a series of checkoffs as they arrive at the scanning station, are scanned, are returned to the archives, go through post-scan processing, and through quality control. This system was placed into production operation and is now our primary reporting mechanism. As the tracking system went into production, we performed an inventory of archive tapes and archive records and compared them with the scanning logs to make sure that everything is accounted for. As expected, there were a number of confusing things in the records, primarily because we have changed procedures several times as we learned how to get organized to do scanning.
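
    As a simplified illustration of the tracking idea (the stage names and the in-memory storage below are assumptions; the actual system is a small database driven from a World-Wide Web form), each report carries a set of checkoffs that are marked as it moves through the pipeline:

      # Simplified illustration of the tracking idea: each technical report has a
      # set of checkoff stages that are marked off as it moves through the
      # pipeline. Stage names and the in-memory storage are assumptions.
      STAGES = ["located", "at scanning station", "scanned",
                "returned to archives", "post-scan processed", "quality controlled"]

      tracking: dict[str, set[str]] = {}          # report ID -> completed stages

      def check_off(report_id: str, stage: str) -> None:
          if stage not in STAGES:
              raise ValueError(f"unknown stage: {stage}")
          tracking.setdefault(report_id, set()).add(stage)

      def remaining(report_id: str) -> list[str]:
          done = tracking.get(report_id, set())
          return [s for s in STAGES if s not in done]

      if __name__ == "__main__":
          check_off("MIT-LCS-TR-653", "located")
          check_off("MIT-LCS-TR-653", "scanned")
          print(remaining("MIT-LCS-TR-653"))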

    Dienst at M. I. T.

    We had for some time been running an infrastructure-level, local implementation of Dienst, Cornell's scanned-page document server. At Cornell's request, we attempted to install the full Cornell implementation. Jeremy Hylton found that the first distribution package did not build properly in our environment, and Carl Lagoze of Cornell, using the information learned from these installation attempts, prepared a new release. Dienst is now fully operational and delivering our technical reports to the public.

    Research Activities

    An Interchange Standard and System for Browsing Digital Documents

    An M. Eng. thesis by Andrew J. Kass provides a complete definition of a MIME type for a "digiment", an on-line document consisting of scanned images or other displayable form. The scheme used differs from that used by Cornell's Dienst in that meta-information about a document is provided as an architected part of the document object, rather than as data in response to protocol inquiries. We think that Andrew's approach may be more universal and more easily evolved than the alternative. The following two sections, from the introduction to Andrew's thesis, provide more background.

    Abstract: With the advent of fast global digital communication networks, information will increasingly be delivered in electronic form. In addition, as libraries become increasingly computerized, not just card catalogs but entire books will be stored on-line. When delivering digital documents from on-line libraries, two problems must be addressed: delivering a document that may be available in multiple formats, and preserving relationships among the different formats. This thesis describes a system for delivering digital documents, called digiments, which can be in multiple arbitrary formats, while preserving the relationships between the parts and meta-data about the document itself.

    The Need for a Digital Document: For immediate transfer of data, a plethora of methods, formats, and protocols may be used, both for the transfer and for the content itself. However, for transferring between different systems, or for transferring a document that may have content in arbitrary or multiple formats, a standard way of representing a document is needed. In addition, documents have additional information that is not necessarily reflected in the content of the document, such as copyrights, publisher, ISBN number, etc. Furthermore, this information, even if present, needs to be available in an architected way for display. This "meta-information" needs to travel with the content of the document, regardless of the format the content is in.

    By defining a digital document and creating a standard for its transmission, it is possible to solve these problems. By allowing the content to be in any format, but having the digital document describe the format, it is possible to use any format, present or future, for the content itself. By creating a standard way to describe the document's meta-information and content format, it is possible to transmit the document without having to worry about what format it is stored in. Finally, documents can be stored and delivered in multiple formats, making it possible to view both the scanned images and ASCII text of a document.

    Digital Documents: In order to create on-line documents, we need to define what a digital document is. In this model, a digital document is data in arbitrary format accompanied by meta-information, which is part of a document, but not necessarily contained in the document itself. In order to easily transfer digital documents, I am creating a set of MIME [1] types, which describe the format of a document and contain the meta-information, while still allowing arbitrary data formats.

    What exactly do we mean by a document? A document can be a book. It can be a magazine or periodical. It can be a bibliography of a technical field. A technical report is a document. Instruction manuals, TV listings, newspapers, rulebooks, and catalogs can all be considered documents. What do we mean by a digital document, or a document stored on a computer? A digital document can be composed of a collection of images in GIF, JPEG, TIFF or other format. It can be a Postscript, LaTeX or ASCII text file or set of files. It can be a set of images, the ASCII text, and a text to image map that relates the two. It can include sound. In short, there is no easy definition for a digital document. Nor do digital documents correspond well to what we call documents in the physical world.

    For these reasons, we need to create a new type of object, called a digiment, for digital document. Defining a digiment will allow us to achieve some of the goals expressed above. It will provide a standard way to pass around digital documents, regardless of the format in which the content itself is stored. It will provide a standard for archiving and retrieving digital documents. It will provide for the use of multiple representations of a single "document" that can be linked together, as with a text to image map. It is intended to accommodate the use of future data formats transparently. And it will provide meta-information about a document, such as author, publisher, and copyrights. For comparison, in a book, the text on the pages of the book is the content, while the meta-information is found on the inside of the cover page, such as ISBN number, copyright and publisher.

    The digiment standard is not a specific data format for storing or transmitting the content of digital documents. Rather, it is a container for transmitting or storing documents in arbitrary formats. A digiment consists of the data itself, which can be in any form, along with associated meta-information and structural information. This additional information is what differentiates a digiment from a simple set of images or data files. The structural information specifies the format that the data itself is stored in, whether it is image, text, Postscript, etc. It also specifies how the different data parts of the digiment are related to each other. The meta-information includes bibliographic information about the digiment, such as the author, publisher, copyrights, distribution rights, and relationships between different formats. By specifying the digiment in this way, it is possible to use any type of format for storing the actual data, including formats which may be created in the future.
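
    To make the container idea concrete, the sketch below assembles a hypothetical multipart MIME message with one meta-information part and two content parts (a scanned page image and its ASCII text). The header names, field values, and file names are illustrative assumptions; the actual digiment type definitions are given in the thesis.

      # Illustrative sketch of the container idea: a multipart MIME message whose
      # first part carries meta-information and whose remaining parts carry
      # content in arbitrary formats. Header names and values are hypothetical;
      # the real digiment types are defined in the thesis.
      from email.message import EmailMessage

      def build_digiment() -> EmailMessage:
          msg = EmailMessage()
          msg["Subject"] = "digiment example"

          # Meta-information travels with the content, in an architected part.
          meta = ("Title: Example Technical Report\n"
                  "Author: A. Student\n"
                  "Publisher: M.I.T. LCS\n"
                  "Copyright: 1995 M.I.T.\n")
          msg.set_content(meta)

          # Content parts may be in any format; here, a scanned page and its text.
          msg.add_attachment(b"...TIFF bytes...", maintype="image",
                             subtype="tiff", filename="page001.tif")
          msg.add_attachment("The same page as ASCII text.",
                             filename="page001.txt")
          return msg

      if __name__ == "__main__":
          print(build_digiment().as_string()[:400])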

    The completed thesis has been deposited in the M. I. T. library, is currently available on the World-Wide Web at the URL

    http://ltt-www.lcs.mit.edu/ltt-www/People/delphi/thesis.html

    and is available as an M. I. T. Laboratory for Computer Science Technical Report, MIT-LCS-TR-653.

    Deduplication and the Use of Meta-Information in the Digital Library

    A second M.Eng. thesis was undertaken by Jeremy Hylton, to explore methods of handling the problem that a search for information usually provides several leads to what seems to be the same thing; deduplication of these leads is important to make a search system usable. The next four paragraphs are from the introduction of that thesis proposal.

    The explosive growth of the World-Wide Web over the last two years underscores the realization that the technology exists -- or will soon exist -- to build the digital libraries envisioned since the earliest days of computing. The experience of using the Web also makes it clear that linking and resource discovery are complicated problems, at least in part because of the sheer quantity of data available.

    Resource discovery is difficult not only because there is so much data to sift through, but also because so many varying copies of resources exist. When performing a search with Archie, an index of anonymous ftp sites, a query for a particular application will often discover dozens of copies of the same resource and return a long search result set that contains little information.

    One technique that can be used to manage the discovery and location of large quantities of data is deduplication. Exactly as the name suggests, deduplication is the process that narrows a list of resources by eliminating any duplicate references to a resource.

    Deduplication is an interesting problem because it is difficult to decide when two resources are duplicates. Consider two possible uses of a digital library: A student searches for a copy of Hamlet in the library; the digital library actually has many copies of Hamlet -- some by different editors, some in different collections -- but the student only wants to know that a copy of Hamlet is to be had. For the student, deduplication of the list to one or two entries provides a valuable simplification. On the other hand, a Shakespeare scholar may want to find a copy of one of the early printed editions of Hamlet. For that scholar, deduplication needs to work within a narrow sense of equality. The scholar does not need to know that there are three libraries with copies of the first quarto, but does need to know that copies of the first and second quartos exist.

    The complete proposal may be found at
    http://ltt-www.lcs.mit.edu/ltt-www/Public/Proposals/jeremy-thesis.html

    As part of this thesis, Jeremy generated a new set of Technical Report bibliographic records from the publications databases. The new records are an improvement over the previous batch because the publications databases themselves have been improved; for example, the AI Laboratory records now include authors' full names instead of last name and initial, and we also may have information about grants and retrieval. This work was undertaken to provide a body of material on which to perform deduplication experiments. Jeremy also did some preliminary work on a tool for finding duplicate bibliographic records in the MIT card catalog and in the CSTR database. The tool manipulates MARC records in OCLC format and RFC 1357 records, and produces merged RFC 1357 records as output. The thesis is expected to be completed during the fall of 1995.
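
    A minimal sketch of the deduplication idea follows; it is not Jeremy's tool (which works on MARC and RFC 1357 records) but illustrates grouping records under a normalized author/title key and collapsing each group to a single representative. The field names and the normalization rule are assumptions.

      # Minimal sketch of deduplication by normalized key. The field names and
      # normalization rule are illustrative assumptions; the actual tool works
      # on MARC records in OCLC format and on RFC 1357 records.
      import re
      from collections import defaultdict

      def normalize(text: str) -> str:
          return re.sub(r"[^a-z0-9 ]", "", text.lower()).strip()

      def deduplicate(records: list[dict]) -> list[dict]:
          groups: dict[tuple, list[dict]] = defaultdict(list)
          for rec in records:
              groups[(normalize(rec["author"]), normalize(rec["title"]))].append(rec)
          # Keep one representative per group; a real tool would merge fields.
          return [dupes[0] for dupes in groups.values()]

      if __name__ == "__main__":
          recs = [
              {"author": "Shakespeare, W.", "title": "Hamlet"},
              {"author": "shakespeare w",   "title": "HAMLET"},
              {"author": "Hylton, Jeremy",  "title": "Deduplication"},
          ]
          print(len(deduplicate(recs)))                  # prints 2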

    Automatic link generation

    Jeremy Hylton developed an improved demonstration of automatic hypertext link construction. Building on earlier work on text-image maps, he created a Web-based browser that constructs text-image maps on the fly by loading data from Dienst servers. In addition, it recognizes gross features such as the table of contents and the bibliography, and calculates an offset between page numbers and image numbers. The complete demonstration now allows one to approach a scanned technical report by starting from a list of (automatically generated) links to particular points in the document, such as the first numbered page, the table of contents, or the bibliography. Looking, for example, at the scanned image of the table of contents, one can select a page of interest and jump to that page using a link that is generated on demand from the selection point. Looking instead at a citation of another Technical Report found in the bibliography, one can select it, invoke a search, confirm the result, and ask to see the second report. A paper describing this work, co-authored with Eytan Adar, was submitted to Digital Libraries '95.
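
    The page-to-image arithmetic behind on-demand link generation can be sketched as follows (the map format and the image URL pattern are assumptions, not the Dienst interface): once the offset between printed page numbers and image numbers is known, a reference to a printed page can be turned directly into a link to the corresponding image.

      # Simplified sketch of on-demand link generation: derive the offset between
      # printed page numbers and image numbers from the scanning-record map, then
      # build an image URL for any cited page. The map format and URL pattern are
      # illustrative assumptions.

      def page_offset(image_to_page: dict[int, int]) -> int:
          # e.g. {14: 7} means image file 14 holds printed page 7, so offset is 7.
          image, page = sorted(image_to_page.items())[0]
          return image - page

      def link_for_page(doc_id: str, page: int, offset: int) -> str:
          image = page + offset
          return f"http://example.edu/images/{doc_id}/image{image:04d}.tif"

      if __name__ == "__main__":
          offset = page_offset({14: 7, 15: 8, 16: 9})
          print(link_for_page("MIT-LCS-TR-653", 7, offset))   # .../image0014.tif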

    The complete demonstration, usable from any modern World-Wide Web browser, can be found at

    http://ltt-www.lcs.mit.edu/ltt-www/Public/TIMap/demo

    Navigation service

    Andrew Kass completely overhauled and upgraded the data management service that underlies our prototype CS-TR navigation system, which is implemented on top of the Domain Name System (DNS). There are currently two navigation services in operation. Each one takes a CSTR ID as its input argument:

    1. ID.url.ltt-ns.lcs.mit.edu returns a URL to the place that images of the specified document can be found, if images exist.

    2. ID.formats.ltt-ns.lcs.mit.edu returns a list of formats in which the document is available.
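
    The following client-side sketch shows how such lookups might be issued with the third-party dnspython package, on the assumption (ours, for illustration) that the answers are published as DNS TXT records; the ID in the example is hypothetical.

      # Hedged sketch of a client lookup against the navigation service. The use
      # of TXT records is an assumption for illustration; the actual record type
      # and zone contents are not specified here. Requires dnspython.
      import dns.resolver

      def lookup(cstr_id: str, service: str) -> list[str]:
          name = f"{cstr_id}.{service}.ltt-ns.lcs.mit.edu"
          answers = dns.resolver.resolve(name, "TXT")
          return [b"".join(record.strings).decode() for record in answers]

      if __name__ == "__main__":
          doc = "MIT-LCS-TR-653"                         # hypothetical ID
          print(lookup(doc, "url"))       # where images of the document live
          print(lookup(doc, "formats"))   # formats in which it is available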

    The navigation management system now polls the five CS-TR sites for new documents once a week. If a particular site is down at the time of the poll, the system keeps the previous week's data for that site and skips updating for that week. The old data are also retained if a server suddenly "loses" many documents; that is, if the new list of documents is significantly smaller than the previous list.

    Andrew also developed a World-Wide Web demonstration of the navigation service. This demonstration includes both an overview and a technical description of how the navigation service works and what it is used for. The demo allows one to look up any Technical Report that is on-line using its RFC 1357 ID. It then returns the URL of the Technical Report, as well as the formats available. In addition, it shows the actual DNS queries used to obtain the information. A second demonstration shows the practical advantage of using the navigation service to obtain this information rather than querying the document server for the Technical Report each time; the former is more than an order of magnitude faster. This demonstration can be found at

    http://ltt-www.lcs.mit.edu/ltt-www/Public/navigate/navigate.html

    CSTR Architecture definition

    A series of weekly teleconferences run by Jim Davis, with the participation of William Arms, Judith Press, John Garrett, and Jerry Saltzer, developed first-cut definitions, scenarios, and a specification of an architecture for a digital library. In parallel with that effort, the MIT research group created a draft set of definitions and scenarios. Both the MIT architecture description and the CSTR joint architecture description that Jim Davis is coordinating are still works in progress.

    Replication architecture

    Mitchell Charity, with other members of the group, is continuing work on a replication architecture for very long term storage, on meta-data representation, and on naming and infrastructure architecture. The three are closely related. People want a variety of properties from their information storage and distribution environment--low latency, broad replication, consistency, high security, ease of creation and modification, and archival stability--and these requirements are inescapably in tension. While good system design can minimize trade-off costs, this divergence drives system complexity. To contain this complexity, we have been pushed in the direction of specializing services (such as very long term, `no other properties matter' storage), maintaining their independence so that diverse data needs can be met, and creating cohesion with simple naming services and extensible data typing. This separation of concerns seems more promising than, and has helped clarify the weaknesses of, approaches that mix naming, typing, and storage.

    A current hypothesis of our replication research is that the underlying stored objects should be thought of as immutable. Immutability gives the opportunity to use a long checksum of the data itself as the unique name of the object, for example as the opaque string that appears in its URN. Mitchell developed an experimental system that names objects in this way. In addition this experimental system provides a testbed in which we can launch various bit-rotting demons and watch how well the detection and repair algorithms cope.
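
    A minimal sketch of naming immutable objects by a checksum of their own bits follows; the hash algorithm and URN layout are assumptions, and the experimental system's actual choices may differ.

      # Minimal sketch of content-addressed naming for immutable stored objects:
      # the object's name is a long checksum of its bits, so later corruption is
      # detectable by recomputing the checksum. SHA-256 and the URN layout are
      # illustrative assumptions, not the experimental system's actual choices.
      import hashlib

      def object_name(data: bytes) -> str:
          return "urn:hash:sha256:" + hashlib.sha256(data).hexdigest()

      def verify(name: str, data: bytes) -> bool:
          # The detection step a repair process could run against each replica.
          return name == object_name(data)

      if __name__ == "__main__":
          blob = b"immutable scanned page bytes"
          name = object_name(blob)
          print(name)
          print(verify(name, blob))                 # True
          print(verify(name, blob + b"bit rot"))    # False: corruption detected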

    Unified CS-TR interface

    Gillian Elcock implemented an extension to the testbed system that allows it to interface more gracefully with the World-Wide Web. We previously reported building a World-Wide Web interface to the reading-room catalog of technical reports and also a WAIS source for our union catalog of CS-TR bibliographic records. The unified interface fits under these two services; whenever a search turns up a technical report of interest, the unified interface checks the Navigation system described above to see whether or not scanned images are available for that report. If so, it fabricates on the fly a WWW page that contains one button for each page image, thus providing a kind of simple browser for use from the Web.
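
    The on-the-fly page fabrication can be sketched roughly as follows (the image URL pattern and the way the page count is obtained are assumptions, not the testbed's actual interface): given a report whose images are available, emit an HTML page with one link per page image.

      # Rough sketch of fabricating a WWW page on the fly with one link ("button")
      # per page image. The image URL pattern and the page count argument are
      # illustrative assumptions.

      def browser_page(doc_id: str, image_base_url: str, num_pages: int) -> str:
          rows = [f"<title>{doc_id}</title>", f"<h1>{doc_id}</h1>"]
          for page in range(1, num_pages + 1):
              rows.append(
                  f'<a href="{image_base_url}/image{page:04d}.tif">Page {page}</a>')
          return "\n".join(rows)

      if __name__ == "__main__":
          html = browser_page("MIT-LCS-TR-653",
                              "http://example.edu/images/MIT-LCS-TR-653",
                              3)
          print(html)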

    Circulation system

    Geoff Lee Seyon developed and deployed an on-line document circulation system for the Library 2000 testbed system in use at the LCS/AI reading room. The system records an e-mail address when materials go out, and it sends overdue notices to that address. When this new system was put into operation, it was discovered to be alarmingly effective. The number of checked-out materials that were returned was so large that the reading room had to order additional shelves--it had previously been depending on patrons to hold a substantial part of its collection. (This system is not a mainline component of the research project; we did it to maintain the ability to use the LCS/AI reading room as a trial venue for mainline projects.)

    CS-TR Library Cooperation

    Greg Anderson has been working with librarians from other CS-TR participating institutions, primarily Stanford, to articulate and discuss digital library issues from a library service and support perspective. Toward that end, he, Rebecca Lasher, and Vicky Reich of Stanford have prepared a Technical Report on this topic, to be published jointly by M. I. T. and Stanford. The original outline of the library issues--sustainability, growth, migration from research to production, inclusion of content beyond CS-TR's, and integration with new and existing library electronic services--was presented at the winter 1994 CS-TR meeting in Palo Alto, CA, and revisited at the June 1995 meeting in Berkeley.

    Publications and Theses

    Published papers

    Marilyn McMillan and Greg Anderson. "The Prototyping Tank at MIT: Come on in, the Water's fine", CAUSE/EFFECT, vol. 17, number 3 (Fall 1994), 51-54.

    Eytan Adar and Jeremy Hylton. On-the-fly Hyperlink Creation for Page Images. Proceedings of the Second International Conference on the Theory and Practice of Digital Libraries, College Station, Tex., June 11-13, 1995.

    Theses

    Andrew Kass. An Interchange Standard and System for Browsing Digital Documents. M. Eng. thesis, M. I. T. Dep't. of EECS, May, 1995. Also available as Technical Report MIT-LCS-TR-653.

    Theses in progress

    Jeremy Hylton. Deduplication and the Use of Meta-Information in the Digital Library. M. Eng. thesis, M. I. T. Dep't. of EECS. Expected date of completion, January, 1996.


    Last Updated: 9 August 1995