Library 2000 Prospectus

LIBRARY 2000
A research prototype of the on-line electronic library of tomorrow.

by Jerome H. Saltzer
October 31, 1991

Abstract:  The technology of on-line storage, display, and
communications will, by the year 2000, make it economically possible
to place the entire contents of a library on-line, accessible from
computer workstations located anywhere.  The goal of Library 2000 is
to understand and try out the system engineering required to fabricate
such a future library system from the underlying technologies.

The Vision:  One can browse any book, journal, paper, thesis, or
report in the library using a standard office desktop computer, and
can follow citations by pointing; the thing selected pops up
immediately in an adjacent window.

-----

Some details.

1.  Why?  Digital storage in the year 2000 will be 4 orders of
magnitude cheaper, and therefore at least 4 orders of magnitude larger
than it was ten years ago.  We need to start now to figure out how to
manage that much data.  Finding things, linking things together
(especially across the network), and keeping it reliable will all
require new ideas.

2.  Why now?  Three technologies have recently improved markedly.

    - magnetic disk storage cost/bit and absolute size.  Within the
      decade one will be able to hold a page-image library entirely
      online at reasonable cost in both hardware and floor space.

    - data communications, both campus-sized and nationwide, now or
      soon will permit moving page images from the library to the
      workstation in a human reaction time, again at reasonable cost.

    - display is at a threshold where reading bitmapped images will become
      acceptable.

and, in addition, the client/server model has matured to the point
where it is directly applicable.

3.  The interesting system support design problems.

    - persistence of storage.  Current systems aren't designed to
      handle data that is meant to be retained for several decades,
      times that are one or two orders of magnitude greater than the
      lifetime of the storage media, compression techniques, and forward
      error correction techniques, not to mention representation
      methods.  How can one update media technology underneath
      Terabytes of storage with confidence that information hasn't
      been damaged?  For information that isn't regularly used (e.g.,
      the least-used 50% of a library's collection) what are good
      tradeoffs among the number of copies, reliability, and
      geographical dispersion?  The project will explore what is
      required to make a media refresh subsystem a standard component
      of an operating system.
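
      The verification step of a media refresh can be pictured with
      the minimal Python sketch below.  It is an illustration only:
      it assumes each object has a checksum recorded at archive time,
      and the name refresh_object, the choice of SHA-256, the object
      name "TR-1", and the dictionaries standing in for old and new
      media are all hypothetical, not part of the project design.

```python
import hashlib


def checksum(data: bytes) -> str:
    """Return a SHA-256 digest used to detect damage (an assumed choice)."""
    return hashlib.sha256(data).hexdigest()


def refresh_object(name, old_media, new_media, recorded_sums):
    """Copy one object to new media, verifying it before and after the copy.

    old_media and new_media are dicts standing in for storage devices;
    recorded_sums maps object names to checksums taken at archive time.
    """
    data = old_media[name]
    if checksum(data) != recorded_sums[name]:
        # The old copy is already damaged; do not propagate it.
        raise ValueError(f"{name}: damaged on old media; fetch another copy")
    new_media[name] = data
    # Read back from the new media to confirm the write itself was sound.
    if checksum(new_media[name]) != recorded_sums[name]:
        raise ValueError(f"{name}: damaged during refresh")
    return True


# Hypothetical example: one object moved from old media to new media.
old = {"TR-1": b"page image bytes"}
sums = {"TR-1": checksum(old["TR-1"])}
new = {}
refresh_object("TR-1", old, new, sums)
```

      Checking before the copy keeps damage from being carried
      forward silently; checking after it gives the confidence the
      question above asks for.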
 
    - backup.  For reliability, there must be more than one copy of
      the data, but traditional backup methods involving full and
      incremental copies made to tape do not appear to scale up well
      in size and they are notoriously complex and error-prone.
      The project will explore the hypothesis that one can provide the
      necessary reliability with geographically separated multiple
      copies, plus a change log, which allows recovery from mistakes
      in updating.
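
      The hypothesis can be sketched as a toy model in Python.  The
      class ReplicatedStore, its method names, and the site names are
      assumptions made for illustration; a real system would place
      the replicas on separate machines and keep the log on stable
      storage.

```python
class ReplicatedStore:
    """Toy store: several replicas plus an append-only change log."""

    def __init__(self, sites):
        # One dict per site stands in for a geographically separate copy.
        self.replicas = {site: {} for site in sites}
        self.log = []  # entries: (key, previous value or None, new value)

    def put(self, key, value):
        """Record the change in the log, then apply it to every replica."""
        any_replica = next(iter(self.replicas.values()))
        self.log.append((key, any_replica.get(key), value))
        for replica in self.replicas.values():
            replica[key] = value

    def undo_last(self):
        """Recover from a mistaken update by reversing the last log entry."""
        key, previous, _ = self.log.pop()
        for replica in self.replicas.values():
            if previous is None:
                replica.pop(key, None)
            else:
                replica[key] = previous

    def recover_site(self, failed, healthy):
        """Rebuild a lost replica from a surviving one."""
        self.replicas[failed] = dict(self.replicas[healthy])


# Hypothetical example: a mistaken update is rolled back on all copies.
store = ReplicatedStore(["site-a", "site-b", "site-c"])
store.put("TR-1", "correct text")
store.put("TR-1", "mistaken update")
store.undo_last()
```

      The replicas cover loss of a site, while the log covers the
      case the replicas cannot: an error that was faithfully copied
      to every site.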

    - representation.  There appear to be many possibilities, e.g.,
      scanned images, ASCII, PostScript, SGML, FAX Group IV, etc., but the
      requirement of storing the data for decades seems to reduce the
      field drastically.  The project will explore the hypothesis that
      the right thing to do is to collect scanned images and
      minimally-tagged ASCII for every object.

    - organizational cooperation.  Different organizations will be the
      natural providers of different items.  New relations of
      interdependence among players must be worked out.  Who takes
      primary responsibility to store what?  Who stores
      secondary/backup copies?  Where does one put the caches?  Which
      bits are authoritative?  The project will examine these
      questions by actively cooperating with several other
      organizations doing similar work.

    - applying the client-server model.  The world of libraries has
      several requirements that act to inhibit application of
      technology; it appears that a network of multiple, dedicated
      servers and user-owned desktop workstations meets several of
      those requirements and provides a natural solution to the
      problems.  Some traditional problems that the client-server
      model seems to help with:
          - the presentation device has been owned by the library, not
            the customer, thus limiting location and ubiquity.
          - the circulation management system, which needs fast
            response, has had to operate in the same computer as the
            bibliographic search program, which soaks up lots of
            cycles and degrades response to other activities.
          - libraries do bibliographic cataloguing of books, while
            for-fee services do it for journals.  In both cases the
            bibliographic material duplicates some information found
            in the item catalogued.  And the two worlds use independent,
            different access methods.
          - The user needs to see a distinction between the two search
            outcomes "the library doesn't have it" and "can't identify
            it from what you said".  Present systems merge those
            outcomes into a single "search failed".
          - monolithic library systems make any change a big deal;
            client/server components can plug together like a hi-fi system,
            allowing modular replacement.
          - interlibrary cooperation is at arm's length, yet much of any
            collection duplicates other collections.  The client/server
            model makes cooperation technically straightforward.
          - one needs different search techniques for different
            collections.
          - one wants to search several public collections with a
            single request.  And to extend a search to private
            collections without the need to first register their
            existence.
      Each of these concerns appears superficially to be well
      addressed with a client/server model, but field experience with
      a real design is needed to see if it actually works.  At the
      same time, several additional problems aren't automatically
      solved by the client/server model, and require further ideas:
 
       - identification.  Given a citation, how to locate the object cited.
         Involves several interacting components--just what object is it,
         how is it identified, what library holds it, how do we get our
         hands on it.  What part of this is done at the time a document is
         archived, what part is done at the time a link is followed?

       - user interface.  How to provide a simple, intuitive model of
         what is going on, especially if multiple collections are being
         searched.

       - cross-representation coordination.  How does one relate a
         mouse gesture on a displayed image to a particular set of
         words in the ASCII form?

       - coordination of multiple, possibly variant copies of information.
         How does one discover that two things from different collections are
         actually the same object, or a minor variation?

       - allowing for authorship and reviewed publication of links, so
         that users can concentrate on competent links.

       - allowing new ideas for search to fit in gracefully.

      The project will explore this area by developing a candidate
      client/server architecture and building it to see how well it
      actually works.
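
      Two of the concerns listed above, searching several collections
      with a single request and telling "the library doesn't have it"
      apart from "can't identify it from what you said", can be
      sketched in Python as follows.  This is a minimal illustration
      under stated assumptions: the separate union_catalog used for
      identification, the exact-title matching, and all names and
      titles are hypothetical.

```python
from enum import Enum


class Outcome(Enum):
    FOUND = "found"
    NOT_HELD = "the library doesn't have it"
    NOT_IDENTIFIED = "can't identify it from what you said"


def search(query, union_catalog, collections):
    """Search several collections with one request.

    union_catalog: set of titles known to exist anywhere, used only to
    decide whether the query identifies a real item (an assumption).
    collections: dict mapping collection name -> set of titles held.
    """
    wanted = query.strip().lower()
    known = {title.lower() for title in union_catalog}
    if wanted not in known:
        return Outcome.NOT_IDENTIFIED, []
    holders = sorted(
        name for name, titles in collections.items()
        if wanted in {t.lower() for t in titles}
    )
    if not holders:
        return Outcome.NOT_HELD, []
    return Outcome.FOUND, holders


# Hypothetical data: one public and one private collection.
catalog = {"Report A", "Report B"}
collections = {"public-1": {"Report A"}, "private-1": set()}

found = search("report a", catalog, collections)
not_held = search("Report B", catalog, collections)
unknown = search("Report Z", catalog, collections)
```

      Separating identification from holdings is what lets the two
      failure outcomes be reported differently instead of merging
      into a single "search failed".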


4.  Frequently-mentioned problems that are left for others to work on.

    - advanced search ideas (relevance and relatedness measures, image
      and structure search, etc.)
    - advanced user interfaces; bright ideas for improving navigation,
      or providing better arrival and departure context.
    - back-office automation improvements.  (circulation/overdue,
      acquisitions, serials control, cataloguing, interlibrary loans, etc.)
    - ideas to deal with the probable dislocation of revenue flows.

    (The proposed system would provide a good laboratory in which
     others can explore innovative proposals in those problem areas.)


5.  The method.

    - build a prototype electronic library, with provision for
           -  bibliographic records
           -  abstracts
           -  full text
           -  page images
           -  simple keyword-based search
           -  multiple public collections
           -  unregistered private collections
           -  multiple-copy coordination
           -  interface to traditional library back-office support systems
           -  a simple, portable, X-based presentation interface

    -  stock it, on a small scale
           -  AI/LCS reading room bibliographic records
           -  Computer science bibliographic records extracted from Barton
           -  AI and LCS Technical Report abstracts
           -  AI and LCS TR text (obtained by canvassing old WP files)
           -  AI and LCS TR page images

    -  investigate various directions
           - how would it scale to an M.I.T.-sized library
                 (2M items)
           - what parts would be applicable to a library the size of
             the Library of Congress (60M items)
           - try out storage architecture ideas intended to provide
             multiple-decade persistence
           - try out replication for backup
           - explore placement of compression and forward error
             correction inside the file system.

    -  begin stocking on the next larger scale
           - collect machine-readable abstracts of TR's from as many
             other computer science organizations as possible.
           - put online images, and if possible text, of recent M.I.T. theses
           - do the same for all recent M.I.T. Technical Reports
           - make the full M.I.T. library catalog available
           - upgrade two Athena clusters with displays suitable for
             this application.

6.  Time scale.

    Three to five years.


7.  Opportunities for cooperation.

    Within M.I.T....

    - The M.I.T. Library System wants to move in this direction once
      it is proven, and has offered cataloguing data for experiments,
      willingness to get into the scanning business and opportunities
      to help guide future system plans.

    - The M.I.T. Industrial Liaison Program, a major distributor of
      M.I.T.-originated Technical Reports, would like to explore
      making an on-line collection available to its corporate clients,
      and is willing to provide another place to test the framework.

    - The M.I.T. Technology Licensing Office is providing support both in
      setting policy for access to M.I.T. materials and in working out
      copyright assignment problems.

    - The Advanced Network Architectures group of the M.I.T.
      Laboratory for Computer Science wants to develop novel protocol
      suites for cross-reference of information at different network
      sites.  The requirements of the electronic library will
      influence the infrastructure they intend to develop and
      vice-versa.

    - M.I.T. Project Athena will provide its environment as an
      experimental application delivery vehicle.

    Beyond M.I.T....

    - A cooperative venture is underway with a staff member at DEC CRL
      to develop jointly the protocol that links an information client,
      a storage service, and an indexing service.

    - CMU (Project Mercury) already has a similar framework working, with
      more emphasis planned on object-oriented document structure.
      Has offered to exchange both software and ideas.

    - Thinking Machines (WAIS project) has another similar framework
      in operation, with more emphasis on search and document relations.
      Has offered to exchange both software and ideas.

    - DEC SRC has offered to exchange access to online Technical
      Reports, and would provide an opportunity to learn how to make
      things work across filtering gateways.

    - A consortium of CMU, UC/Berkeley, Stanford, and M.I.T. has
      proposed to exchange scanned computer science technical reports
      and related bibliographic information, providing another opportunity
      to work with multiple collections.

8.  Resources required.

    - a library willing to act as a prototype.  (The LCS/AI reading
      room is the current library of choice.  M.I.T. theses and TR's
      involve the M.I.T. Library System as part of a later step.)

    - Hardware.
       - storage. (total ~3.6 Terabytes, to be organized as a triply
                   replicated 1.2 Terabyte system, for persistent
                   image storage.)  Fifteen servers each initially
                   equipped with twenty 1.2 Gbyte disks.  To be later
                   upgraded to twenty 12 Gbyte disks each.
       - indexing. Nine high-performance servers each equipped with
                   three 1.2 Gbyte disks and 128 Mbyte RAM.
                   (for index and search engines.)
       -  network.  10 Mbit/sec to start, can use upgrades as they
                   become available.
       -  display.  Ten workstations for development use.  Later,
                   upgrades for 80 Athena workstations, for delivery.
       -  misc:  cut page scanner, two laser printers.
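
      The storage figures above can be checked with a short
      calculation.  The Python sketch below assumes decimal units
      (1 Terabyte = 1000 Gbytes); the function and variable names are
      illustrative.  It shows that the initial 1.2 Gbyte disks give
      about 0.36 Terabytes raw, and that the ~3.6 Terabyte total, or
      1.2 Terabytes per replica, is reached only after the upgrade to
      12 Gbyte disks.

```python
SERVERS = 15           # storage servers listed above
DISKS_PER_SERVER = 20  # disks in each server
REPLICAS = 3           # triply replicated


def raw_capacity_tb(disk_gb):
    """Raw capacity in Terabytes, taking 1 Terabyte = 1000 Gbytes."""
    return SERVERS * DISKS_PER_SERVER * disk_gb / 1000


initial = raw_capacity_tb(1.2)   # with the initial 1.2 Gbyte disks
upgraded = raw_capacity_tb(12)   # after the upgrade to 12 Gbyte disks
usable = upgraded / REPLICAS     # one logical copy of the replicated store

print(f"initial raw:  {initial:.2f} TB")
print(f"upgraded raw: {upgraded:.2f} TB")
print(f"per replica:  {usable:.2f} TB")
```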

    - People.
       -  65% of one faculty member as supervisor.
       -  one professional staff programmer.
       -  two graduate student research assistants.
       -  six undergraduate research students.
       -  75% of an FTE library staff person for cataloguing and assuming
          novel collection management burdens.
       -  25% of a support secretary.

    - Space.
       -  Office space for the listed people. (4 or 5 offices.)

    - Miscellaneous.
       -  Software acquisition (search, user interface, changes to
                    vendor systems) 
       -  Travel
       -  System and Network maintenance
       -  Materials & services

    -  Overhead.
       -  M.I.T. 
       -  L.C.S.