-----------------------------------------------------------------------
|                                                                     |
| PROPOSAL TO DIGITAL EQUIPMENT CORPORATION EXTERNAL RESEARCH PROGRAM |
|                                                                     |
-----------------------------------------------------------------------

 Research Institute:  Massachusetts Institute of Technology
      Organizations:    -  Laboratory for Computer Science, Library 2000
                        -  Library System, Distributed Library Initiative

     Research Title:  LIBRARY 2000 STORAGE SYSTEM
    
               Date:  June 1, 1993
    
      Investigators:  Jerome H. Saltzer, Professor of Computer Science
                      M. I. T.  Laboratory for Computer Science
                      Room NE43-513, Tel (617) 253-6016, e-mail Saltzer@mit.edu

                      T. Gregory Anderson, Assoc. Dir. for Systems and Planning
                      M. I. T. Library System
                      Room 14S-216, Tel (617) 253-5654, e-mail ganderso@mit.edu

     Administration:  M. I. T. Office of Sponsored Programs
                      Paul C. Powell, Coordinator
                      Room E19-719, Tel (617) 253-3856

    Digital Sponsor:  Howard Webber, Director of Advanced Development,
                      Workgroup Systems, ZK02-2/T63, Tel. (603) 881-0772
                      e-mail webber@a1vax.enet.dec.com

  Funding Requested,
       from Digital:
          Equipment & Licenses @ Manufacturer's List Price:       $505,411
          Allowance Requested (Expressed in Dollars):              505,411
          Allowance Requested (Expressed as a % of MLP):              100%
          Other resources:                                            none

 from other sources:
          Digital (Previous grant)                                $200,000
          Corporation for National Research Initiatives,
             Reston, Virginia:   (10/1/92--6/30/95)              1,521,496
          IBM Corporation (hardware grants at list price)          400,000

  Research Abstract:  Design and construct a triply-replicated image
                      storage system for the Library 2000 testbed, to enable
                      understanding of problems of modularity, linking,
                      persistence, and image storage service semantics. 

       Deliverables:  Quarterly and annual progress reports, technical
                      reports, and papers.  Digital will also receive
                      the benefit of early and close contact with the
                      entire Library 2000 project, including system
                      architecture research, index server protocols,
                      user interface experiments, and Internet access
                      to a production-quality demonstration library
                      consisting of all M. I. T. computer science
                      technical reports and theses.

Technology Transfer Plan:
                      1.  Copies of project theses, papers, and reports.
                      2.  Demonstration access to the testbed system.
                      3.  Talks and seminars as appropriate.
                      4.  Space and facilities for one person from
                          Digital to work with us.


LIBRARY 2000 STORAGE SYSTEM                                    June 1, 1993

Motivation for the proposed research:

Within ten years, the manufacturing cost of magnetic disk storage will
have dropped by two orders of magnitude from 1993 levels.  This
reduction in price will enable many information-dominated applications
that at present would be ridiculously uneconomic.  If we wait until
this low-priced storage technology becomes available, application
designers will then stumble on the many unsolved and difficult systems
engineering problems that surround the use of large-scale storage.
Our research goal is that when that low-priced storage does arrive,
the underlying systems engineering will be ready for those
information-dominated applications.

One application area that lower storage costs will soon enable is the
on-line electronic library.  As the cost of magnetic disk storage
drops by the first factor of ten, it will become competitive in
storage cost with paper, and ten times smaller than paper in physical
volume.  These changes will undoubtedly trigger much interest in the
technology.  The second factor of ten will make on-line storage much
cheaper than paper and 100 times smaller in volume, a result that
should generate a powerful driving force toward large-scale
deployment.  Therefore this research project proposes to begin work on
understanding the engineering problems by building a prototype of an
on-line electronic library.

The appropriate architecture of a large-scale, entirely on-line,
electronic library is not understood.  While current ad hoc approaches
can provide access to specific large data bases, we believe that
longer-term increases in scale will require an access architecture
that is systematically organized around several additional issues,
each of which requires research.  The three specific research problems
we intend to address are appropriate modularity, data integrity, and
data linking.  These research problems are explained in the next
section of this proposal.




Description of the Research Project:

A small group of research universities, currently consisting of
Carnegie-Mellon, Stanford, the University of California at Berkeley,
Cornell University, and M. I. T., is engaged in trying to create a
working, useful prototype of an on-line, electronic library consisting
of the technical reports of their respective Computer Science
laboratories.  The LIBRARY 2000 STORAGE SYSTEM will be the storage
component for the M. I. T. part of this prototype electronic library
system.  In addition to M. I. T. computer science technical reports,
it will also contain student theses in computer science.

The goals of the project are:

1.  to obtain early experience with a core function of the
    distributed electronic library of the future,

2.  to explore the architecture, design, and work-flow issues
    associated with making information available in digital form,

3.  to work within the research/prototype domain with a volume of
    information large enough to be useful and interesting and that can
    scale to an operational system,

4.  to work with a database that is readily available, that has a
    critical time-sensitive value, and that is already well-known and
    valued by its target audience,

5.  to provide an important service to an audience of researchers,
    faculty, and students who are motivated and likely to have access
    to appropriately powerful workstations to use the library from
    their offices, so that we can observe user reactions.

This project is organized around two time frames.  In the short term,
we intend to make available an on-line resource of computer science
technical reports and theses.  For the longer term we intend to
develop an architecture that can scale up to deal with an on-line
library of any size.  The research component of the project thus
involves discovering solutions to these longer-term problems in the
context of the short-term delivery system.  As mentioned in the
previous section, three specific research problems will be addressed
with the implementation of the LIBRARY 2000 STORAGE SYSTEM:
appropriate modularity, data integrity, and data linking.

    - Appropriate modularity.  It is not yet clear exactly what
functional division is appropriate in distributing presentation
management, indexing, storage, and collection management across
distinct network-attached components.  This modularity extends to the
organizational model for collection management: which segments of this
system and process are best centralized, which are best distributed,
and which architectures are needed to ensure effective and viable
handling of collections and bibliographic control.  The client/server
model is a helpful starting point, but it does not provide guidance as
to which components should retain what kinds of state.  We believe
that robustness is best achieved if the server maintains the minimum
amount of information about its clients, and we will test this
hypothesis.  The client/server model also provides no guidance as to
how to exploit disk and RAM costs that will soon be 100 times lower
than they are today.  We are starting with two technology hypotheses:
that the projected price of magnetic disk storage 10 years hence will
make page image storage on that medium much cheaper than storage on
paper, and that the projected price of RAM storage 10 years hence will
make it clearly appropriate to place full-text (or any other kind of)
indexes entirely in RAM.  Our research hypothesis is that the storage
component is best organized as a distinct network service, independent
of indexing services and the presentation client, with a specialized
protocol for update, browsing, and for an indexing service to learn
what materials have been recently added (the "what's-new?" interface).

    - Data integrity.  There is a challenge in maintaining integrity
in a system that can potentially accumulate terabytes of material, in
the form of millions of files, over tens of years.  Copying that much
data accurately when a technology becomes obsolete and needs to be
replaced is already a challenge; keeping that much data accurate in
the face of media failures requires an approach very different from
traditional tape backup systems.  We intend to explore techniques of
multiple replication at sites that are widely separated, on the
hypothesis that this approach will prove much better-matched to the
problem than are traditional backup system designs.  To our knowledge,
although many workers have explored both the theory and practice of
replication, that work has had the goal of improving availability and
performance in the short term; no one has seriously proposed using
replication in place of traditional, complex, full- and
incremental-backup systems.  We expect that the volume of information
in a digital library will be sufficiently large that the rate of decay
in place may be comparable to the rate of change and addition of new
information; thus we expect to provide an ongoing process that
continually reviews the data in each replicant for integrity and makes
any needed repairs by reference to the other replicants.
Interestingly, adding such a process to the system makes updates and
additions to the database a challenge--the integrity-preserving
process may throw out updates unless the update procedure is carefully
coordinated with it.  We intend to design and implement in our testbed
a coherent integrity-preserving system based on replicants.

    - Linking.  A link is a cross-reference, placed in a data object,
to another data object, which may be elsewhere in the network.  What
makes links challenging is that the cross-reference may not be invoked
until several years later, and during the intervening time the target
object may have been involved in physical, logical, or administrative
reorganizations, and the target document itself may have been updated,
reclassified, moved, combined with others, or republished.  In
addition, since one may collect cross-references from many sources, it
is important to be able to figure out which ones are duplicates.  It
is our hypothesis that some combination of unique identifiers and
names will be required to meet these requirements.  The Internet
Engineering Task Force has recently begun work on a concept called the
Universal Document Identifier that is intended to tackle a significant
part of the cross-reference problem.  Several other candidate ideas
for handling cross-reference have been suggested in related contexts
such as the World-Wide Web and the Wide Area Information Service.
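
The linking hypothesis can be pictured with a small resolver table.
The Python sketch below is an illustration under stated assumptions,
not a design taken from this proposal: the class name LinkResolver,
the identifier strings, and the location strings are all
hypothetical.  A link stores only a permanent identifier; the
resolver maps that identifier to the object's current location, so
the object can be moved or combined with another without breaking
the link, and duplicate cross-references can be recognized by
comparing identifiers.

```python
class LinkResolver:
    """Hypothetical sketch: permanent identifiers mapped to current locations."""

    def __init__(self):
        self._location = {}     # identifier -> current location
        self._merged_into = {}  # retired identifier -> surviving identifier

    def register(self, ident, location):
        """Record a new object under a permanent identifier."""
        self._location[ident] = location

    def canonical(self, ident):
        """Follow merge records to the identifier that survives today."""
        while ident in self._merged_into:
            ident = self._merged_into[ident]
        return ident

    def move(self, ident, new_location):
        """A reorganization changes the location, never the identifier."""
        self._location[self.canonical(ident)] = new_location

    def merge(self, old_ident, surviving_ident):
        """Combine two objects; links to the old identifier keep working."""
        surviving_ident = self.canonical(surviving_ident)
        if old_ident == surviving_ident:
            raise ValueError("cannot merge an object into itself")
        self._merged_into[old_ident] = surviving_ident
        self._location.pop(old_ident, None)

    def resolve(self, ident):
        """Return the current location of the object a link refers to."""
        return self._location[self.canonical(ident)]

    def same_target(self, a, b):
        """True if two cross-references are duplicates of one another."""
        return self.canonical(a) == self.canonical(b)
```

A link created before a move or a merge still resolves afterward,
because only the resolver's table changes.  How such a table would
itself be distributed and kept intact over many years is the open
question; the sketch does not address it.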

As suggested by the hypotheses, ideas are in hand for tackling each of
these problem areas, but none of these hypotheses has yet been tried
out on anything approaching the needed scale.
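
As one concrete picture of the integrity-preserving review process
described under "Data integrity" above, the following Python sketch
compares the copies of each object held by the replicants and
repairs any copy that disagrees with the majority.  It is a minimal
sketch under assumptions that are not part of this proposal: each
replicant is modeled as an in-memory dictionary from object name to
bytes, copies are compared by SHA-256 digest, and majority voting is
used as the repair rule.  The function names are hypothetical.

```python
import hashlib
from collections import Counter


def checksum(data):
    """Digest used to compare copies (SHA-256 is an assumption)."""
    return hashlib.sha256(data).hexdigest()


def review_object(copies):
    """Given one copy per replicant (None if missing), return a good
    copy and the indexes of the replicants that need repair."""
    digests = [checksum(c) if c is not None else None for c in copies]
    counts = Counter(d for d in digests if d is not None)
    if not counts:
        raise ValueError("no replicant holds a copy")
    winner, votes = counts.most_common(1)[0]
    if votes * 2 <= len(copies):
        raise ValueError("no majority; repair needs outside help")
    good = copies[digests.index(winner)]
    bad = [i for i, d in enumerate(digests) if d != winner]
    return good, bad


def scrub(replicas):
    """Review every object once, repairing in place.  Returns a list
    of (object name, replicant index) pairs that were repaired."""
    repaired = []
    names = set().union(*(r.keys() for r in replicas))
    for name in sorted(names):
        copies = [r.get(name) for r in replicas]
        good, bad = review_object(copies)
        for i in bad:
            replicas[i][name] = good
            repaired.append((name, i))
    return repaired
```

The sketch also shows why updates need coordination: a new version
that has reached only one of three replicants is indistinguishable
here from decay, and this review pass would overwrite it with the
old version.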



Milestones:

0.  Initial testbed implementation.  Already in service.

1.  Collection of page images begins.  Summer, 1993, immediately
    following installation of hardware.

2.  Implementation of "get" (reading) part of storage service protocol.
November, 1993.

3.  Implementation of "put" (update) and "what's-new" (for index
services) parts of storage service protocol.  June, 1994.

4.  First experimental implementation of replication system.  June, 1994.

5.  Implementation of production version of replication system.  June, 1995.
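
The three protocol parts named in milestones 2 and 3 can be pictured
with a short Python sketch.  It is illustrative only and rests on
assumptions: the class name StorageService, the method signatures,
and the use of a sequence number as the "what's-new" cursor are
hypothetical, not the protocol this project will define.  In keeping
with the hypothesis that the server should hold minimal information
about its clients, the indexing service carries its own cursor and
the server remembers nothing about who has asked.

```python
class StorageService:
    """Hypothetical in-memory stand-in for the storage service."""

    def __init__(self):
        self._objects = {}  # object name -> stored bytes
        self._log = []      # names in the order they were put

    def put(self, name, data):
        """Add or update an object; returns its sequence number."""
        self._objects[name] = bytes(data)
        self._log.append(name)
        return len(self._log)

    def get(self, name):
        """Read an object by name."""
        return self._objects[name]

    def whats_new(self, since=0):
        """Return the names put after sequence number `since`, plus
        the cursor to present on the next call.  The server keeps no
        per-client state; the caller holds the cursor."""
        names = []
        for name in self._log[since:]:
            if name not in names:  # report each object once
                names.append(name)
        return names, len(self._log)
```

An indexing service would call whats_new with the cursor returned by
its previous call, fetch each named object with get, and index it.
If the indexing service loses its cursor it can start again from
zero without any help from the server.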


Deliverables:

Much of the code developed for this project will, for flexibility, be
implemented using fast prototyping systems, so we do not view the code
itself as a deliverable--although we intend to make it available to
anyone interested in learning from it.  The deliverables will comprise
technical reports, theses, progress reports, and published papers
describing the theoretical, architectural, technical, operational, and
managerial aspects of a testbed of a future library system.

Digital will also receive the benefit of early contact with the entire
Library 2000 project, which extends beyond the storage service to
include its general system architecture, index server protocols, and
user interface prototypes.

M. I. T. will grant to Digital unlimited permission to use the testbed
system, including the index and stored database of M. I. T. computer
science technical reports and theses, via the Internet, for the
duration of the project.


Technology Transfer Mechanisms:

1.  Theses, published papers, technical reports, and progress reports,
as described in the list of deliverables.

2.  Demonstration access to the testbed system as described in the
list of deliverables.

3.  The Library 2000 team will be available for talks and seminars at
Digital sites, to be scheduled as appropriate.

4.  We will make available space and facilities for one person from
Digital Engineering to work with us for the duration of the project.




Current Status:

This proposal is part of a larger project, named Library 2000, to
develop a testbed system that explores the future of the digital
electronic library.  This project began about eighteen months ago with
a cash grant from Digital, a hardware grant from IBM, and
discretionary research funds from M. I. T.  Those grants supported the
design and implementation of an initial testbed and the preparation of
a $1.5M proposal to the Corporation for National Research Initiatives
to participate in a DARPA-sponsored cooperative project involving five
universities.  That proposal has been approved in principle, and is in
the final stages of negotiation of contract details.

Within Digital, the project has been under discussion for about two
years, both within the External Research Program and with Howard
Webber of the Workgroup Systems Group.  In addition, several people from
Digital's research laboratories are familiar with the project.  These
include Jim Miller (Cambridge Research Laboratory) and Chuck Thacker
(Systems Research Center).  The proposal has been prepared with the
help of our Digital account representative, Andrea Hauser.

A paper describing the overall vision of Library 2000 was presented in
Paris in November, 1992, at an INRIA 25th anniversary conference.  Two
related workshop papers have also been presented, describing problems
and prospective solutions on the subjects of backup, indexing, and
linking.  The current testbed is in service on the Internet, providing
bibliographic information for the LCS/AI reading room and,
experimentally, for the M. I. T. Library System.  It can be sampled
from any X workstation on the Internet.



Qualifications of the Researchers and of the Institution:

1.  Jerome H. Saltzer, Co-Principal Investigator, Professor of
Computer Science, M. I. T. Laboratory for Computer Science.  Professor
Saltzer has been active in the field of computer systems research
since 1961.  He developed a user interface to the file-sharing
facilities of the Compatible Time Sharing System; one of the earliest
word processing systems; the interprocess communication features and
some of the security features of the Multics computer system; an early
token ring local area network; and the first network package for
desktop personal computers.  He was Technical Director of M. I. T.
Project Athena, a system of 1000 engineering workstations and servers
used for undergraduate education.

2.  T. Gregory Anderson, Co-Principal Investigator, Associate Director
for Systems and Planning, M. I. T.  Libraries.  He is a coordinator of
the Distributed Library Initiative, a five-year M. I. T. umbrella
program to revolutionize electronic library services.  He has
published articles on a variety of library topics, including
electronic publishing and the role of the library in the social
creation of knowledge.  He is an M. I. T. representative to the
Coalition for Networked Information.  He has been active in
academic/research librarianship for 12 years.

3.  Mitchell N. Charity, Research Staff, M. I. T. Laboratory for
Computer Science.  In cooperation with Professor Saltzer, Mr. Charity
has been the chief designer and implementor of the current Library
2000 testbed.

4.  M. I. T. and its Laboratory for Computer Science have a number of
facilities that are especially adapted to carrying out this research.
Chief among these is MITNet, a pervasive medium-bandwidth (10
MBits/sec) network that connects the 1000 workstations of Project
Athena, their servers, and desktop computers in virtually every office
within the Laboratory and elsewhere at M. I. T.  MITNet is connected
with high-bandwidth links to the Internet, which allows effective
cooperation in research with other organizations; the proposed storage
service will be delivered via the Internet to clients at M. I. T., to
other cooperating project participants, and to interested parties at
Digital.
The M. I. T. Library system, a partner in this project, provides a
repository of information that is large enough in size and broad
enough in scope that it can serve well as a test database for this
project.  In particular, it will be able to provide a source for those
computer science student theses that have never been published as
technical reports.  In addition, the MicroReproduction Laboratory of
the Library System will develop facilities for scanning documents; these
facilities will be the primary data input mechanism for the project.



Configuration:

The requested hardware comprises three Alpha servers, each equipped
with 22 Gbytes of disk storage, which will be used to implement a
triply-replicated storage system, and six Alpha workstations for
development and experimental access to the storage system.  One of
the three servers is equipped with spare TurboChannel slots, to allow
for future attachment to a proposed high-performance network in
Technology Square.

Proposed Configuration (all prices are Standard, as of April 22, 1993):

one PE511 DEC 3000 500S AXP server with
        64 Mbyte RAM
        2 1.0 Gbyte disks
        OSF/1, base, server, multiuser, and developer's licenses
        3 SCSI-2 dual TurboChannel cards
        7 BA-350 disk storage enclosures, containing
           22  1.0 Gbyte disks
                                                                $139,911

two PE411 DEC 3000 400S AXP servers with
        64 Mbyte RAM
        2 1.0 Gbyte disks
        OSF/1, base, server, multiuser, and developer's licenses
        3 SCSI-2 dual TurboChannel cards
        7 BA-350 disk storage enclosures, containing
           22  1.0 Gbyte disks and 1 DAT drive
                                             2  @ 121,250       $242,500

six PE301 DEC AXP workstations with
        32 Mbyte RAM
        1 1.0 Gbyte disk
        19" color display
        Audio Headset
        OSF/1 base, multiuser, and developer's licenses
                                             6  @ 20,280         120,930

associated documentation and media                                 2,100

                                       Total cost               $505,441

A complete and detailed quotation has been prepared by Andrea Hauser.
