LIBRARY 2000
A research prototype of the on-line electronic library of tomorrow.
by Jerome H. Saltzer
October 31, 1991
Abstract: The technology of on-line storage, display, and
communications will, by the year 2000, make it economically possible
to place the entire contents of a library on-line, accessible from
computer workstations located anywhere. The goal of Library 2000 is
to understand and try out the system engineering required to fabricate
such a future library system from the underlying technologies.
The Vision: One can browse any book, journal, paper, thesis, or
report in the library using a standard office desktop computer, and
can follow citations by pointing; the thing selected pops up
immediately in an adjacent window.
-----
Some details.
1. Why? Digital storage in the year 2000 will be 4 orders of
magnitude cheaper, and therefore at least 4 orders of magnitude larger
than it was ten years ago. We need to start now to figure out how to
manage that much data. Finding things, linking things together
(especially across the network), and keeping it reliable will all
require new ideas.
2. Why now? The three underlying technologies have improved markedly
in recent years:
- magnetic disk storage cost/bit and absolute size. Within the
decade one will be able to hold a page-image library entirely
online at reasonable cost in both hardware and floor space.
- data communications, both campus-sized and nationwide, now or
soon will permit moving page images from the library to the
workstation in a human reaction time, again at reasonable cost.
- display is at a threshold where reading bitmapped images will become
acceptable.
and, in addition, the client/server model has matured to the point
where it is directly applicable.
3. The interesting system support design problems.
- persistence of storage. Current systems aren't designed to
handle data that is meant to be retained for several decades, a
span one or two orders of magnitude longer than the lifetime of
the storage media, compression techniques, and forward error
correction techniques, not to mention representation methods.
How can one update media technology underneath
Terabytes of storage with confidence that information hasn't
been damaged? For information that isn't regularly used (e.g.,
the least-used 50% of a library's collection) what are good
tradeoffs among the number of copies, reliability, and
geographical dispersion? The project will explore what is
required to make a media refresh subsystem a standard component
of an operating system.
- backup. For reliability, there must be more than one copy of
the data, but traditional backup methods involving full and
incremental copies made to tape do not appear to scale up well
in size and they are notoriously complex and error-prone.
The project will explore the hypothesis that one can provide the
necessary reliability with geographically separated multiple
copies, plus a change log, which allows recovery from mistakes
in updating.
- representation. There appear to be many possibilities, e.g.,
scanned images, ASCII, PostScript, SGML, FAX Group IV, etc., but the
requirement of storing the data for decades seems to reduce the
field drastically. The project will explore the hypothesis that
the right thing to do is to collect scanned images and
minimally-tagged ASCII for every object.
- organizational cooperation. Different organizations will be the
natural providers of different items. New relations of
interdependence among players must be worked out. Who takes primary
responsibility to store what? Who stores secondary/backup copies?
Where does one put the caches? Which bits are authoritative?
The project will examine this question by actively cooperating
with several other organizations doing similar work.
- applying the client/server model. The world of libraries has
several requirements that act to inhibit application of
technology; it appears that a network of multiple, dedicated
servers and user-owned desktop workstations meets several of
those requirements and provides a natural solution to the
problems. Some traditional problems that the client/server
model seems to help with:
- the presentation device has been owned by the library, not
the customer, thus limiting location and ubiquity.
- the circulation management system, which needs fast
response, has had to operate in the same computer as the
bibliographic search program, which soaks up lots of
cycles and degrades response to other activities.
- libraries do bibliographic cataloguing of books, while
for-fee services do it for journals. In both cases the
bibliographic material duplicates some information found
in the item catalogued, and the two worlds use independent
and different access methods.
- the user needs to see a distinction between the two search
outcomes "the library doesn't have it" and "can't identify
it from what you said". Present systems merge those
outcomes into a single "search failed".
- monolithic library systems make any change a big deal;
client/server components can plug together like a hi-fi system,
allowing modular replacement.
- interlibrary cooperation is at arm's length, yet much of any
collection duplicates other collections. The client/server
model makes cooperation technically straightforward.
- one needs different search techniques for different
collections.
- one wants to search several public collections with a
single request, and to extend a search to private
collections without first having to register their
existence.
Each of these concerns appears superficially to be well
addressed with a client/server model, but field experience with
a real design is needed to see if it actually works. At the
same time, several additional problems aren't automatically
solved by the client/server model, and require further ideas:
- identification. Given a citation, how to locate the object cited.
Involves several interacting components--just what object is it,
how is it identified, what library holds it, how do we get our
hands on it. What part of this is done at the time a document is
archived, what part is done at the time a link is followed?
- user interface. How to provide a simple, intuitive model of
what is going on, especially if multiple collections are being
searched.
- cross-representation coordination. How does one relate a
mouse gesture on a displayed image to a particular set of
words in the ASCII form?
- coordination of multiple, possibly variant copies of information.
How does one discover that two things from different collections are
actually the same object, or a minor variation?
- allowing for authorship and reviewed publication of links, so
that users can concentrate on competent links.
- allowing new ideas for search to fit in gracefully.
The project will explore this area by developing a candidate
client/server architecture and building it to see how well it
actually works.
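One of the concerns listed above, that the user should see "the
library doesn't have it" and "can't identify it from what you said"
as separate outcomes even when several collections are searched with
a single request, can be pictured with a minimal Python sketch. It is
an illustration only, not the project's design: the names Outcome,
Item, CollectionServer, and search_all are hypothetical, and matching
is a naive whole-word test on titles.

```python
from dataclasses import dataclass
from enum import Enum


class Outcome(Enum):
    FOUND = "found"
    NOT_HELD = "the library doesn't have it"
    NOT_IDENTIFIED = "can't identify it from what you said"


@dataclass(frozen=True)
class Item:
    title: str


class CollectionServer:
    """Hypothetical stand-in for one collection's search service."""

    def __init__(self, name, catalogue, holdings):
        self.name = name
        self.catalogue = list(catalogue)   # items this server can identify
        self.holdings = set(holdings)      # items it actually holds

    def search(self, words):
        wanted = {w.lower() for w in words}
        matches = [item for item in self.catalogue
                   if wanted <= set(item.title.lower().split())]
        if not matches:
            return Outcome.NOT_IDENTIFIED, []
        held = [item for item in matches if item in self.holdings]
        if held:
            return Outcome.FOUND, held
        return Outcome.NOT_HELD, matches


def search_all(servers, words):
    """Send one request to every server and keep the outcomes distinct."""
    per_server = {s.name: s.search(words) for s in servers}
    outcomes = {outcome for outcome, _ in per_server.values()}
    # Prefer the most informative outcome reported by any collection.
    for best in (Outcome.FOUND, Outcome.NOT_HELD):
        if best in outcomes:
            return best, per_server
    return Outcome.NOT_IDENTIFIED, per_server
```

A client would call search_all with a list of servers, public or
private, and report the overall outcome while keeping the per-server
results for display. Because the servers are passed in by the caller,
a private collection in this sketch needs no prior registration.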
4. Frequently-mentioned problems that are left for others to work on.
- advanced search ideas (relevance and relatedness measures, image
and structure search, etc.)
- advanced user interfaces; bright ideas for improving navigation,
or providing better arrival and departure context.
- back-office automation improvements. (circulation/overdue,
acquisitions, serials control, cataloguing, interlibrary loans, etc.)
- ideas to deal with the probable dislocation of revenue flows.
(The proposed system would provide a good laboratory in which
others can explore innovative proposals in those problem areas.)
5. The method.
- build a prototype electronic library, with provision for
- bibliographic records
- abstracts
- full text
- page images
- simple keyword-based search
- multiple public collections
- unregistered private collections
- multiple-copy coordination
- interface to traditional library back-office support systems
- a simple, portable, X-based presentation interface
- stock it, on a small scale
- AI/LCS reading room bibliographic records
- computer science bibliographic records extracted from Barton
- AI and LCS Technical Report abstracts
- AI and LCS TR text (obtained by canvassing old WP files)
- AI and LCS TR page images
- investigate various directions
- how would it scale to an M.I.T.-sized library
(2M items)
- what parts would be applicable to a library the size of
the Library of Congress (60M items)
- try out storage architecture ideas intended to provide
multiple-decade persistence
- try out replication for backup
- explore placement of compression and forward error
correction inside the file system.
- begin stocking on the next larger scale
- collect machine-readable abstracts of TR's from as many
other computer science organizations as possible.
- put online images, and if possible text, of recent M.I.T. theses
- do the same for all recent M.I.T. Technical reports
- make the full M.I.T. library catalog available
- upgrade two Athena clusters with displays suitable for
this application.
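The "replication for backup" experiment in the list above, together
with the section 3 hypothesis that separated copies plus a change log
can replace tape backup, might be sketched in Python as follows. This
is a toy model under stated assumptions, not the project's storage
architecture: the ReplicatedStore class and its methods are
hypothetical, each "site" is just an in-memory dictionary, and a
damaged copy is repaired by majority vote among the copies.

```python
from collections import Counter


class ReplicatedStore:
    """Toy model: one dictionary per site, plus a shared change log."""

    def __init__(self, sites):
        self.replicas = {site: {} for site in sites}
        self.change_log = []   # ordered (key, value) updates

    def put(self, key, value):
        # Record the update first, then apply it to every copy.
        self.change_log.append((key, value))
        for copy in self.replicas.values():
            copy[key] = value

    def recover(self, entries_to_keep):
        """Undo mistaken updates by replaying only the early log entries."""
        state = {}
        for key, value in self.change_log[:entries_to_keep]:
            state[key] = value
        del self.change_log[entries_to_keep:]
        for site in self.replicas:
            self.replicas[site] = dict(state)

    def repair(self, key):
        """Overwrite a minority (damaged) copy with the majority value."""
        votes = Counter(copy.get(key) for copy in self.replicas.values())
        value, _ = votes.most_common(1)[0]
        for copy in self.replicas.values():
            if value is None:
                copy.pop(key, None)
            else:
                copy[key] = value
        return value
```

In this sketch the change log handles mistakes in updating (recover),
while the extra copies handle damage to a single site (repair). A real
design would also have to keep the log itself in more than one place.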
6. Time scale.
Three to five years.
7. Opportunities for cooperation.
Within M.I.T....
- The M.I.T. Library System wants to move in this direction once
it is proven, and has offered cataloguing data for experiments,
willingness to get into the scanning business and opportunities
to help guide future system plans.
- The M.I.T. Industrial Liaison Program, a major distributor of
M.I.T.-originated Technical Reports, would like to explore
making an on-line collection available to its corporate clients,
and is willing to provide another place to test the framework.
- The M.I.T. Technology Licensing Office is providing support both in
setting policy for access to M.I.T. materials and in working out
copyright assignment problems.
- The Advanced Network Architectures group of the M.I.T.
Laboratory for Computer Science wants to develop novel protocol
suites for cross-reference of information at different network
sites. The requirements of the electronic library will
influence the infrastructure they intend to develop and
vice-versa.
- M.I.T. Project Athena will provide its environment as an
experimental application delivery vehicle.
Beyond M.I.T....
- A cooperative venture is underway with a staff member at DEC CRL
to develop jointly the protocol that links an information client,
a storage service, and an indexing service.
- CMU (Project Mercury) already has a similar framework working, with
more emphasis planned on object-oriented document structure.
Has offered to exchange both software and ideas.
- Thinking Machines (WAIS project) has another similar framework
in operation, with more emphasis on search and document relations.
Has offered to exchange both software and ideas.
- DEC SRC has offered to exchange access to online Technical
Reports, and would provide an opportunity to learn how to make
things work across filtering gateways.
- A consortium of CMU, UC/Berkeley, Stanford, and M.I.T. has
proposed to exchange scanned computer science technical reports
and related bibliographic information, providing another opportunity
to work with multiple collections.
8. Resources required.
- a library willing to act as a prototype. (The LCS/AI reading
room is the current library of choice. M.I.T. theses and TR's
involve the M.I.T. Library System as part of a later step.)
- Hardware.
- storage. (total ~3.6 Terabytes, to be organized as a triply
replicated 1.2 Terabyte system, for persistent
image storage.) Fifteen servers each initially
equipped with twenty 1.2 Gbyte disks. To be later
upgraded to twenty 12 Gbyte disks each.
- indexing. Nine high-performance servers each equipped with
three 1.2 Gbyte disks and 128 Mbyte RAM.
(for index and search engines.)
- network. 10 Mbit/sec to start, can use upgrades as they
become available.
- display. Ten workstations for development use. Later,
upgrades for 80 Athena workstations, for delivery.
- misc: cut page scanner, two laser printers.
- People.
- 65% of one faculty member as supervisor.
- one professional staff programmer.
- two graduate student research assistants.
- six undergraduate research students.
- 75% of an FTE library staff person for cataloguing and assuming
novel collection management burdens.
- 25% of a support secretary.
- Space.
- Office space for the listed people. (4 or 5 offices.)
- Miscellaneous.
- Software acquisition (search, user interface, changes to
vendor systems)
- Travel
- System and Network maintenance
- Materials & services
- Overhead.
- M.I.T.
- L.C.S.
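The storage figures in the hardware list above can be checked with a
few lines of Python arithmetic. The server, disk, and replica counts
come from the list; the decimal conversion (1 Terabyte = 1000 Gbyte =
1,000,000 Mbyte) and all names in the code are assumptions made for
this check.

```python
MBYTE_PER_GBYTE = 1000          # assumed decimal units
MBYTE_PER_TBYTE = 1000 * 1000

STORAGE_SERVERS = 15
DISKS_PER_STORAGE_SERVER = 20
REPLICAS = 3

INDEX_SERVERS = 9
DISKS_PER_INDEX_SERVER = 3


def raw_mbytes(servers, disks_per_server, disk_mbytes):
    """Total raw disk capacity, in Mbytes, across a group of servers."""
    return servers * disks_per_server * disk_mbytes


# Initial configuration: 1.2 Gbyte disks (1200 Mbyte each).
initial = raw_mbytes(STORAGE_SERVERS, DISKS_PER_STORAGE_SERVER, 1200)

# After the upgrade: 12 Gbyte disks (12000 Mbyte each).
upgraded = raw_mbytes(STORAGE_SERVERS, DISKS_PER_STORAGE_SERVER, 12000)

# Triple replication leaves one third of the raw space as usable storage.
usable_after_upgrade = upgraded // REPLICAS

# Index servers: three 1.2 Gbyte disks each.
index_total = raw_mbytes(INDEX_SERVERS, DISKS_PER_INDEX_SERVER, 1200)

print("initial raw storage (Gbyte):", initial / MBYTE_PER_GBYTE)
print("upgraded raw storage (Tbyte):", upgraded / MBYTE_PER_TBYTE)
print("usable after upgrade (Tbyte):", usable_after_upgrade / MBYTE_PER_TBYTE)
print("index disk total (Gbyte):", index_total / MBYTE_PER_GBYTE)
```

On these assumptions the quoted totals of about 3.6 Terabytes raw and
1.2 Terabytes triply replicated correspond to the upgraded 12 Gbyte
disks; the initial 1.2 Gbyte disks give one tenth of that.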