Library 2000 Architecture: Hylton view 2

[This document provides an overview of the architecture of a digital
library. It is derived from discussions with and notes exchanged by
Jerry Saltzer, Mitchell Charity, Andrew Kass, and Jeremy
Hylton. 10/18/94]

We envision a global distributed library architecture that connects
many independent clients and servers to provide the kind of service
that traditional libraries provide. 

A digital library builds on the developing networked information
infrastructure. Improvements in underlying computer technologies, like
increasing disk drive storage capacity and fast communication speeds,
make it possible to provide the quality of information access that
people expect from a library in an electronic medium. It is, of
course, easy to say that the enabling technology exists or will soon
exist; actually assembling the component technologies into a reliable
system is a large-scale engineering problem.

[need to incorporate Jerry's discussion of "What, exactly, is a
library?"]

The architecture we envision is based on a few fundamental services --
storage services, discovery services, and navigation services -- and
the communications between them. At least as important as these basic
services, however, is a design philosophy of composing higher-level
systems from modules that each reflect a single function. This allows
specialized services to be gracefully composed of smaller independent
and interoperable units.

I will begin with an overview of the fundamental components of the
system and a few examples of specialized services that can be layered
on top of them before delving into the specifics of each
service. Consider the following a definition of terms. The definitions
may be mutually recursive, but I believe they are well defined. :-)

* Storage service.

A storage service, or archive, is a permanent, reliable archive of
documents. It provides a simple interface that allows an (optionally)
authorized client to retrieve a particular document, in a particular
format, when given that document's name. The storage service provides
a strong guarantee that a retrieval operation on a specific document
will always return the same thing. (An archive may be internally
replicated to help maintain that guarantee.)

The storage service also provides a transaction-based interface for
archive maintainers to add and modify documents, and for clients of
the archive to discover changes to the archive. For example,
when a maintainer adds a document to a storage service, the system
assigns a transaction number to the action. Later, an index service
can ask the server "What's new?" by providing a transaction id; the
storage service responds with a list of all the new or modified
documents since that transaction.

* Discovery service.

A discovery service helps an authorized user find documents of
interest. It could be a simple card catalog or a full-text index; it
could limit its scope to a particular field or provide access to a
wide range of fields. This component of the system is the most
dependent on the decisions of the curator or reference librarian who
maintains the service.

The interface to the discovery service is minimal. A client submits a
query (a list of keywords, complicated search expression, Z39.50?) and
receives a list of documents that match the query. The format of the
response is somewhat flexible; the server may return only a list of 
document identifiers (see Document IDs, below).

A discovery service may also return some hints about each
document. A hint is some information about the contents of the
document named; it could be a short bibliographic entry or a
description of the available formats for the document. The hint is not
guaranteed to be correct; with the document ID alone, however, the user
can retrieve information that is guaranteed to be correct. The hint
is intended to help users determine the relevance of a particular
document, when retrieving the canonical information may be costly.
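As a rough illustration of this interface, here is a minimal sketch of
a keyword-only discovery service. Everything in it is an assumption
made for the example: the names KeywordIndex and Hit, the choice of
"all keywords must match" semantics, and the representation of a hint
as a plain string. The text above leaves the query language and
response format open.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Hit:
    """One query result: a document ID plus an optional hint."""
    doc_id: str
    hint: str = ""   # advisory only; may be stale or wrong


class KeywordIndex:
    """Toy discovery service: exact keyword match over indexed text."""

    def __init__(self):
        self._postings = {}   # keyword -> set of document IDs
        self._hints = {}      # document ID -> hint string

    def add(self, doc_id, text, hint=""):
        for word in text.lower().split():
            self._postings.setdefault(word, set()).add(doc_id)
        self._hints[doc_id] = hint

    def query(self, keywords):
        """Return hits for documents that contain every keyword."""
        sets = [self._postings.get(k.lower(), set()) for k in keywords]
        matches = set.intersection(*sets) if sets else set()
        return [Hit(d, self._hints.get(d, "")) for d in sorted(matches)]
```

A client that needs authoritative information would ignore the hint
and use the document ID to fetch it.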

* Document.

[I'm not sure what a good definition is.]  A particular kind of
intellectual property defined, perhaps, by the decision of some
authority to publish it. There is probably a common-sense
definition of a document that is widely used, but at the fringes of
common use there are surely debates.

Examples of documents include books, articles in a scholarly (or
popular) publication, and technical reports or memos published by an
academic institution. We would also like to distinguish between
different versions of a document (first edition, second edition, etc.)
and different representations of a document (text, PostScript,
image). Different namespaces will have different rules governing when
two documents are equivalent.

Related to a document are its ID and its descriptor. Both are
described below.

* Document ID.

To borrow a definition from the URI working group: The purpose or
function of a document ID is to provide a globally unique, persistent
identifier used for recognition, for access to characteristics of the
document or for access to the document itself. A document may have
several names.

In particular, we distinguish between the name of a document and its
location. The combination of a (FQDN of a) storage service and the
name used by that storage service constitutes a location, not a name. 

A particular document ID belongs to a namespace. There can be any
number of namespaces, and thus a document can have different names in
different namespaces. There may be, but isn't necessarily, a way to
translate between different namespaces.
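The distinction between a name and a location, and the optional
translation between namespaces, can be sketched with two small record
types. The type names, the field names, and the lookup-table form of
translation are assumptions for illustration; the namespace labels in
any example are made up.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class DocumentID:
    """A name: persistent, and independent of where copies are stored."""
    namespace: str
    name: str


@dataclass(frozen=True)
class Location:
    """A location, not a name: a storage service plus its local name."""
    storage_fqdn: str
    local_name: str


def translate(doc_id, target_namespace, table) -> Optional[DocumentID]:
    """Return an equivalent name in target_namespace, or None if unknown.

    `table` maps (DocumentID, target namespace) -> DocumentID. A missing
    entry models the case where no translation exists.
    """
    if doc_id.namespace == target_namespace:
        return doc_id
    return table.get((doc_id, target_namespace))
```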

* Meta-information

Each of the services in the system may need information about the
documents they operate on; in general, this information can be
categorized as meta-information. For example, when a document exists
in multiple formats, a browser could use meta-information that
describes the different formats. Bibliographic information is another
kind of meta-information; it can be used in many ways -- helping to
determine the relevance of a document for a search will probably be a
common use.

There are many kinds of meta-information, primarily because of the
different characteristics of the data needed by a particular
service. (end-to-end)

Meta-information describes the content of other information; it
comments on other information, so it is itself just information and
needs to be treated as such.

Because of the specific needs of individual services, meta-information
may be replicated and cached throughout the system.

A document descriptor contains information about the characteristics
of the document. By characteristics, we mean information about
particular electronic representations of a document. Examples of
information contained in a descriptor include: available formats,
a list of storage services that hold the document, the size in bytes
of a particular format.

The descriptor serves as an extra layer of indirection between a named
document and its digital representations. When a client attempts to
retrieve a document, it must make decisions about what format it wants
and which storage service to get it from; the descriptor facilitates
those decisions.
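To make the indirection concrete, here is a minimal sketch of a
descriptor and of a client choosing a format and a storage service
from it. The field names and the "first preferred format that some
archive holds" policy are assumptions; a real client might also weigh
size, cost, or network distance.

```python
from dataclasses import dataclass, field


@dataclass
class Representation:
    fmt: str                  # e.g. "text", "postscript", "image"
    size_bytes: int
    archives: list = field(default_factory=list)  # storage services holding it


@dataclass
class Descriptor:
    name: str
    representations: list = field(default_factory=list)


def choose(descriptor, preferred_formats):
    """Return (format, archive) for the first preferred format available.

    Returns None when no preferred format is held by any archive.
    """
    for fmt in preferred_formats:
        for rep in descriptor.representations:
            if rep.fmt == fmt and rep.archives:
                return rep.fmt, rep.archives[0]
    return None
```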

* Navigation Service.

A navigation service provides a necessary piece of glue between
discovery services and storage services. A discovery service provides
a list of document IDs, but an ID alone contains no information about
how to retrieve a copy of the named document.

When a client wants to retrieve a copy of a document, it asks a
navigation service to transform the ID of the document into a
descriptor. Based on the information contained in the descriptor, the
client can retrieve an actual copy of the document, contingent upon
authorization, making payments, etc.

The information provided by a navigation service (document
descriptors) must itself be stored in an archive. The descriptors
contain information necessary to retrieve and make sense of a
document. We envision navigation services acting as a highly available
copy of that information. A navigation service provides the descriptor
as a hint; it is likely, but not guaranteed, to be correct.

A navigation service is not likely to understand how to resolve IDs in
every conceivable namespace. Instead, a navigation service will
perform whatever inter-server communication is necessary to resolve
the ID. If it does not understand the namespace, it will contact
another navigation service that does.
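A minimal sketch of that delegation follows. It assumes each
navigation service understands exactly one namespace and knows a list
of peer services; both are simplifications for the example, and the
class and method names are hypothetical. The set of visited services
only guards against forwarding loops.

```python
class NavigationService:
    """Toy resolver: answers for one namespace, forwards everything else."""

    def __init__(self, namespace, records, peers=()):
        self.namespace = namespace
        self.records = dict(records)   # name -> descriptor (a hint)
        self.peers = list(peers)       # other navigation services

    def resolve(self, namespace, name, _seen=None):
        """Return a descriptor for (namespace, name), or None if unresolved."""
        seen = _seen if _seen is not None else set()
        if id(self) in seen:
            return None                # already asked; avoid a loop
        seen.add(id(self))
        if namespace == self.namespace:
            return self.records.get(name)
        for peer in self.peers:
            found = peer.resolve(namespace, name, seen)
            if found is not None:
                return found
        return None
```

The same logic could live in the client instead; the sketch does not
settle which side should own it.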

[Is this the right place to package the functionality, or should it be
the client's responsibility?]

* Specialized (Third-Party) Service.

There are many services that are not provided by the basic
architecture. The core services described above can be used as basic
building blocks to compose more complicated services. The underlying
architecture should support this composition.

There are many possible third-party services. We list a few as
illustrative examples:
- a service that converts a document from one format to another
on-the-fly;

- a service that determines if two names refer to the same document;

- a service that provides actual documents in response to searches (a
composition of search service and archive);

- a service that queries several distinct indexes and returns the
results as a single response;

- a service that converts a standard bibliographic entry into a
name. (This is actually a specialized discovery service that specifies
a particular kind of query.)
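As one example of such composition, the service that queries several
indexes and returns a single response can be sketched as a small
function over existing discovery services. Here each index is assumed
to be a callable that takes a list of keywords and returns a list of
document IDs; that calling convention, and the function name, are
inventions for this sketch.

```python
def federated_query(indexes, keywords):
    """Query every index and merge the results into one list.

    Duplicate document IDs are dropped; the first occurrence keeps its
    position, so earlier indexes take precedence in the ordering.
    """
    seen = set()
    merged = []
    for query in indexes:
        for doc_id in query(keywords):
            if doc_id not in seen:
                seen.add(doc_id)
                merged.append(doc_id)
    return merged
```

Note that this only removes IDs that are literally equal. Deciding
whether two different names refer to the same document is the separate
service listed above.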

* Authorization Service.

An authorization service can be viewed as yet another third-party
service; however, its existence is necessary to implement security
measures for storage and discovery services. 

While we do not describe the design of a specific cryptographic
service, we assume that a global key distribution infrastructure must
be in place to build a secure library system.

Rationale for Design

The library system is designed to be a long-term, persistent archive
for documents. Persistence is the primary goal. Given that goal, we
proceed with the mindset that lots of things can happen to a single
site.  Thus there must be multiple sites, with great pains taken to
maintain independent failure modes through geographic, personnel,
practice, and software independence.

The kinds of failures we want to cope with include: hardware failures
of any component (or group of components); failures of the underlying
infrastructure, e.g. network failure; unintentional misuse by users of
the system; and software failures.

The rationale for dividing the system into the basic units that we did
is described below.

1. It is good design practice to build modules that contain a
single identifiable piece of functionality. A simple and elegant
module makes reasoning about the module's correctness easier; it is
easier to prevent or guard against software failures when the software
is (relatively) simple.

2. Different services will be deployed in different numbers within the
system. We expect the makeup of the system to look something
like this:

- A client program will run on every desktop workstation.

- Each library will maintain its own discovery service, reflecting its
curational decisions about which documents to collect. Many
specialized collections may exist separate from libraries.

- Relatively few large archives. Perhaps one for each large
publisher or library. Many smaller institutions will probably contract
out storage, because of the difficulty and expense of maintaining a
reliable storage service.

3. A modular architecture provides more graceful scaling and eases
interoperability.

4. Fault tolerance. Core services should not depend on the existence
of other services to operate. And individual services can be split
across several machines or replicated to allow greater fault tolerance.

5. High availability. By making this a widely distributed system, we
expect to increase the availability of any particular service.



Notational Note
Op: X -> Y, Z*
The operation takes X as input and returns Y and zero or more Zs. 

Data types:
ID = a name for data in an archive
Data = bits
TID = a transaction ID for a particular archive

** Storage Service **

The basic storage service interface contains three operations: get,
put, and modify. For effective management of the archive, we add a
transaction system and some operations related to it. 

Get: ID -> Data

The client application presents an identifier to the repository and
receives some bits.  Assume that each request either generates the
correct bits or an error because the ID is invalid.

Put: Data -> ID, TID



Modify: ID, Data -> TID

What's New: TID -> TID, ID*
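The four signatures above can be read as the following in-memory
sketch. It is only one possible reading. The assumptions are: IDs are
assigned by a counter, TIDs are consecutive integers starting at 1, a
TID of 0 means "from the beginning", and What's New returns the
current TID together with the IDs touched after the TID supplied. A
real archive would also be persistent and replicated.

```python
from dataclasses import dataclass, field


@dataclass
class Archive:
    """Toy storage service implementing Get, Put, Modify, and What's New."""
    docs: dict = field(default_factory=dict)   # ID -> Data
    log: list = field(default_factory=list)    # log[i] is the ID touched by TID i + 1

    def _commit(self, doc_id):
        self.log.append(doc_id)
        return len(self.log)                   # the new TID

    def get(self, doc_id):
        # Get: ID -> Data. An invalid ID is an error, never a guess.
        if doc_id not in self.docs:
            raise KeyError(f"invalid ID: {doc_id}")
        return self.docs[doc_id]

    def put(self, data):
        # Put: Data -> ID, TID. The archive picks the ID (assumed: a counter).
        doc_id = f"doc-{len(self.docs) + 1}"
        self.docs[doc_id] = data
        return doc_id, self._commit(doc_id)

    def modify(self, doc_id, data):
        # Modify: ID, Data -> TID.
        if doc_id not in self.docs:
            raise KeyError(f"invalid ID: {doc_id}")
        self.docs[doc_id] = data
        return self._commit(doc_id)

    def whats_new(self, tid):
        # What's New: TID -> TID, ID*. Each changed ID is listed once.
        changed = list(dict.fromkeys(self.log[tid:]))
        return len(self.log), changed
```

A caller keeps the TID returned by What's New and passes it back on
the next call, so it sees each change once.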

** Discovery Service **

** Navigation Service **



[ below is more working notes than finished product ]
** Document Creation Scenario **

I'd like to flesh out the administrative task of adding a document to
an archive. Specifically, I want to consider the task of creating a
document descriptor that will reside in the archive along with the
individual representations of the document.

I assume that document representations (almost?) never change, and
that document descriptors change only infrequently. 

The name refers to the document, which can have multiple
representations. 

Steps to create and archive a document:

1. Produce digital representation(s) of document
2. Register the document with a namespace (of your choice)
3. Create a document descriptor that describes each of
   the possible representations.
4. Enter both the document representation(s) and the
   descriptor in the archive
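
The four steps above can be sketched as one function. The names here
are hypothetical: register_name stands in for whatever a namespace's
registration procedure turns out to be, put stands in for the
archive's Put operation, and the JSON layout of the descriptor is an
assumption made only so that the example has something to store.

```python
import json


def archive_document(representations, register_name, put):
    """Archive a document and its descriptor.

    representations: dict mapping format name -> bytes (step 1, already done)
    register_name:   callable returning a new document name (step 2)
    put:             callable storing bytes and returning an archive ID (step 4)
    """
    name = register_name()                                    # step 2
    descriptor = {                                            # step 3
        "name": name,
        "formats": {fmt: len(data) for fmt, data in representations.items()},
    }
    ids = {fmt: put(data) for fmt, data in representations.items()}   # step 4
    descriptor_id = put(json.dumps(descriptor, sort_keys=True).encode("utf-8"))
    return name, ids, descriptor_id
```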

Where do we go from here?

1. The publisher now has a permanent name for the document. He can ask
a particular navigation service to create a resolution record for it;
this could be part of the process of registering a name. Once the
document is archived, and a resolution record exists at a navigation
service, the publisher can send advertisements to librarians asking
them to catalog the document.

2. Both discovery services and navigation services occasionally ask
archives, "What's new?" Each will discover the new document and its
descriptor. 

2a. The discovery service will read the document descriptor and decide
on a format that is appropriate for it to index (e.g. text, bib). In
its index, it will record the name of the document as given in the
descriptor.

2b. A navigation service will keep copies of all the descriptors,
along with the place it got them from. The navigation service can use
this information to resolve names, and can provide the descriptor to
its clients as a hint.
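
Step 2b might look like the sketch below: a cache that polls an
archive, remembers the last transaction ID it saw, and records each
descriptor together with its source. It assumes that What's New
accepts 0 to mean "from the beginning", that every changed ID names a
descriptor, and that a descriptor is a dict with a "name" key. None of
these is settled by the notes above.

```python
class DescriptorCache:
    """What a navigation service might keep between polls."""

    def __init__(self):
        self.entries = {}   # document name -> (descriptor, source archive)
        self.cursors = {}   # source archive -> last TID seen there

    def poll(self, source, whats_new, fetch_descriptor):
        """Ask one archive "What's new?" and absorb the answer.

        whats_new:        callable TID -> (new TID, list of changed IDs)
        fetch_descriptor: callable ID -> descriptor dict
        Returns the number of changed IDs processed.
        """
        last = self.cursors.get(source, 0)
        new_tid, ids = whats_new(last)
        for doc_id in ids:
            descriptor = fetch_descriptor(doc_id)
            self.entries[descriptor["name"]] = (descriptor, source)
        self.cursors[source] = new_tid
        return len(ids)
```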

Other issues; some are ideas I like, some are just ideas

- My big concern about naming is the following: Imagine some enormous
catastrophe knocks out all of the non-iron mountain storage in the
library system. We should be able to recover just from the information
in iron mountain. Thus, we need to store a descriptor in iron mountain.

- Perhaps we can limit the mutability of the descriptors. They could
be logs, i.e. you can add things to them but you can't remove things.

- When a new representation of a document is added to an archive
(e.g. PostScript level 3), the document descriptor must be changed to
reflect the new version. The descriptor is stored in the archive, so
this implies that the archive must be mutable.

- At the same time, it might be desirable, but not necessary, to add
an extra name to the list of names in the descriptor. Thus, I register
for a new name, an LCCN, and I add that to the descriptor.

- A possible problem arises. If different archives have different
representations for a document, they will have different descriptors,
but potentially the same name. Thus, the name is not unique in the
sense that it refers to exactly one sequence of bits. I don't think
this is a problem, but rather the sense of "uniqueness" needs to be
understood.

- This discussion really punts on the issue of creating names. What
does it mean administratively and technically to register with a
namespace?

I would advocate using checksums, except that we don't have anything
in particular to checksum; the descriptor needs to be mutable, so we
can't use it.

In particular, I think the use of checksums implies that the
data we use to generate the checksum be immutable, so that someone
else can checksum the data and verify that he got the right
thing. There is another model which suggests we just use checksum to
get a unique name with high probability.

One option: Use the checksum of an arbitrary representation of the
document. If we pick a representation that will not change (e.g. not
the descriptor, maybe not the bibliographic info), then the name is
unique.
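
A minimal sketch of that option follows. The hash function (SHA-256)
and the "sha256:" prefix are assumptions chosen for the example; the
notes above do not name a checksum algorithm.

```python
import hashlib


def checksum_name(representation: bytes) -> str:
    """Derive a name from the bytes of one immutable representation."""
    return "sha256:" + hashlib.sha256(representation).hexdigest()


def verify(name: str, data: bytes) -> bool:
    """Check that retrieved bytes are the ones the name was derived from."""
    return checksum_name(data) == name
```

Verification only works against the representation that was hashed;
other representations of the same document would share the name but
not the checksum.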
Last update 10/18/94 by jeremy