Library 2000 Architecture

These are working notes on the Library 2000 architecture. They are recently created and provide only a preliminary framework for further discussion. These notes are my own, but they are based on discussions with other members of the Library 2000 group. Although many of the ideas may not be my own, the inconsistencies and problems may well belong to me.

Now that I've made a single pass through all the issues that I had at least a note for, I'm not sure how to describe the contents of this document. It contains a mixture of three things: notes on our discussion, questions about how to flesh it out, and some rough ideas I had that I've included for purposes of further discussion. The last group is the oddball one; I'm not sure all the ideas are well thought-out but I've included them anyway.

Major conceptual divisions in the architecture

The overall architecture of the system is a loose federation of clients and services. Each user has a client application that communicates with several servers to perform user-level tasks, like searching for and viewing documents.

Repository

Service provided

A repository is a highly reliable storage service. Users may enter and retrieve data with the assurance that the data entered is the same as the data retrieved. Internally, the repository is composed of several replicated storage services.

Interface

Get: ID -> Data

The client application presents an identifier to the repository and receives some bits. Assume that each request either returns the correct bits or fails with an error because the ID is invalid.

Put: Data -> < ID, TID >

Authorized clients can place new items in the repository with put. The client presents a sequence of bits and the server returns the newly created identifier for the document and a transaction identifier that can be used to order all operations that change the state of the server.

What's New: TID -> < TID, ID* >

A client provides a transaction ID and the server responds with a list of all the IDs that have been added or modified since the specified transaction. The server also responds with the current transaction identifier.

Modify: < ID, Data > -> TID
Should this be included?

This operation would allow a user to change or delete an object placed in a repository.

Guarantees

Once a put operation has completed, the same bits will always be returned by a get operation with the appropriate ID.
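The interface above can be sketched in a few lines of Python. This is only an in-memory stand-in under my own assumptions: IDs and TIDs are small integers, there is no replication or persistence, and Modify is left out because it is still an open question. The class and method names are hypothetical.

```python
import itertools


class Repository:
    """In-memory stand-in for a repository (no replication, no durability)."""

    def __init__(self):
        self._data = {}                     # ID -> bytes
        self._log = []                      # (TID, ID) per state-changing operation
        self._next_id = itertools.count(1)  # assumption: IDs are integers
        self._tid = 0                       # current transaction identifier

    def get(self, doc_id):
        """Get: ID -> Data. Raises KeyError if the ID is invalid."""
        return self._data[doc_id]

    def put(self, data):
        """Put: Data -> <ID, TID>. Stores an immutable copy of the bits."""
        doc_id = next(self._next_id)
        self._tid += 1
        self._data[doc_id] = bytes(data)
        self._log.append((self._tid, doc_id))
        return doc_id, self._tid

    def whats_new(self, since_tid):
        """What's New: TID -> <TID, ID*>. Lists IDs changed after since_tid."""
        ids = [doc_id for tid, doc_id in self._log if tid > since_tid]
        return self._tid, ids


repo = Repository()
first_id, first_tid = repo.put(b"first report")
second_id, second_tid = repo.put(b"second report")
print(repo.get(first_id))
print(repo.whats_new(first_tid))
```

Because put never overwrites an existing ID, the guarantee holds trivially here; a real repository would have to preserve it across replicas and failures.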

Questions & hypotheses

If there are several machines participating in a replication group that is logically a single repository, how does one connect to a repository? At what level is the multi-machine composition of the repository exposed?

Each repository creates its own identifiers and has an internal mechanism for mapping those IDs to data.

Can data on the server be removed or modified? If data can only be added to a server, the replication model is somewhat simplified (though I don't understand the specifics of this claim). Given the current state of publishing, where changes to books are relatively rare and where major revisions, like new editions, often involve considering the two books to be different, modifications seem to be a relatively rare case. Removal of items could still be a useful feature, e.g. when data has been maliciously added, when data comes to be deemed "dangerous" by the provider, or when cutbacks in the scale, size, or funding of the service require removal of data.

Index

Service provided

An index lets users make queries over a body of data, and returns the names of documents. The kinds of searches that are possible and over what kind of data they are possible is left to the provider of the service.

An index service seems to be the part of the system that is most like a traditional library. Users would choose indexes they wanted to search when looking for information. The index may want to make some claims about completeness (we have all the neural net papers), or quality (we have only the good neural net papers).

Interface

Search: Query -> ID*

Guarantees

Some kind of guarantee seems appropriate, but I'm not sure what.

Questions and Hypotheses

What is the difference between an index and a repository? I suspect much of the difference is a trade-off between reliability and performance / availability. Indexes should have up-to-date data, and they should be easy to access; they may not go to great lengths to have persistent data. At the same time, the data that makes up an index of quality papers might belong in a repository.

What kinds of IDs does the index return? I think it returns a URN-style name like

< URN:IANA:mit.edu.dl:MIT-LCS-TR-30 >

An index could return IDs from more than one namespace, or more than one ID per document.

What is the maintainer/manager's interface to the service?

Name Service

Service provided

A name service translates IDs from one namespace to another, given the namespace to translate to. At one level, the name service can translate from ISBN to LCCN; at another level, it would translate from name to location, e.g. URN to URL.

Interface

Resolve: < IDa, NS > -> IDb
IDa and IDb are IDs from different namespaces. IDb is in the namespace specified by NS.

Guarantees

IDs should be self-identifying. If a name service purports to handle a particular namespace, I should be able to give the service any name and be told unequivocally whether the name belongs to that namespace, i.e. that it is a properly formed name. The scheme portion of a URL provides this function.
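A minimal sketch of Resolve together with the well-formedness guarantee, assuming a namespace can be described by a regular expression and that translations live in a simple lookup table. The class name, the register method, and the URN pattern are my own assumptions for illustration; the URN and URL strings are the ones used elsewhere in these notes.

```python
import re


class NameService:
    """Toy name service for a single source namespace."""

    def __init__(self, namespace, pattern):
        self.namespace = namespace
        self._pattern = re.compile(pattern)  # assumption: regex defines the namespace
        self._table = {}                     # (IDa, NS) -> IDb

    def is_well_formed(self, name):
        """Say unequivocally whether name belongs to this namespace."""
        return self._pattern.fullmatch(name) is not None

    def register(self, id_a, target_ns, id_b):
        """Hypothetical maintainer operation: record one translation."""
        if not self.is_well_formed(id_a):
            raise ValueError(f"{id_a!r} is not a {self.namespace} name")
        self._table[(id_a, target_ns)] = id_b

    def resolve(self, id_a, target_ns):
        """Resolve: <IDa, NS> -> IDb. KeyError if no translation is known."""
        if not self.is_well_formed(id_a):
            raise ValueError(f"{id_a!r} is not a {self.namespace} name")
        return self._table[(id_a, target_ns)]


urns = NameService("URN", r"URN:[^:]+:[^:]+:.+")
urns.register("URN:IANA:mit.edu.dl:MIT-LCS-TR-30", "URL",
              "URL:http://cs-tr.www.mit.edu/LCS/TR/30")
print(urns.resolve("URN:IANA:mit.edu.dl:MIT-LCS-TR-30", "URL"))
```

The sketch separates "this is not one of my names" (ValueError) from "this is one of my names but I cannot translate it" (KeyError), which seems to be the distinction the guarantee is after.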

Questions and Hypotheses

What is the maintainer/manager's interface to the service? How do I create a name space and advertise it to the world?

If a name service purports to handle naming authority FOO, I should be able to resolve any name given by authority FOO (as long as it is resolvable). This may involve the name service contacting another name service for FOO, because it doesn't know how to resolve a particular name, but the client needn't know about that.
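The forwarding behaviour might look something like the sketch below, assuming each name service knows a list of peer services for the same authority. The class name, the peers parameter, and the FOO names and example.org URLs are placeholders of mine; there is no protection against cycles between peers.

```python
class DelegatingNameService:
    """Name service that asks peer services when it lacks an answer."""

    def __init__(self, table, peers=()):
        self._table = dict(table)  # (name, target namespace) -> translated ID
        self._peers = list(peers)  # other services for the same authority

    def resolve(self, name, target_ns):
        key = (name, target_ns)
        if key in self._table:
            return self._table[key]
        # Not known locally: try each peer in turn. The client never sees this.
        for peer in self._peers:
            try:
                return peer.resolve(name, target_ns)
            except KeyError:
                continue
        raise KeyError(name)


upstream = DelegatingNameService({("FOO:2", "URL"): "URL:http://example.org/2"})
local = DelegatingNameService({("FOO:1", "URL"): "URL:http://example.org/1"},
                              peers=[upstream])
print(local.resolve("FOO:2", "URL"))
```

From the client's point of view both names resolve through the same call on the local service.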

This may be the difference between an index and a name service. Ideally, an index provides a rich set of tools for querying its source data, but makes no claims about being able to find the document indexed. The name server's contract is the opposite: I have a very simple query language (name, namespace), but I will always find the document for you.

Client

I'm not sure that we have talked about the client. It should hide as much of the architecture from the user as possible. But it should have the power to expose a lot of the underlying architecture to the user when asked to do so; for example, when the client application needs to get a document from a repository, it should probably choose the most expedient location, but a user may want to specify that the app should always use the MIT repository because MIT scanned with greyscale.
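As a rough illustration of that default-plus-override behaviour, assume the client keeps an estimated fetch time for each repository holding the document and that the user may pin one repository. The function name, the cost table, and the repository labels are all hypothetical.

```python
def choose_repository(estimated_seconds, preferred=None):
    """Pick the repository to fetch from.

    estimated_seconds maps repository name -> estimated fetch time.
    A user-specified preferred repository wins if it holds the document;
    otherwise the most expedient (lowest estimate) location is used.
    """
    if not estimated_seconds:
        raise ValueError("no repository holds this document")
    if preferred is not None and preferred in estimated_seconds:
        return preferred
    return min(estimated_seconds, key=estimated_seconds.get)


candidates = {"mit": 4.0, "nearby-mirror": 0.5}
print(choose_repository(candidates))                   # default policy
print(choose_repository(candidates, preferred="mit"))  # user override
```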

Questions

Availability & performance

In the area of repositories, the primary goal is 100% reliability. It would seem that high availability and good performance would also be desired goals. What kind of tradeoffs are involved in meeting more than one of these goals?

Availability. For major libraries, it should be possible for users with reasonable connectivity to get a document whenever they want to. Performance. It should not take an unreasonable amount of time for them to get the document. It seems quite possible that a highly available, good performing system would win out over a *highly* reliable system if users were making the decisions.

If it is not possible to design a reliable, available, good performing repository, what extra services could be provided that shift the burdens of availability and/or performance away from the repository? If this shift is necessary, should the new service become a major component of the architecture?

Privacy & authentication

Some services may only be available to authorized users. How do we grant authentication? At what level should it be done?

Who has PUT (or MODIFY) access to a repository?

What kind of logging should take place and who should be able to view it? What privacy should users have? How much info should publishers get, e.g. for billing?

Naming schemes

How do our naming schemes relate to the URI world? Do we want to adopt their conventions within our system? If we adopt URIs and start building prototype services, then we might be in a position to influence URI formulation by virtue of having experience with a real implementation.

If we adopt URIs, one version of the name service becomes a URN resolver. For example,

Give name service: < URN:IANA:mit.edu.dl:MIT-LCS-TR-30 >
Receive from name service: < URL:http://cs-tr.www.mit.edu/LCS/TR/30 >,
 or < URL:our_scheme://cs-tr.repository.mit.edu/4865738 >.

Mountain-ness of replication

What does it mean to say that certain repositories have stronger claims as to the reliability for their bits? Is the software more thoroughly tested and engineered? Is there a completely different design being used by AAA+ reliable services and A+ reliable services?

Is it possible that different machines that work together as a repository have different degrees of reliability that are designed into the system? To what extent would that be a desirable complexity?

Kinds of failures

It would be interesting to spend some time developing a complete list of the kinds of failures we expect and how we would tolerate them. We had said that the most likely failure would be a single disk failing.

Are we interested in handling non-hardware faults? In Transaction Processing, for example, Jim Gray says that the MTTF for a workstation operating system is 3 weeks. He also observes that operator failures are most common when dealing with other failures; so, in our scenario, a disk drive fails and the user botches the formatting/mounting of the replacement and accidentally trashes more data.

Other potentially useful services

Deduplication

One possibility for deduplication is a name space to name space translation that is applied to the results of a search. The destination name space has a set of rules about what constitutes the "same" document. By translating all IDs into that name space, we deduplicate by mapping different source names to the same destination name.

For example, I specify as my destination name space one in which the language of the documents doesn't matter -- the French, German, and English are all the same to me (perhaps they're all Greek to me). I search an index that returns ISBNs, where editions in different languages are different documents, and translate the result to my no-language name space.
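A sketch of that translation step, assuming the name space to name space translation is available as a function from source ID to destination ID. The function name and the ISBN/WORK identifiers below are made-up placeholders, not real ISBNs.

```python
def deduplicate(source_ids, translate):
    """Collapse search results that map to the same destination name.

    Returns a dict from destination ID to the first source ID seen for it,
    preserving the order in which destination IDs first appear.
    """
    kept = {}
    for source_id in source_ids:
        dest_id = translate(source_id)
        kept.setdefault(dest_id, source_id)
    return kept


# Hypothetical "language doesn't matter" name space.
no_language = {
    "ISBN:A-english": "WORK:A",
    "ISBN:A-french": "WORK:A",
    "ISBN:B-german": "WORK:B",
}
results = ["ISBN:A-english", "ISBN:A-french", "ISBN:B-german"]
print(deduplicate(results, no_language.__getitem__))
```

Keeping the first source ID for each destination name is an arbitrary choice; a client might instead keep all of them so the user can still pick a particular edition.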

Caching

This is strongly related to the availability and performance issue. If we assume that repositories are slow, or swamped with requests, or located far away from some places, then we can assume someone will make local copies of documents to avoid connecting to the repository.

One problem with local copies is that it may be hard to tell if the local copy is the same as the copy in the repository. Any of the failures that may affect a repository may affect a local copy, except that the local copy won't have any of the repository's protections. I guess the simple solution is to store a checksum with the document and compute it when accessing the cache.
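The checksum idea can be sketched as follows, assuming SHA-256 as the checksum (these notes do not name one) and an in-memory cache. The class and method names are hypothetical.

```python
import hashlib


class ChecksummedCache:
    """Local document cache that verifies a stored checksum on every read."""

    def __init__(self):
        self._entries = {}  # ID -> (hex digest, bytes)

    def store(self, doc_id, data):
        """Keep a local copy along with its checksum."""
        digest = hashlib.sha256(data).hexdigest()  # assumption: SHA-256
        self._entries[doc_id] = (digest, data)

    def load(self, doc_id):
        """Return the cached bits, or None if missing or corrupted."""
        entry = self._entries.get(doc_id)
        if entry is None:
            return None
        digest, data = entry
        if hashlib.sha256(data).hexdigest() != digest:
            # The local copy no longer matches its checksum: drop it so the
            # caller goes back to the repository.
            del self._entries[doc_id]
            return None
        return data


cache = ChecksummedCache()
cache.store(4865738, b"scanned page bits")
print(cache.load(4865738))
```

This only detects damage to the local copy after it was stored. To know that the copy matches the repository, the checksum would have to come from the repository rather than be computed locally.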

Format conversion

Important Ideas

Changing technology

Many, varied services; decoupling

jerhy@lcs.mit.edu
Last Update: 26 Sept 1994