The Library 2000 view of a Digital Distributed Library System (DDLS).

Andrew Kass, Sep. 1994

[Note: This document is derived from a 9/16/94 conversation with Jerry Saltzer, Mitchell Charity, Andrew Kass and Jeremy Hylton.]

The basic design concept of the DDLS is that it should be very simple and very robust. To accomplish this, each main piece of the DDLS is kept separate from the others. By not linking pieces together unnecessarily (for example, by not combining the repository and the index), we achieve the greatest service independence and the most flexibility in implementing each component. Although this may mean some duplication of data, such as requiring indexes to keep their own bibliographic information for the documents they index, it allows greater variety because no component depends on another. In addition, many services which add value to the base DDLS are implemented as "3rd party" services. These services are not required for proper functioning of the DDLS, and can have arbitrary availability and functionality without affecting the system as a whole.

There are three main components to the DDLS. Each one is independent from the others and can be run by completely different organizations.

I) The Repository.

The repository is mainly a set of large disks which contain the actual data in the library system. It has a minimal interface, consisting primarily of routines to put new data into the system, to get stored data out, and to check which items are new in the repository. Each item in the repository is referenced by a unique ID.

Data stored in a repository is guaranteed to be stable. This can be accomplished by replicating the data within a particular repository. This replication is transparent to the user. Although a repository may keep 5 different copies of a document, each on a different storage medium, the copies appear to the outside as one document, referenced by its unique ID.
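A minimal sketch of transparent replication, assuming in-memory dictionaries stand in for separate storage media. The class name ReplicatedStore, the methods store and fetch, and the use of a SHA-256 checksum to detect a damaged copy are illustrative assumptions, not part of the design described here.

```python
import hashlib


class ReplicatedStore:
    """Hypothetical store that keeps several copies but exposes one document per ID."""

    def __init__(self, copies=5):
        # Each dict stands in for one storage medium (assumption for illustration).
        self._replicas = [dict() for _ in range(copies)]
        self._digests = {}  # ID -> SHA-256 hex digest of the original data

    def store(self, doc_id, data):
        """Write the same bytes to every replica and remember their checksum."""
        self._digests[doc_id] = hashlib.sha256(data).hexdigest()
        for replica in self._replicas:
            replica[doc_id] = data

    def fetch(self, doc_id):
        """Return the first copy whose checksum still matches the recorded one."""
        expected = self._digests[doc_id]
        for replica in self._replicas:
            data = replica.get(doc_id)
            if data is not None and hashlib.sha256(data).hexdigest() == expected:
                return data
        raise IOError(f"all copies of {doc_id} are damaged or missing")
```

The caller only ever sees one document per ID; which copy served the request is hidden. A real repository would also have to protect the checksum table itself, which this sketch does not attempt.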

A sample interface list:

Get(ID) - returns data for a particular ID.

Put(data) - returns ID. Creates a new entry in the repository. The repository assigns a new ID to the entry.

Whats_new(transID) - returns list of changes and transID. This function takes a transaction ID as an argument. It returns a list of all modifications made to the repository since the transaction identified by that transID. It also returns the latest transID, which the caller passes to its next whats_new call. A transID value of 1 returns a listing of the entire contents of the repository.
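The interface list above can be sketched as a small in-memory class. This is only an illustration under stated assumptions: the class name Repository, the "doc-N" ID format, and the choice to number transactions from 1 and return "last transaction + 1" as the next transID are all hypothetical.

```python
import itertools


class Repository:
    """In-memory stand-in for a repository (hypothetical; illustration only)."""

    def __init__(self):
        self._items = {}  # ID -> data
        self._log = []    # append-only list of (transID, ID); transIDs start at 1
        self._counter = itertools.count(1)

    def put(self, data):
        """Store data under a freshly assigned ID and return that ID."""
        doc_id = f"doc-{next(self._counter)}"
        self._items[doc_id] = data
        self._log.append((len(self._log) + 1, doc_id))
        return doc_id

    def get(self, doc_id):
        """Return the data stored under doc_id."""
        return self._items[doc_id]

    def whats_new(self, trans_id):
        """Return (IDs changed at or after trans_id, transID to use next time)."""
        changes = [doc_id for t, doc_id in self._log if t >= trans_id]
        return changes, len(self._log) + 1
```

A caller starts with transID 1 to list the whole repository, keeps the returned transID, and passes it to the next call to see only what has changed since.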

Issues:

There are several issues that still need to be decided.

A) Security

How do we ensure that records in the repository are added or changed only by people with the correct authorization? This could entail cryptographic authentication such as Kerberos. Or it could entail making the repository immutable; i.e. once a record has been entered it cannot be changed. This would prevent an adversary from changing records, but would not prevent the addition of illegal records without additional security.

There is another type of security issue: controlling access to documents. Certain documents may have restricted access. There are two direct solutions to this problem. One is to use a user authentication scheme, such as Kerberos. This allows a list of the users who are authorized to get copies to be attached to each restricted document. However, there are many problems with this approach. One is that the repository must keep a list of authorized users for each document, which must be maintained somehow. Also, any third party which stores its documents in a repository must trust the security of that repository. Another solution is to encrypt the documents. Only people with the proper decryption key can view the document, which is freely available in encrypted form. One downside to this is that if the key is compromised, the document is no longer protected. However, this is not a large problem if we assume strong encryption. If the encryption cannot be broken in a reasonable length of time, the only way the key can be compromised is if it is disclosed by a trusted individual or discovered in an insecure place. The probability that this will happen is roughly the same as the probability that the decrypted source document will be disclosed or discovered. Therefore, nothing is lost by freely distributing encrypted documents.

B) ID generation

IDs can either be generated by the repository, or by a third party.

1) Generated by repository

In this scenario, when a new document is put into the system, the repository generates a unique ID for that document and returns it to the client. This approach ensures that every document in a particular repository has a unique ID and allows the repository to assign IDs in whatever way it prefers (there may be technical or other reasons why it prefers a certain ID namespace). However, the ID is meaningful only within the repository that issued it: the same document stored in several repositories will most likely have a different ID in each. This problem can be solved with the use of nameservers (see III).

2) Generated in advance

This approach requires an ID to be given to the repository when adding a new document. The advantage is that document IDs can be uniform across all repositories. The disadvantage is that the given ID must be available in (and acceptable to) a repository before the new record can be created.
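The two ID-generation options can be contrasted in a short sketch. Everything named here is hypothetical: the optional doc_id parameter, the IDUnavailableError exception, and the use of random UUIDs for repository-generated IDs are assumptions made for illustration.

```python
import uuid


class IDUnavailableError(Exception):
    """Raised when a caller-supplied ID is already taken in this repository."""


class Repository:
    """Hypothetical repository that supports both ID-generation options."""

    def __init__(self):
        self._items = {}  # ID -> data

    def put(self, data, doc_id=None):
        if doc_id is None:
            # Option 1: the repository picks an ID in its own namespace.
            doc_id = uuid.uuid4().hex
        elif doc_id in self._items:
            # Option 2: the ID was generated in advance, so it must be free here.
            raise IDUnavailableError(doc_id)
        self._items[doc_id] = data
        return doc_id

    def get(self, doc_id):
        return self._items[doc_id]
```

With an ID generated in advance, the same ID can name the document in two repositories, but each repository may refuse it; with repository-generated IDs the put always succeeds, but the IDs differ from one repository to the next.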

C) Getting parts

Clients should have additional control over which parts of a document they request. They should be able to obtain a list of the formats in which the document is available. They should then be able to request the complete document, or certain parts (e.g. bibliographic information, Chapters 1-3, Page 15, etc.), in any available format.
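One way such a request could look is sketched below. The Document class, its formats and get methods, and the idea of naming parts with plain strings such as "bib" or "chapter1" are assumptions for illustration; the design here does not fix any of them.

```python
class Document:
    """Hypothetical document holding several formats, each split into named parts."""

    def __init__(self, renditions):
        # renditions maps format name -> {part name -> content}
        self._renditions = renditions

    def formats(self):
        """List the formats in which this document is available."""
        return sorted(self._renditions)

    def get(self, fmt, parts=None):
        """Return the whole document in fmt, or only the named parts."""
        rendition = self._renditions[fmt]
        if parts is None:
            return dict(rendition)
        return {name: rendition[name] for name in parts}
```

A client would first call formats(), pick one, and then ask either for everything or for a list of parts in that format.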

D) Updating

There may be times when a particular document needs to be updated. This can occur when errors have been made in the bibliographic information and/or the body of the document itself. In such cases, a client should be able to update certain information about a document, or the document itself. In-place updating is not available on an immutable repository; there, the old document ID would be discarded and a new document created with the updated information. This process does not apply to revisions of documents, which should be stored as separate documents.

II) The Index

An index is a service which performs searches for desired documents. An index is not linked in any way to a particular repository, although it is quite possible for an index to reference only one repository. More generally, an index contains the bibliographic information for a set of documents from one or more repositories.

An index receives queries for documents and returns a list of document IDs. Indexes may index a single, entire repository, many repositories, certain topics (such as all CSTRs), subsets of topics (such as all recent or "good" CSTRs), or any combination. Indexes may also use different algorithms or even AI to handle queries. Because no particular query method is required, each index can provide unique services and return different answers to the same query. Different users may prefer different indexes of the same material because of the way each index interprets queries.

The index returns a list of document IDs, each with a corresponding ID type. This information is used to locate the document itself when given to a nameserver. When an index records a new document in a repository, it may assign it an internal ID. If it does this, it updates a nameserver which is associated with that index to relate the repository ID to the index ID.

The index keeps its database current by using the whats_new command for each repository it indexes. This command returns a list of documents which have been added/changed since the last time it was performed (by means of using a transID key). The index then can determine what has changed for each document listed and update accordingly.
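The polling loop described above can be sketched as follows. FakeRepository is a throwaway stand-in whose get returns a title string, and the Index class, its refresh and search methods, and the substring matching are all hypothetical choices made to keep the example short.

```python
class FakeRepository:
    """Throwaway repository whose items are titles (assumption for illustration)."""

    def __init__(self):
        self._titles = {}  # ID -> title
        self._log = []     # position i holds the ID changed by transaction i + 1

    def put(self, doc_id, title):
        self._titles[doc_id] = title
        self._log.append(doc_id)

    def get(self, doc_id):
        return self._titles[doc_id]

    def whats_new(self, trans_id):
        # Changes at or after trans_id, plus the transID to use next time.
        return self._log[trans_id - 1:], len(self._log) + 1


class Index:
    """Hypothetical index that keeps its own copy of bibliographic data."""

    def __init__(self, repositories):
        self._repos = repositories                       # name -> repository
        self._cursor = {name: 1 for name in repositories}  # last transID per repository
        self._titles = {}                                # (repository name, ID) -> title

    def refresh(self):
        """Pull only the changes made since the previous refresh."""
        for name, repo in self._repos.items():
            changed, self._cursor[name] = repo.whats_new(self._cursor[name])
            for doc_id in changed:
                self._titles[(name, doc_id)] = repo.get(doc_id)

    def search(self, word):
        """Return (repository name, ID) pairs whose title contains word."""
        word = word.lower()
        return sorted(k for k, title in self._titles.items() if word in title.lower())
```

Because the index stores the transID returned by each call, a refresh touches only the documents listed as new or changed, and the index never needs to know how the repository stores its data.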

III) The Nameservice

A nameservice is a service which converts from one type of ID namespace to another. Given an ID and ID type, it returns a list of repository IDs and their respective repositories. This information is obtained when an index records a document. When the index assigns its own ID to the document, it informs a nameserver of the index's ID and the repository's ID. This is required for the nameserver to have the necessary information to map from one type of ID to another. Therefore, each index must have a nameserver associated with it, which it updates whenever it makes a new map from its own ID to a repository's ID. In addition, there can be other nameservers which map from an index's ID to ISBN, ISSN, LCCN, etc. A client can map from a known ISBN, for example, to a list of index IDs, which can then be mapped to a list of repository IDs.
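The two-step lookup (ISBN to index IDs, then index IDs to repository IDs) can be sketched like this. The NameServer class, its register and resolve methods, the resolve_chain helper, and every ID in the usage note are hypothetical placeholders, not real identifiers.

```python
from collections import defaultdict


class NameServer:
    """Hypothetical nameserver mapping (ID type, ID) to IDs in other namespaces."""

    def __init__(self):
        self._map = defaultdict(list)  # (ID type, ID) -> [(target namespace, target ID)]

    def register(self, id_type, ident, target_namespace, target_id):
        """Record that ident also names target_id in target_namespace."""
        entry = (target_namespace, target_id)
        if entry not in self._map[(id_type, ident)]:
            self._map[(id_type, ident)].append(entry)

    def resolve(self, id_type, ident):
        """Return every known (namespace, ID) pair for the given ID."""
        return list(self._map.get((id_type, ident), []))


def resolve_chain(id_type, ident, servers):
    """Feed the results of each nameserver into the next one in the list."""
    current = [(id_type, ident)]
    for server in servers:
        following = []
        for t, i in current:
            following.extend(server.resolve(t, i))
        current = following
    return current
```

For example, if one nameserver maps ("isbn", "ISBN-EXAMPLE") to ("index-A", "A17") and the nameserver of index A maps that ID to documents in two repositories, resolve_chain returns both repository entries.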

IV) 3rd party services

In keeping with the concept of a very simple, very reliable core system for a DDLS, many services are implemented as "3rd party" services. These services can be run by anyone, and are complementary to the DDLS. 3rd party services are in general any service which adds value to the core services provided above. As there are no strict specifications for a 3rd party service, different implementations of similar services will offer different feature sets and levels of reliability. This atmosphere provides the greatest flexibility for users and providers of services.

There are a variety of 3rd party services which are not required for the basic operation of a DDLS, but are of value. These services can either be a traditional service, which takes a list of arguments, performs a service on them, and then returns a new list, or an "in the pipeline" service. "In the pipeline" refers to a service which masquerades as being a repository, index or nameserver. For example, a format conversion service which is "in the pipeline" masquerades as a repository. A client connects to it and asks for a particular document in a specific format. The service then connects to the true repository and retrieves the document. It then converts it to the proper format and returns the converted document.
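An "in the pipeline" service can be sketched as a wrapper that exposes the same get call as a repository. The class names PlainRepository and HtmlConvertingProxy are hypothetical, and plain text to HTML is only a stand-in conversion chosen because it needs nothing beyond the Python standard library.

```python
import html


class PlainRepository:
    """Stand-in for the true repository; stores plain-text documents."""

    def __init__(self, docs):
        self._docs = docs  # ID -> text

    def get(self, doc_id):
        return self._docs[doc_id]


class HtmlConvertingProxy:
    """Looks like a repository to clients, but converts what it fetches."""

    def __init__(self, upstream):
        self._upstream = upstream

    def get(self, doc_id):
        # Fetch from the true repository, then convert before returning.
        text = self._upstream.get(doc_id)
        return "<pre>" + html.escape(text) + "</pre>"
```

A client written against get(ID) cannot tell the proxy from a repository, which is what lets the conversion service be added or removed without changing the core system.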

A) Deduplication

One service that would be useful, but is best implemented outside of the core system, is deduplication. This service takes many document IDs and eliminates duplicates. Duplication may occur when a single index contains the records from many repositories, when the same document is available in different languages or versions, or when many indexes are used. The service can return a single ID, or lists of IDs sorted by language, revision, etc.
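A minimal sketch of one possible deduplication rule follows. It assumes the service has bibliographic records for the IDs it is given and treats two records as the same work when their normalized title and author match; the record fields, the function names, and the matching rule are all assumptions, and a real service would need something more forgiving.

```python
from collections import defaultdict


def normalize(text):
    """Lower-case and collapse whitespace so trivial differences do not matter."""
    return " ".join(text.lower().split())


def deduplicate(records):
    """Group record IDs by work, then by language.

    records: iterable of dicts with 'id', 'title', 'author' and 'language' keys.
    Returns {(title, author): {language: [IDs]}}.
    """
    groups = defaultdict(dict)
    for rec in records:
        key = (normalize(rec["title"]), normalize(rec["author"]))
        groups[key].setdefault(rec["language"], []).append(rec["id"])
    return dict(groups)


def representatives(groups):
    """Pick a single ID per work: the first ID in the alphabetically first language."""
    return [by_lang[sorted(by_lang)[0]][0] for by_lang in groups.values()]
```

The first function corresponds to returning lists of IDs sorted by language, and the second to returning a single ID per document.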

B) Caching

Caching recently and heavily used documents is another service which may be very useful, but can be implemented in a variety of different ways. Some caches may be very robust, but only contain a few documents. Some caches may contain a variety of documents, but do not vouch for the validity of the information. These caches would probably clear a document from the cache at the slightest hint that it was corrupted. The size, reliability, speed and other characteristics of the cache would be extremely site dependent, especially since a cache is really only useful if it is local, or on a direct trunk line (unless it is being used to alleviate the load on a repository).
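One of the many possible cache designs is sketched below: a small least-recently-used cache that sits in front of a repository. The class names, the capacity parameter, the hit and miss counters, and the invalidate method are hypothetical; the text above deliberately leaves these choices to each site.

```python
from collections import OrderedDict


class DictRepository:
    """Stand-in repository backed by a dictionary."""

    def __init__(self, docs):
        self._docs = docs

    def get(self, doc_id):
        return self._docs[doc_id]


class CachingProxy:
    """Hypothetical LRU cache with the same get call as a repository."""

    def __init__(self, upstream, capacity=2):
        self._upstream = upstream
        self._capacity = capacity
        self._cache = OrderedDict()  # ID -> data, least recently used first
        self.hits = 0
        self.misses = 0

    def get(self, doc_id):
        if doc_id in self._cache:
            self._cache.move_to_end(doc_id)  # mark as most recently used
            self.hits += 1
            return self._cache[doc_id]
        self.misses += 1
        data = self._upstream.get(doc_id)
        self._cache[doc_id] = data
        if len(self._cache) > self._capacity:
            self._cache.popitem(last=False)  # evict the least recently used entry
        return data

    def invalidate(self, doc_id):
        """Drop an entry, e.g. at the first hint that it may be corrupted."""
        self._cache.pop(doc_id, None)
```

A cache that does not vouch for its contents would call invalidate freely and fall back to the repository, since a miss costs only time.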

C) Format Conversion

A format conversion service would convert documents from one format to another. Different implementations will have different combinations of number of formats supported, speed and quality of conversion.

D) Multiple index

This type of service can act as a proxy to other indexes. When it is searched, it searches a number of other indexes and returns the combined set of results.
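A sketch of such a proxy follows, assuming every index offers a search(query) call that returns a list of (ID type, ID) pairs. KeywordIndex is a throwaway index used only to make the example runnable, and both class names are hypothetical.

```python
class KeywordIndex:
    """Throwaway index matching a query against titles (illustration only)."""

    def __init__(self, id_type, titles):
        self._id_type = id_type
        self._titles = titles  # ID -> title

    def search(self, query):
        query = query.lower()
        return [(self._id_type, doc_id)
                for doc_id, title in sorted(self._titles.items())
                if query in title.lower()]


class MultiIndex:
    """Hypothetical proxy that queries several indexes and merges the answers."""

    def __init__(self, indexes):
        self._indexes = list(indexes)

    def search(self, query):
        seen = set()
        combined = []
        for index in self._indexes:
            for hit in index.search(query):
                if hit not in seen:  # drop exact repeats, keep first-seen order
                    seen.add(hit)
                    combined.append(hit)
        return combined
```

This merge only removes identical (ID type, ID) pairs; recognizing that two different IDs name the same document is the job of a deduplication service.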


