These are working notes on the Library 2000 architecture. They were created recently and provide only a preliminary framework for further discussion. These notes are my own, but they are based on discussions with other members of the Library 2000 group. Although many of the ideas may not be my own, the inconsistencies and problems may well belong to me.
Now that I've made a single pass through all the issues that I had at least a note for, I'm not sure how to describe the contents of this document. It contains a mixture of three things: notes on our discussion, questions about how to flesh it out, and some rough ideas I had that I've included for purposes of further discussion. The last group is the oddball one; I'm not sure all the ideas are well thought-out but I've included them anyway.
In the area of repositories, the primary goal is 100% reliability. It would seem that high availability and good performance would also be desired goals. What kind of tradeoffs are involved in meeting more than one of these goals?
Availability. For major libraries, it should be possible for users with reasonable connectivity to get a document whenever they want to. Performance. It should not take an unreasonable amount of time for them to get the document. It seems quite possible that a highly available, well-performing system would win out over a *highly* reliable system if users were making the decisions.
If it is not possible to design a reliable, available, good performing repository, what extra services could be provided that shift the burdens of availability and/or performance away from the repository? If this shift is necessary, should the new service become a major component of the architecture?
How do our naming schemes relate to the URI world? Do we want to adopt their conventions within our system? If we adopt URIs and start building prototype services, then we might be in a position to influence URI formulation by virtue of having experience with a real implementation.
If we adopt URIs, one version of the name service becomes a URN resolver. For example,
Give name service: < URN:IANA:mit.edu.dl:MIT-LCS-TR-30 >
Receive from name service: < URL:http://cs-tr.www.mit.edu/LCS/TR/30 >, or < URL:our_scheme://cs-tr.repository.mit.edu/4865738 >.
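The URN-to-URL lookup above can be sketched as a simple table-driven resolver. This is only an illustration under the assumption that the name service keeps a mapping from each URN to one or more URLs; the table contents come from the example above, and the function name is made up.

```python
# Hypothetical sketch of a name service acting as a URN resolver.
# A URN maps to one or more URLs where the document can be fetched.
URN_TABLE = {
    "URN:IANA:mit.edu.dl:MIT-LCS-TR-30": [
        "URL:http://cs-tr.www.mit.edu/LCS/TR/30",
        "URL:our_scheme://cs-tr.repository.mit.edu/4865738",
    ],
}

def resolve(urn):
    """Return the list of URLs known for a URN, or an empty list if unknown."""
    return URN_TABLE.get(urn, [])
```

A real resolver would of course be a distributed service rather than an in-memory table, but the interface -- give a URN, receive URLs -- is the same.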
What does it mean to say that certain repositories have stronger claims as to the reliability for their bits? Is the software more thoroughly tested and engineered? Is there a completely different design being used by AAA+ reliable services and A+ reliable services?
Is it possible that different machines that work together as a repository have different degrees of reliability that are designed into the system? To what extent would that be a desirable complexity?
It would be interesting to spend some time developing a complete list of the kinds of failures we expect and how we would tolerate them. We had said that the most likely failure would be a single disk failing.
Are we interested in handling non-hardware faults? In Transaction Processing, for example, Jim Gray says that the MTTF for a workstation operating system is 3 weeks. He also observes that operator failures are most common when dealing with other failures; so, in our scenario, a disk drive fails and the user botches the formatting/mounting of the replacement and accidentally trashes more data.
One possibility for deduplication is a name space to name space translation that is applied to the results of a search. The destination name space has a set of rules about what constitutes the "same" document. By translating all IDs into that name space, we deduplicate by mapping different source names to the same destination name.
For example, I specify as my destination name space one in which the language of the documents doesn't matter -- the French, German, and English are all the same to me (perhaps they're all Greek to me). I search an index that returns ISBNs, where editions in different languages are different documents, and translate the result to my no-language name space.
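The name-space translation above can be sketched in a few lines. The translation table and the ISBN/work identifiers are invented for illustration; the point is only that distinct source names collapse to one destination name, which is what deduplicates the result list.

```python
# Hypothetical translation from a source name space (ISBNs, where each
# language edition is a distinct document) to a destination name space
# in which language does not matter. All identifiers are made up.
EDITION_TO_WORK = {
    "isbn:0-111": "work:odyssey",   # English edition
    "isbn:0-222": "work:odyssey",   # French edition
    "isbn:0-333": "work:odyssey",   # German edition
    "isbn:0-444": "work:iliad",
}

def deduplicate(search_results):
    """Translate each source ID to the destination name space, dropping
    results that map to a destination name already seen."""
    seen = set()
    unique = []
    for source_id in search_results:
        dest = EDITION_TO_WORK.get(source_id, source_id)  # unknown IDs pass through
        if dest not in seen:
            seen.add(dest)
            unique.append(dest)
    return unique
```

Note that the rules about what constitutes the "same" document live entirely in the translation table, so a different destination name space (say, one where language *does* matter) is just a different table.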
This is strongly related to the availability and performance issue. If we assume that repositories are slow, or swamped with requests, or located far away from some places, then we can assume someone will make local copies of documents to avoid connecting to the repository.
One problem with local copies is that it may be hard to tell if the local copy is the same as the copy in the repository. Any of the failures that may affect a repository may affect a local copy, except that the local copy won't have any of the repository's protections. I guess the simple solution is to store a checksum with the document and recompute it when accessing the cache.
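The simple solution above -- store a checksum with the document, recompute it on access -- can be sketched as follows. The choice of SHA-256 and the cache layout are assumptions; the notes only say "a checksum".

```python
# Sketch of checksum-validated cache access. SHA-256 is an assumed
# choice of checksum; the cache is modeled as a dict mapping a name
# to a (document bytes, stored checksum) pair.
import hashlib

def checksum(data):
    return hashlib.sha256(data).hexdigest()

def fetch_from_cache(cache, name):
    """Return the cached document only if its stored checksum still
    matches; return None (forcing a repository refetch) otherwise."""
    entry = cache.get(name)
    if entry is None:
        return None
    data, stored = entry
    if checksum(data) != stored:
        return None  # cached bits were silently corrupted
    return data

doc = b"the document bits"
cache = {"MIT-LCS-TR-30": (doc, checksum(doc))}
```

This catches corruption of the cached bits, though not the case where the repository's copy has been replaced since the cache was filled; detecting that would require asking the repository for its current checksum.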