This is the text version of a paper originally prepared with FrameMaker, so 
details such as italics and footnotes are missing.  The full citation of the 
paper is as follows:

  Saltzer, J. H., "Needed:  A Systematic Structuring Paradigm for
  Distributed Data," Operating Systems Review 27, 2 (April, 1993),
  pp. 77-81.  Originally distributed as paper #41 in 5th ACM
  SIGOPS Workshop on Models and Paradigms for Distributed Systems
  Structuring, September 21-23, 1992, Le Mont Saint-Michel, France,
  pp. 1-5.

This paper is also available in PostScript form.

==============================================================================

NEEDED: A SYSTEMATIC STRUCTURING PARADIGM FOR DISTRIBUTED DATA

by Jerome H. Saltzer
Library 2000
M.I.T. Laboratory for Computer Science
September 10, 1992



The purpose of this note is to alert the distributed and operating 
systems communities to a research and design exercise that is going on 
in an adjacent application-oriented community. This research and design 
exercise involves inventing a paradigm for distributed system 
structuring that is distinctly different from that of remote procedure 
call and related paradigms that are the usual focus of the distributed 
systems community.

The class of applications involved might be roughly characterized as 
distributed, linked data. The remote procedure call paradigm, in which 
a client asks a server to perform some named computation on a set of 
supplied arguments and return a result, is not an especially congenial 
fit to this application, although it may be a useful tool at a lower 
level of abstraction in resolving links. Perhaps because the application 
of linked data has not been widely considered, it is not clear yet what 
is an appropriate form for the structuring paradigm, or even whether or 
not the problem has been posed carefully enough to allow one to propose 
specific structures. This note describes the problem as it has been posed 
in the distributed data community, and points to a few early attempts to 
suggest mechanisms; it does not claim to settle the issue.

The interesting requirement in distributed storage of related data 
lies in those data relations that can be characterized as cross-
reference. There are many different applications in which one object 
might need to contain a cross-reference to another, remote, object. As 
examples, one might propose to build a distributed hypertext system, a 
sales reporting system for a corporation that has many branch offices, 
an office system that allows one worker to incorporate an element of a 
remote spreadsheet into a local one by reference, a distributed legal 
case database, with cross-references among cases, and also from state to 
federal cases and back, or simply an electronic library.

For concrete scenarios, consider the specific application of an on-line 
electronic library (comparable scenarios arise in most of the other 
application examples). In the electronic library, the primary scenarios 
of interest are the following: The maintainer of a document storage 
service has installed a new document and wants to advertise the new 
document to clients and clients of clients, and allow search services to 
store and pass along cross-references (in this application a cross-
reference is usually called a "citation") to the new document. In a 
related scenario, the client of a search service has completed a search 
and wants to be able to store a persistent cross-reference to one of the 
items discovered. Storing the search query that was used to discover the 
item is one common suggestion, but that method is not very satisfactory, 
because that particular search may have returned several items in its 
result set, or it may have been framed in terms of something unrelated 
to what makes the document interesting. In addition, since collections 
grow and shrink, there is no guarantee that the search service will 
return the same result set for the same query at a future time--especially 
if the document itself might change in such a way as to cause the query 
not to find it. In yet another related scenario, the user of a document 
discovers in it a cross-reference to another document, and wants to make 
a copy of that cross-reference in persistent storage for future use. 
And in a final scenario, a client includes a cross-reference in a 
document which it then submits to be stored by some storage service, 
possibly a different one.

In each scenario, some client intends to place the cross-reference in 
persistent storage somewhere, for presentation to the storage service at 
an unknown time in the future. When the client eventually presents the 
cross-reference to the storage service, it hopes to retrieve the cited 
item.

In an electronic library, such cross-references may be used:

*	by an author, in creating a new document, to cite previous, related 
documents.

*	by the librarian of a storage service, in adding a document to a 
collection, making concrete an author's traditional text citations.

*	by a user of a search service, to prepare a list on a personal note 
pad of discovered documents.

*	by a search service as the internal connection between its index and 
the documents held by a storage service (e.g., to connect a 
bibliographic record with the corresponding document).

*	by a search service, as the method of telling the client how to obtain 
the item from a storage service.

*	by a client, to pass along to another client.

*	by an author, to cite anchor (that is, identified internal) points 
within another document.

The following simple scenario perhaps best captures the required 
structuring behavior: At the application level, a user has brought up a 
document in a display window and, while browsing through it, has noticed 
that it mentions another document. The browser lets the user point to 
the cross-reference with the mouse, click, and expect to see the cited 
document appear on the screen in another window.

There are a number of interesting constraints on the solution, at 
least in the electronic library application. Some of the other 
applications have analogous constraints:

*	The storage service to which the cross-reference is eventually 
presented may be different from the one that provided the original 
cross-reference--it may be a backup system, or even a competitive 
provider.

*	The interval between creation and use of the cross-reference can be 
quite long, perhaps decades, during which time the storage service may 
have been upgraded, relocated, merged with other storage services, and 
placed under different administrative control. 

*	Cross-references will persist long enough that the system that is 
intended to resolve them will become obsolete.

*	Between creation and use of the cross-reference, the cited document 
may have been deleted, updated, or superseded. Something graceful 
should happen if multiple versions are available. If the "document" 
is actually a dynamic piece of data such as a current stock quote, 
perhaps something different, yet graceful, should happen.

*	Between creation and use of the cross-reference, the organization of 
the library may have been revised, and the target document may now be 
classified differently.

*	Between creation and use of the cross-reference, the physical and low-
level logical configuration of the storage server may have been 
revised, and the target document may be in a different directory or 
on a different physical volume.

*	Between creation and use of the cross-reference, the storage 
representation of the target document may have been discovered to be 
defective, and restored from a backup copy.

*	The initial discovery mechanism may identify only the document; a 
later discovery process may identify anchor points of interest.

*	The user who presents the cross-reference may or may not be authorized 
to obtain the document. 

*	A single document may be indexed by several different search services 
that are under different administrations.

*	In response to presentation of a cross-reference, a storage server 
may, rather than delivering the desired document, instead return 
another cross-reference.

*	If a client makes inquiries of several different search services, it 
would like to merge the several responses, which requires that it be 
possible to figure out which of the returned cross-references are 
duplicates.

As can be seen from this laundry list of constraints, the requirements 
on cross-references among distributed data objects read more like the 
requirements for a sophisticated name service than like those for an 
RPC service.
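
The last constraint in the list, merging responses from several search 
services, can be made concrete with a minimal sketch. It assumes, purely 
for illustration, that a cross-reference reduces to a (service_id, 
object_id) pair and that two references are duplicates when those pairs 
agree after trivial normalization. The names Reference, canonical_key, 
and merge_results are hypothetical and come from none of the proposals 
discussed in this note.

```python
from typing import Dict, Iterable, List, Tuple

# Assumption: a cross-reference is modelled as a (service_id, object_id) pair.
Reference = Tuple[str, str]


def canonical_key(ref: Reference) -> Reference:
    """Normalize a reference so trivially different spellings compare equal.

    The rule used here (case-insensitive service name, surrounding
    whitespace ignored) is an assumed one, chosen only for the example.
    """
    service_id, object_id = ref
    return (service_id.strip().lower(), object_id.strip())


def merge_results(result_sets: Iterable[Iterable[Reference]]) -> List[Reference]:
    """Merge several result sets, keeping the first occurrence of each reference."""
    seen: Dict[Reference, Reference] = {}
    for results in result_sets:
        for ref in results:
            # setdefault keeps the earlier entry when the key is already present.
            seen.setdefault(canonical_key(ref), ref)
    return list(seen.values())
```

The sketch works only if every search service reports the same 
identifier for the same document. Two services that cite copies held by 
different storage providers would not be recognized as duplicates, which 
is part of what makes this constraint hard to meet.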

It would appear that a cross-reference must internally be composed of 
at least two components. The first would be an identifier, 
perhaps to be presented to a name server, that allows the client to 
discover an appropriate and current server name, port, and protocol to 
use, and to verify upon connection that the service at the other end is 
the intended one. The second component is probably a specific object 
identifier that the server is expected to recognize. Beyond that, one 
moves into the realm of speculation. There might be an expiration date 
after which the server does not guarantee to honor the cross-reference, 
and perhaps also a backup query, which might be useful in identifying 
the object after the original cross-reference has expired (or has failed 
for some other reason). Finally, one might want to include some 
kind of check data to verify that the object retrieved is actually the 
one that was previously cited. Another potential component, whose 
rationale is much less clear, is the identity of an application that 
knows how to interpret the stored object, and that should be launched in 
conjunction with the arrival of a response from the server, or perhaps 
spliced into the path between the client and the server.
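
The components just listed can be gathered into a small data structure. 
The following is only a sketch under stated assumptions, not a proposed 
format: the field names are hypothetical, and the check data is assumed 
to be a SHA-256 digest of the object's bytes, a choice made here for 
illustration only.

```python
import hashlib
from dataclasses import dataclass
from datetime import date
from typing import Optional


@dataclass(frozen=True)
class CrossReference:
    """Hypothetical layout; the field names are illustrative only."""
    service_id: str                      # presented to a name server to find a current server
    object_id: str                       # identifier the server is expected to recognize
    expires: Optional[date] = None       # after this date the server makes no promise
    backup_query: Optional[str] = None   # fallback search if the reference fails
    check_digest: Optional[str] = None   # check data for the cited object (assumed SHA-256)

    def is_expired(self, today: date) -> bool:
        """True once the expiration date, if any, has passed."""
        return self.expires is not None and today > self.expires

    def matches(self, content: bytes) -> bool:
        """True if the retrieved bytes agree with the check data, or none was stored."""
        if self.check_digest is None:
            return True
        return hashlib.sha256(content).hexdigest() == self.check_digest


def make_reference(service_id: str, object_id: str, content: bytes,
                   expires: Optional[date] = None,
                   backup_query: Optional[str] = None) -> CrossReference:
    """Build a reference whose check data is computed from the cited object."""
    digest = hashlib.sha256(content).hexdigest()
    return CrossReference(service_id, object_id, expires, backup_query, digest)
```

A byte-level digest is the simplest possible check and is almost 
certainly too strict: it would reject an updated version of the 
document, or one restored from backup in a different storage 
representation, both of which the constraints listed earlier require the 
system to handle gracefully.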

The details of how such a cross-reference might be engineered, so that 
it can be stored, passed from client to client, and in the end be 
recognizable by the server, are an interesting design challenge. Several 
projects have run up against the challenge, and have suggested various 
strategies that solve parts of the problem. At least six somewhat 
different ideas are extant:

*	Tim Berners-Lee has proposed the cross-reference scheme used in his 
World-Wide Web. This proposal is moderately complete, but it 
concentrates mostly on developing a syntax that can be parsed by a 
computer and also read by a person. It takes the view that it should 
be possible to create a document identifier that is both unique and 
perpetually valid.

*	Clifford Lynch, to stimulate discussion of the topic within the 
Coalition for Networked Information, developed a list of requirements 
and, for each, some observations about mechanics that might address that 
requirement.

*	Brewster Kahle has proposed the cross-reference scheme used in his 
Wide Area Information Servers (WAIS) system.

*	F. H. Ayres has made a proposal for a universal standard book number.

*	Theodor Nelson, for the Xanadu hypertext system, proposed a 
universal, hierarchical document numbering scheme with provision for 
versions and internal anchor points. It covers several of the 
requirements mentioned earlier by assuming complete homogeneity among 
the linked items.

*	Apple Computer, in the System 7 Alias Manager for the Macintosh, has 
worked out a sophisticated system for linking files within a Macintosh 
and across a network of cooperating machines. The Alias Manager uses 
a combination of symbolic relative and absolute path names as well as 
unique file, volume, and system identifiers, to maintain links in the 
face of renaming, hierarchy restructuring, and restoration from 
backup.
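
The general idea of keeping several redundant identifiers and trying 
them in turn, as in the last item above, can be sketched as follows. 
This is not Apple's algorithm or API. The types Link and Volume, the 
function resolve, the slash-separated paths, and the order in which the 
identifiers are tried are all assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass(frozen=True)
class Link:
    """Hypothetical stored link carrying three redundant ways to find a file."""
    file_id: int          # unique identifier assigned when the file was created
    absolute_path: str    # path from the volume root when the link was made
    relative_path: str    # path relative to the directory holding the link


@dataclass
class Volume:
    """Hypothetical view of a volume's current state."""
    by_id: Dict[int, str]      # file_id -> current path
    by_path: Dict[str, bytes]  # current path -> contents


def resolve(link: Link, volume: Volume, base_dir: str) -> Optional[str]:
    """Return the current path of the target, or None if nothing matches."""
    # 1. A unique identifier survives renaming and moves within the volume.
    path = volume.by_id.get(link.file_id)
    if path is not None and path in volume.by_path:
        return path
    # 2. An absolute path survives a restore that assigns new identifiers.
    if link.absolute_path in volume.by_path:
        return link.absolute_path
    # 3. A relative path survives moving a whole subtree together.
    candidate = base_dir.rstrip("/") + "/" + link.relative_path
    if candidate in volume.by_path:
        return candidate
    return None
```

Each identifier fails under a different kind of change, so the 
combination survives more kinds of change than any one of them alone.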

Each proposal addresses one or more parts of the problem, but none 
covers the entire range of requirements. More important, the discussions 
by Lynch and by Kahle are characterized by mentions of requirements not 
met, and possible alternative approaches, together with questions about 
whether or not the requirements are real. Interestingly, almost as if to 
recursively emphasize the need for a solution to this problem, most of 
the current literature on the subject is not found in traditional 
journals, reports, or libraries, but rather is found only on-line in 
various repositories within the internet.



Conclusion

As mentioned at the outset, this note has only described the problem, 
not proposed any solutions; even this description is not likely to 
resemble the one that will someday seem obvious.



Acknowledgements

Ideas and suggestions came from discussions with Mitchell Charity, 
Jim O'Toole, Mark Day, Tim Berners-Lee, Andrew Birrell, Dave Redell, Paul 
McJones, and Ron Weiss.



Bibliography

Tim Berners-Lee, Jean-Francois Groff, and Robert Cailliau. "Universal 
Document Identifiers on the Network." CERN, February 1992. On-line 
location: info.cern.ch:/pub/www/doc/udi1.ps

Clifford Lynch. "Workshop on ID and Reference Structures for 
Networked Information." 24 October 1991. (Call for participation. On-
line location: CNI-ARCH listserv mailing list at uccvma.bitnet. Also 
found in WAIS-discussion digest #33, 27 November, 1991.)

Brewster Kahle. "Document Identifiers, or International Standard Book 
Numbers for the Electronic Age." Version 2.2, September 1991. On-line 
location: quake.think.com:/pub/wais/doc/doc-ids.txt

F. H. Ayres. "The Universal Standard Book Number (USBN): why, how and 
a progress report." Program: Automated Library and Information Systems 
10, 2, pp. 75-80. (London: Association for Information Management: April, 
1976)

Theodor H. Nelson. Literary Machines, Edition 87.1. (San Antonio, 
Texas: Theodor H. Nelson: 1987)

Apple Computer.  "Alias Manager," Inside Macintosh Volume VI, chapter 
27. (New York: Addison-Wesley: 1991)