FIVE INFORMATION INTERCHANGE PROPOSALS
by Jerome H. Saltzer,
with pieces by Hector Garcia-Molina and ideas from many others
Version of October 22, 1993
This summer, in a series of meetings including Robert Wilensky, Hector
Garcia-Molina, Rebecca Lasher, Andy Kacsmar, Vicky Reich, and Jerry
Saltzer, we noted that the various CNRI participants are beginning to
collect bodies of data, but we haven't really put together any plans to
exchange the data. Out of this discussion emerged five related proposals
intended to expedite information interchange among the CNRI participants.
Alan Bawden, Tom Lee, Jeremy Hylton, Win Treese, and Mitchell
Charity then helped debug the proposals. This note describes the
proposals and puts them on the table for debate.
First, a piece of background. The CSTR group has agreed to use RFC-1357
as a standard bibliographic exchange record. That standard contains an
open-ended (which means undefined) "RETRIEVAL" field, intended to provide
the information needed to allow automated location of the rest of the
document, such as the name of an FTP site and the name of the file at that
site to transfer. It also defines an "ID" field, a standard two-component
document identifier, such as STAN//CSL-TR-91-004. This identifier
consists of, before the double slash, the identifier of the distributor of
the document and, after the double slash, the identifier that this
distributor uses for the document.
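As a minimal sketch of how a client might split such an identifier into its two components (the names DocumentId and parse_document_id are illustrative assumptions, not part of RFC-1357):

```python
from typing import NamedTuple


class DocumentId(NamedTuple):
    distributor: str  # part before the double slash, e.g. "STAN"
    local_id: str     # part after the double slash, e.g. "CSL-TR-91-004"


def parse_document_id(doc_id: str) -> DocumentId:
    """Split a two-component ID at its first double slash."""
    distributor, sep, local_id = doc_id.strip().partition("//")
    if not sep or not distributor or not local_id:
        raise ValueError(f"not a two-component document ID: {doc_id!r}")
    return DocumentId(distributor, local_id)


print(parse_document_id("STAN//CSL-TR-91-004"))
```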
In discussions about use of RFC-1357 it became apparent that the concept
of embedding retrieval information in a RETRIEVAL field is flawed. All of
the other information in a bibliographic record is relatively permanent,
because it describes the document itself, which doesn't change very often.
But the to-be-defined RETRIEVAL field has been assumed to describe the
network site, the method of obtaining the document, and the available
formats, all three of which we can expect to change (perhaps frequently)
with technology and with development of new ideas. In addition, this kind
of retrieval information is properly the province of technologists, not of
librarians; it is the only field in the bibliographic record that is not
determined by bibliographic considerations. One can thus argue that
including this kind of retrieval information explicitly in the
bibliographic record violates some kind of packaging or abstraction
boundary. The value in maintaining a set of information items about a
document that is purely bibliographic is that the resulting bibliographic
record can be freely circulated and there need be no inhibition about
storing it for an indefinite time and forwarding it to others.
The first four proposals offer a solution to this problem. The fifth
proposal is on a slightly different topic, a short-term convention for use
of the FTP protocol.
---------------------------------------------------------------------
Proposal 1. Use the retrieval field for a permanent document identifier
rather than a document locator. For TR's, this document identifier should
simply be a repetition of the bibliographic record ID field. For example,
RETRIEVAL:: CSTR93: STAN//CSL-TR-91-004
The reason for repeating the ID field, rather than simply specifying that
retrieval should be done by using the current ID field, is to leave
flexibility for future alternative document identifying systems (which may
well use different kinds of resource identifiers). The label "CSTR93"
identifies the naming system; it indicates that the following identifier
is to be interpreted according to the rest of this proposal. A different
system would have a different system naming label.
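A client could read such a field roughly as follows. This is only a sketch: it assumes the field is written exactly as in the example above (the tag, a double colon, the naming-system label, a single colon, then the identifier), and the function name is hypothetical.

```python
def parse_retrieval_field(line):
    """Return (naming_system, identifier) from a RETRIEVAL line.

    Assumes the layout shown in the example:
        RETRIEVAL:: <system label>: <identifier>
    """
    tag, sep, value = line.partition("::")
    if not sep or tag.strip().upper() != "RETRIEVAL":
        raise ValueError(f"not a RETRIEVAL field: {line!r}")
    system, sep, identifier = value.partition(":")
    if not sep or not system.strip() or not identifier.strip():
        raise ValueError(f"missing naming-system label or identifier: {line!r}")
    return system.strip(), identifier.strip()


system, identifier = parse_retrieval_field("RETRIEVAL:: CSTR93: STAN//CSL-TR-91-004")
print(system, identifier)
```

A client that does not recognize the returned label would simply decline to interpret the identifier, which is what leaves room for other naming systems.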
---------------------------------------------------------------------
Proposal 2. Use DNS to locate document distributors. We propose to use
the first part of the CSTR93 standard document identifier as a name that,
when looked up appropriately, returns the name of a distribution site for
that distributor's documents. Further, we propose to use the internet
Domain Name System (DNS) as the resolving mechanism for this information.
The idea is that one takes the name of the distributor found in the
standard ID field (in the example above, "STAN"), appends a universally-
agreed-upon name of a designated name server, and hands this name to the
DNS at its standard name-resolving interface. Suppose, for purposes of
discussion, we agree to set up a centrally-managed name server called
"find.distpoints.lib". Then we ask the DNS to resolve the name
"STAN.find.distpoints.lib".
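The name construction can be sketched in a few lines. The function name is hypothetical, and "find.distpoints.lib" is only the placeholder server name used for discussion here; the actual lookup would be handed to whatever DNS resolver the client has available, which is not shown.

```python
def distributor_query_name(distributor, name_server="find.distpoints.lib"):
    """Build the DNS name to resolve for a distributor such as "STAN"."""
    distributor = distributor.strip()
    if not distributor or "." in distributor:
        raise ValueError(f"unexpected distributor name: {distributor!r}")
    # Tolerate a leading or trailing dot on the configured server name.
    return f"{distributor}.{name_server.strip('.')}"


print(distributor_query_name("STAN"))
```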
(Credit: this is essentially the same idea used by the Hesiod service of
Project Athena to locate configuration and management information.) The
DNS has several advantages: it is already in place and available on any
host that uses the Internet; it is distributed and robust; code both for
using it in applications and for managing its data is widely available for
many platforms; and it implements something fairly close to the required
function. It also includes caching with cache time-out mechanisms, and it
is fast.
A feature of the DNS is that, when requesting name resolution, one
specifies what type of name record is wanted. For various specific
purposes, several types have been defined: an IP address, a mail
forwarding pointer, an equivalent primary name, and, for more general
uses, an uninterpreted ASCII record. For our purpose, we will have our
name server return an uninterpreted ASCII record (type "TXT"), on which we
will impose our own interpretation.
The returned record will contain two fields: a protocol name, and a
string of characters whose interpretation depends on the protocol; this
second string will typically contain the name of a host on the internet
and it may also include other information that is needed to get the
interaction started properly. Thus the DNS request
resolve (name = "STAN.find.distpoints.lib", type = "TXT")
will return a string such as,
CSTR lookup.library.stanford.edu
This returned string means "use the CSTR protocol and open a connection to
lookup.library.stanford.edu".
A site providing truly primitive service might arrange that the request
resolve (name = "STAN.find.distpoints.lib", type = "TXT")
return the string
FTP ftpsite.CS.stanford.edu /pub/techreportlist.text
which might mean that the client is expected to FTP some list of all
technical reports, and presumably will find further details in that list.
If a site provides that kind of response, the interpretation of the
returned file would have to be worked out; we do not follow up that idea
here.
The important concept here is that the client has just learned what
protocol to use to get the desired technical report, and also the explicit
contact information needed to initiate that protocol.
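To illustrate the two-field interpretation, here is a small sketch that splits a returned TXT string into a protocol name and the protocol-dependent remainder. The type and function names are assumptions made for this example.

```python
from typing import NamedTuple


class DistributionRecord(NamedTuple):
    protocol: str  # e.g. "CSTR" or "FTP"
    argument: str  # protocol-dependent string, typically starting with a host name


def parse_distribution_record(txt):
    """Split a TXT string into its protocol name and the rest of the string."""
    parts = txt.strip().split(None, 1)
    if not parts:
        raise ValueError("empty record")
    argument = parts[1].strip() if len(parts) > 1 else ""
    return DistributionRecord(parts[0].upper(), argument)


print(parse_distribution_record("CSTR lookup.library.stanford.edu"))
print(parse_distribution_record("FTP ftpsite.CS.stanford.edu /pub/techreportlist.text"))
```

The argument is deliberately left uninterpreted, since its meaning depends on the protocol named in the first field.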
A useful feature of the DNS is that a single request may return multiple
answers. This feature can be used in two different ways, or both ways
simultaneously. First, it can be used to allow for multiple server sites,
for robustness. Thus the response
CSTR lookup.library.stanford.edu techreports
CSTR lookup2.library.stanford.edu techreports
would indicate that Stanford runs two (presumably identical) servers, for
added availability. Second, the feature can allow for multiple protocols.
Thus the response
CSTR lookup.library.stanford.edu techreports
FTP ftpsite.CS.stanford.edu /pub/techreportlist.text
NFS ftpsite.CS.stanford.edu /pub/techreportlist.text
AFS ftpsite.CS.stanford.edu /pub/techreportlist.text
would indicate that the client can choose the protocol it prefers to use,
or has available.
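One way a client might exploit multiple answers is to group them by protocol and then walk its own preference list. The sketch below assumes the record layout used in the examples above; the function name and the idea of a client-side preference list are illustrative assumptions.

```python
def choose_records(records, preferred):
    """Pick the first preferred protocol that the distributor offers.

    Returns (protocol, [argument, ...]) listing every server offered for
    that protocol, or None if no preferred protocol is available.
    """
    by_protocol = {}
    for record in records:
        protocol, _, argument = record.strip().partition(" ")
        by_protocol.setdefault(protocol.upper(), []).append(argument.strip())
    for protocol in preferred:
        key = protocol.upper()
        if key in by_protocol:
            return key, by_protocol[key]
    return None


answers = [
    "CSTR lookup.library.stanford.edu techreports",
    "CSTR lookup2.library.stanford.edu techreports",
    "FTP ftpsite.CS.stanford.edu /pub/techreportlist.text",
]
# A client that speaks only FTP and CSTR, and prefers CSTR:
print(choose_records(answers, ["CSTR", "FTP"]))
```

When several servers are returned for the chosen protocol, the client can try them in turn, which gives the added availability described above.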
The CSTR, FTP, and other protocol names in the above discussion are merely
stand-ins to explain the idea. In Proposal 4, below, we suggest a
particular protocol for use in the project.
----------------------------------------------------------------------
Proposal 3. CNRI should, initially, run the distribution site name
service. We propose that CNRI set up and manage a name server (the one
prospectively named "find.distpoints.lib" in the example above) for this
purpose, for the duration of the CS-TR project. In the future, some other
name-managing entity associated with library service might take this job
over, but it will be easier to convince someone to do that after we have
demonstrated its utility. The name can be anything convenient (e.g., we
could use "find.distpoints.cnri.reston.va.us"), because this string should
not be embedded in clients; it is better handled as a configuration
parameter associated with name type CSTR93, to be given to any client
program that proposes to automatically interpret bibliographic records.
However, we should begin planning now for future takeover, so lobbying to
obtain a name like "find.distpoints.lib" would be appropriate.
----------------------------------------------------------------------
Proposal 4. Finding the document. Proposals 1 through 3 get the client
the information needed to contact a service that knows how to interpret
document identifiers. A second step is needed to locate the document
itself. There are two subproposals here.
Proposal 4A. Find the document by using DNS again. The idea is that the
distribution site would run (or subcontract to someone else to run) a
domain name server that is prepared to offer location information about
specific documents. The response to the initial request to resolve the
name "STAN" would thus be:
DNS stan.trs.compsci.library.stanford.edu
and the client would respond by preparing another request to the domain
name system. This second request would concatenate the remainder of the
document ID field with the information returned by the first inquiry, as
follows:
resolve (name = "CSL-TR-91-004.stan.trs.compsci.library.stanford.edu",
type = "TXT")
and Stanford's name server would respond with a standard document
identification record containing, again, a protocol name, and a string to
be given to that protocol to retrieve that document. As before, the
string would contain the name of the site to be contacted and other
information needed to obtain the specific item wanted. Thus a possible
response might be,
WWW: http://docstore.CS.stanford.edu/pub/documents/TRs/TR-91-004
which provides a document locator in the World-Wide-Web standard form.
The client would interpret this string to mean: invoke the World-Wide-Web
system to obtain the document, and give it everything after "WWW:" as its
argument. Note that placing the WWW document locator in a DNS record
addresses one of the main concerns about those locators: they may change
with time. Since in this proposal the organization responsible for
distribution also manages the domain name server that contains the
document locator, that distributor can change the location or add a new
protocol without the need to notify customers or distribute new
bibliographic records.
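The second-stage lookup can be sketched as two small steps: build the name to resolve, then split the answer into a protocol and its argument. Both function names are hypothetical, and the sketch assumes the answer uses the "PROTOCOL: argument" layout of the WWW example above.

```python
def document_query_name(local_id, zone):
    """Concatenate the document's local ID with the zone from the first lookup."""
    return f"{local_id}.{zone.strip('.')}"


def parse_locator_record(txt):
    """Split an answer such as "WWW: http://..." at its first colon only."""
    protocol, sep, argument = txt.strip().partition(":")
    if not sep or not protocol.strip() or not argument.strip():
        raise ValueError(f"unrecognized locator record: {txt!r}")
    return protocol.strip().upper(), argument.strip()


name = document_query_name("CSL-TR-91-004", "stan.trs.compsci.library.stanford.edu")
print(name)
print(parse_locator_record("WWW: http://docstore.CS.stanford.edu/pub/documents/TRs/TR-91-004"))
```

Splitting at the first colon only matters here, because the locator itself contains a colon after "http".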
Again, this second name server response may contain multiple records,
indicating that the document is available via several different protocols,
or from several different servers with the same protocol, or both.
In addition, one can now see more clearly how the response to the first
name service request (to resolve the name "STAN") might be formed in the
case that there is more than one distribution site. Suppose that Berkeley
maintains a duplicate copy of the Stanford TR's but it uses only one name
server to handle both Stanford's and its own TR's. In that case, the
response to resolve "STAN" might be
DNS stan.trs.compsci.library.stanford.edu
DNS stan.trs.cstr.library.berkeley.edu
The client who receives these two records can choose whichever service
seems most appropriate. On the other hand, the response to resolve "UCB"
might be
DNS ucb.trs.cstr.library.berkeley.edu
and the name tables in the Berkeley name server will have no trouble
distinguishing requests for UCB TR's from requests for Stanford TR's.
There is one minor inconvenience in the proposal to use DNS: TR names are
constrained by the DNS naming conventions. The primary constraint that
DNS imposes is that it maps together upper- and lower- case letters. That
constraint probably does not present a problem for TR identifiers. One
might imagine that TR identifiers that contain embedded periods might be a
problem, but it happens that DNS can handle such names, as long as the TR
name is the leftmost component of the name being resolved.
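A client could check these constraints before issuing a query. The following is a sketch under stated assumptions: the function name is hypothetical, the limits are the general DNS ones (63 characters per label, 253 characters for a name in dotted text form), and lowercasing is used only to produce a canonical form for comparison, since the DNS ignores case.

```python
MAX_LABEL_LENGTH = 63   # DNS limit on a single dot-separated label
MAX_NAME_LENGTH = 253   # DNS limit on a full name in dotted text form


def checked_query_name(local_id, zone):
    """Build <local_id>.<zone>, enforce DNS length limits, and fold case."""
    name = f"{local_id}.{zone.strip('.')}"
    labels = name.split(".")
    if any(not label for label in labels):
        raise ValueError(f"empty label in {name!r}")
    if any(len(label) > MAX_LABEL_LENGTH for label in labels):
        raise ValueError(f"label longer than {MAX_LABEL_LENGTH} characters in {name!r}")
    if len(name) > MAX_NAME_LENGTH:
        raise ValueError(f"name longer than {MAX_NAME_LENGTH} characters")
    return name.lower()


# An identifier with embedded periods simply contributes several labels.
print(checked_query_name("TR.91.004", "stan.trs.compsci.library.stanford.edu"))
```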
Proposal 4B. Find the document by invoking local already-existing lookup
protocols. Proposal 4A would be used by sites that do not want to
provide their own lookup system. A site that has a lookup system, such as
PostGres, might arrange that the initial request to resolve, e.g.,
"UCB.find.distpoints.lib" return a record that says, e.g.,
POSTGRES pgserver.berkeley.edu cs-tr-data
and a properly prepared client would know that one of its options is to
open a PostGres connection to pgserver.berkeley.edu, ask to connect to the
database named "cs-tr-data", and use PostGres queries to look up the
particular technical report named in the ID field of the original
bibliographic record. (In that example, I am using free-form imagination
to fill in the starting sequence of PostGres.)
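A "properly prepared client" could be organized as a table of protocol handlers keyed by the protocol name in the returned record. The sketch below only describes what the client would do next and contacts nothing; the handler names, the dictionary layout, and the document ID "EXAMPLE-TR-1" are all hypothetical.

```python
def plan_postgres(argument, local_id):
    # Assumes the record carries "<host> <database>", as in the example above.
    host, database = argument.split()
    return {"method": "postgres", "host": host, "database": database,
            "document": local_id}


def plan_dns(argument, local_id):
    # Proposal 4A: resolve <local_id>.<zone> and ask for a TXT record.
    return {"method": "dns", "query": f"{local_id}.{argument}", "type": "TXT"}


HANDLERS = {"POSTGRES": plan_postgres, "DNS": plan_dns}


def plan_lookup(record, local_id):
    """Dispatch on the protocol name that starts a distribution record."""
    protocol, _, argument = record.strip().partition(" ")
    handler = HANDLERS.get(protocol.upper())
    if handler is None:
        raise LookupError(f"no handler for protocol {protocol!r}")
    return handler(argument.strip(), local_id)


print(plan_lookup("POSTGRES pgserver.berkeley.edu cs-tr-data", "EXAMPLE-TR-1"))
```

Supporting a new lookup protocol then means registering one more handler, without touching the bibliographic records.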
-------------------------------------------------------------------------
Proposal 5. Adopt a temporary FTP standard. As a temporary starting
point to allow experimentation within the CS-TR project, we propose that
all sites make their document and image data available by FTP, with a
standard directory-naming structure so that a distant client program can
find its way around. The key point is simply to agree on the directory
naming structure. Here we describe two strategies: take your pick...
STRATEGY 5-A
----------
We assume that all information about a single technical report is grouped
under one directory, whose name is discovered by a scheme such as
described above. Within this directory we will find a file named BIB,
which is a copy of the report's bibliographic record. Note that this file
contains useful information for retrieval, such as PAGES (the number of
pages in the report).
The report may be stored in a variety of formats. For each format there
will be a sub-directory, with a name for that format. For example, if the
directory POSTSCRIPT is found, then the report is available in PostScript.
We will agree to use the following directory names, each of which
corresponds to one format type. (This list can be extended by agreement
of the participants, but should be kept to a minimum; what follows is just
a first cut and it may already be too long.)
OCR
TEXT
IMAGE-300 (scanned image at 300dpi, tiff format)
IMAGE-600 (scanned image at 600dpi, tiff format)
POSTSCRIPT
LATEX
TROFF
TEX
The idea is not that all formats must be available for every TR, but
rather that if a format is available, it will be found with the standard
name for that format. What is found within each sub-directory depends on
the format:
In image sub-directories we will find one file for each page of the
report. Page 1 will be in file P1, page 2 in file P2 and so on.
In all other types of sub-directories there will be a single file called
DATA.
The appropriate DNS record to return for this case is
CSTR5A.FTP < server site name > < path name of the top-level directory >
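Given the top-level directory from such a record, a client can compute the paths it expects to find under Strategy 5-A. This is a sketch only: the function names and the directory "/pub/trs/example" are hypothetical, and the page count is assumed to come from the PAGES field of the BIB file.

```python
import posixpath

IMAGE_FORMATS = {"IMAGE-300", "IMAGE-600"}
SINGLE_FILE_FORMATS = {"OCR", "TEXT", "POSTSCRIPT", "LATEX", "TROFF", "TEX"}


def bib_path(top):
    """Path of the bibliographic record inside the report directory."""
    return posixpath.join(top, "BIB")


def format_paths(top, fmt, pages=None):
    """List the file paths expected for one format of one report."""
    if fmt in IMAGE_FORMATS:
        if pages is None:
            raise ValueError("image formats need the page count (PAGES in BIB)")
        # One file per page: P1, P2, ...
        return [posixpath.join(top, fmt, f"P{n}") for n in range(1, pages + 1)]
    if fmt in SINGLE_FILE_FORMATS:
        return [posixpath.join(top, fmt, "DATA")]
    raise ValueError(f"unknown format directory: {fmt!r}")


top = "/pub/trs/example"  # hypothetical top-level directory
print(bib_path(top))
print(format_paths(top, "POSTSCRIPT"))
print(format_paths(top, "IMAGE-300", pages=3))
```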
STRATEGY 5-B
-----------
Under this strategy, there is again a single directory for each technical
report, but there are no sub-directories. Instead, in the report directory
we can find files with the following names:
bib            The bib file.
NNNPPP.tif.z   A page image, where NNN is the report number and PPP
               is the page number (001, 002, ...).
all.ocr.Z      The OCR version of the report, compressed.
xxx.ps.Z       The PostScript version, compressed; xxx can be any string.
The appropriate DNS record to return for this case is
CSTR5B.FTP < server site name > < path name of the top-level directory >
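Under Strategy 5-B a client has to infer the available formats from the file names in a directory listing. The sketch below shows one way to do that; it assumes that the last three digits before ".tif" are the page number, matches the compression suffix case-insensitively, and uses a hypothetical function name and hypothetical sample file names.

```python
import re

# <report number><3-digit page number>.tif.z
PAGE_IMAGE = re.compile(r"^(?P<report>.+)(?P<page>\d{3})\.tif\.z$", re.IGNORECASE)


def classify(filename):
    """Return (kind, page_number) for a file name in a report directory."""
    if filename == "bib":
        return ("bib", None)
    if filename.lower() == "all.ocr.z":
        return ("ocr", None)
    if filename.lower().endswith(".ps.z"):
        return ("postscript", None)
    match = PAGE_IMAGE.match(filename)
    if match:
        return ("page-image", int(match.group("page")))
    return ("unknown", None)


listing = ["bib", "91004001.tif.z", "91004002.tif.z", "all.ocr.Z", "report.ps.Z"]
for name in listing:
    print(name, classify(name))
```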
Both of these strategies encode format availability information implicitly
in file names, which is a weak and inflexible approach. We really should,
as soon as possible, undertake the next step in defining a standard (like
the bibliographic record) way of describing available formats. We have
not included such a proposal here because it risks bogging us down in
elaborate detail at a time when agreement on high-level strategies is the
primary requirement.
------------------------------------------------------------------
The future: The proposal to use FTP is short-term only. It has the
defect that all participants must use a single storage format, not
necessarily the one that they would choose for their own system, so they
may end up storing two or more copies of their image files. What is
really needed is to develop a protocol and call interface that provide the
ability to retrieve an identified document more gracefully, with
negotiation about which end (or a third party) should do format
conversion, what kind of compression is appropriate, fine-grained access
control, ability to follow links, flow-control and look-ahead, etc. But
there are still many unresolved research issues regarding this type of
interface. While these issues are being debated, the simple FTP interface
will let us start exchanging documents. When a call interface is ready, it
can be implemented using the directory structure developed for FTP, and
both kinds of access can thus be used interchangeably during the
transition from FTP to a better system.
Bibliography
RFC1035 P. Mockapetris, "Domain names - implementation and
specification", 11/01/1987. (Pages=55) (Format=.txt) (Obsoletes RFC973)
(Obsoleted/Updated by RFC1348)
RFC1034 P. Mockapetris, "Domain names - concepts and facilities",
11/01/1987. (Pages=55) (Format=.txt) (Obsoletes RFC973) (Obsoleted/Updated
by RFC1348) (STD 13)
RFC1357 D. Cohen, "A Format for E-mailing Bibliographic Records", July,
1992. (Pages=12) (Format=.txt)
[RFC1348, cited in the above citations, is a minor update that adds
features unrelated to this proposal.]