Five Information Interchange Proposals

by Jerome H. Saltzer,
with pieces by Hector Garcia-Molina and ideas from many others
Version of October 22, 1993



This summer, in a series of meetings including Robert Wilensky, Hector 
Garcia-Molina, Rebecca Lasher, Andy Kacsmar, Vicky Reich, and Jerry 
Saltzer, we noted that the various CNRI participants are beginning to 
collect bodies of data, but that we have not yet put together any plans to 
exchange the data.  Out of this discussion emerged five related proposals 
intended to expedite information interchange among the CNRI participants.  
Alan Bawden, Tom Lee, Jeremy Hylton, Win Treese, and Mitchell 
Charity then helped debug the proposals.  This note describes the 
proposals and puts them on the table for debate.

First, a piece of background.  The CSTR group has agreed to use RFC-1357 
as a standard bibliographic exchange record.  That standard contains an 
open-ended (which means undefined) "RETRIEVAL" field, intended to provide 
the information needed to allow automated location of the rest of the 
document, such as the name of an FTP site and the name of the file at that 
site to transfer.  It also defines an "ID" field, a standard two-component 
document identifier, such as STAN//CSL-TR-91-004.  This identifier 
consists of, before the double slash, the identifier of the distributor of 
the document and, after the double slash, the identifier that this 
distributor uses for the document.

In discussions about use of RFC-1357 it became apparent that the concept 
of embedding retrieval information in a RETRIEVAL field is flawed.  All of 
the other information in a bibliographic record is relatively permanent, 
because it describes the document itself, which doesn't change very often.  
But the to-be-defined RETRIEVAL field has been assumed to describe the 
network site, the method of obtaining the document, and the available 
formats, all three of which we can expect to change (perhaps frequently) 
with technology and with development of new ideas.  In addition, this kind 
of retrieval information is properly the province of technologists, not of 
librarians; it is the only field in the bibliographic record that is not 
determined by bibliographic considerations.  One can thus argue that 
including this kind of retrieval information explicitly in the 
bibliographic record violates some kind of packaging or abstraction 
boundary.  The value in maintaining a set of information items about a 
document that is purely bibliographic is that the resulting bibliographic 
record can be freely circulated and there need be no inhibition about 
storing it for an indefinite time and forwarding it to others.

The first four proposals offer a solution to this problem.  The fifth 
proposal is on a slightly different topic, a short-term convention for use 
of the FTP protocol.

---------------------------------------------------------------------

Proposal 1.  Use the RETRIEVAL field for a permanent document identifier 
rather than a document locator.  For TR's, this document identifier should 
simply be a repetition of the bibliographic record ID field.  For example,

RETRIEVAL::  CSTR93:  STAN//CSL-TR-91-004

The reason for repeating the ID field, rather than simply specifying that 
retrieval should be done by using the current ID field, is to leave 
flexibility for future alternative document identifying systems (which may 
well use different kinds of resource identifiers).  The label "CSTR93" 
identifies the naming system; it indicates that the following identifier 
is to be interpreted according to the rest of this proposal.  A different 
system would have a different system naming label.
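
As an illustration only, a client might take the RETRIEVAL line shown 
above apart as in the following sketch.  The function name and the shape 
of its result are assumptions made for this example; neither RFC-1357 nor 
this proposal defines them.

```python
def parse_retrieval(line):
    """Split 'RETRIEVAL::  CSTR93:  STAN//CSL-TR-91-004' into its parts.

    Returns (naming_system, distributor, local_id).  Hypothetical helper.
    """
    field, sep, value = line.partition("::")
    if not sep or field.strip() != "RETRIEVAL":
        raise ValueError("not a RETRIEVAL field: %r" % line)
    # The label before the single colon names the naming system (CSTR93).
    system, sep, identifier = value.strip().partition(":")
    if not sep:
        raise ValueError("missing naming-system label: %r" % line)
    # The identifier has the same two-component form as the ID field.
    distributor, sep, local_id = identifier.strip().partition("//")
    if not sep or not distributor or not local_id:
        raise ValueError("malformed document identifier: %r" % identifier)
    return system.strip(), distributor, local_id
```

A client that does not recognize the naming-system label would simply 
decline to interpret the rest of the field.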

---------------------------------------------------------------------

Proposal 2.  Use DNS to locate document distributors.  We propose to use 
the first part of the CSTR93 standard document identifier as a name that, 
when looked up appropriately, returns the name of a distribution site for 
that distributor's documents.  Further, we propose to use the internet 
Domain Name System (DNS) as the resolving mechanism for this information. 

The idea is that one takes the name of the distributor found in the
standard ID field (in the example above, "STAN"), appends a universally-
agreed-upon name of a designated name server, and hands this name to the 
DNS at its standard name-resolving interface.  Suppose, for purposes of 
discussion, we agree to set up a centrally-managed name server called 
"find.distpoints.lib".  Then we ask the DNS to resolve the name 
"STAN.find.distpoints.lib".

(Credit:  the Hesiod service of Project Athena uses essentially the same 
idea to locate configuration and management information.)  The DNS has 
several advantages:  it is in place and available on any host that uses 
the Internet; it is distributed and robust; code both for using it in 
applications and for managing its data is widely available for many 
platforms; and it implements something fairly close to the required 
function.  It includes caching as well as cache time-out mechanisms, and 
it is fast.
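
The construction of the name to be resolved (distributor name followed by 
the agreed-upon server name) can be sketched as follows.  The function 
name is hypothetical, and the suffix is passed as a parameter because 
"find.distpoints.lib" is only a name chosen for purposes of discussion.

```python
def distributor_lookup_name(document_id, suffix="find.distpoints.lib"):
    """Build the DNS name used to locate a distributor.

    document_id is a two-component identifier such as
    'STAN//CSL-TR-91-004'; only the part before '//' is used here.
    """
    distributor, sep, _local_id = document_id.partition("//")
    if not sep or not distributor:
        raise ValueError("malformed document identifier: %r" % document_id)
    return distributor + "." + suffix
```

For the example identifier, the result is "STAN.find.distpoints.lib", 
which the client would then hand to its DNS resolver.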

A feature of the DNS is that, when requesting name resolution, one 
specifies what type of name record is wanted.  Several types have been 
defined for specific purposes:  an IP address (type "A"), a mail 
forwarding pointer (type "MX"), an equivalent primary name (type 
"CNAME"), and, for more general uses, an uninterpreted ASCII record.  For 
our purpose, we will have our name server return an uninterpreted ASCII 
record (type "TXT"), on which we will impose our own interpretation.

The returned record will contain two fields:  a protocol name, and a 
string of characters whose interpretation depends on the protocol.  This 
second string will typically contain the name of a host on the Internet, 
and it may also include other information that is needed to get the 
interaction started properly.  Thus the DNS request 

      resolve (name  = "STAN.find.distpoints.lib",  type = "TXT")

will return a string such as,

      CSTR    lookup.library.stanford.edu

This returned string means "use the CSTR protocol and open a connection to 
lookup.library.stanford.edu".

A site providing truly primitive service might arrange that the request

      resolve (name = "STAN.find.distpoints.lib", type = "TXT")

return the string

       FTP    ftpsite.CS.stanford.edu  /pub/techreportlist.text

which might mean that the client is expected to FTP some list of all 
technical reports, and presumably will find further details in that list.  
If a site provides that kind of response, the interpretation of the 
returned file would have to be worked out; we do not follow up that idea 
here.

The important concept here is that the client has just learned what 
protocol to use to get the desired technical report, and also the explicit 
contact information needed to initiate that protocol.
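
A minimal sketch of how a client might split one returned record into the 
protocol name and the protocol-specific string follows.  It assumes that 
fields are separated by white space, as in the examples above; this note 
does not actually fix a separator, and the function name is hypothetical.

```python
def parse_distribution_record(txt):
    """Split one TXT record into (protocol, arguments).

    Assumes whitespace-separated fields; the first field is the
    protocol name and the rest are left for that protocol to interpret.
    """
    fields = txt.split()
    if not fields:
        raise ValueError("empty record")
    return fields[0], fields[1:]
```

Applied to the primitive-service example, this yields the protocol "FTP" 
and the two arguments "ftpsite.CS.stanford.edu" and 
"/pub/techreportlist.text".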

A useful feature of the DNS is that a single request may return multiple 
answers.  This feature can be used in two different ways, or both ways 
simultaneously.  First, it can be used to allow for multiple server sites, 
for robustness.  Thus the response

      CSTR    lookup.library.stanford.edu    techreports
      CSTR    lookup2.library.stanford.edu   techreports

would indicate that Stanford runs two (presumably identical) servers, for 
added availability.  Second, the feature can allow for multiple protocols.  
Thus the response

      CSTR    lookup.library.stanford.edu    techreports
      FTP     ftpsite.CS.stanford.edu        /pub/techreportlist.text
      NFS     ftpsite.CS.stanford.edu        /pub/techreportlist.text
      AFS     ftpsite.CS.stanford.edu        /pub/techreportlist.text

would indicate that the client can choose the protocol it prefers to use, 
or has available.
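
One way a client might choose among several returned records is 
sketched below.  The preference list belongs to the client and is an 
assumption of this example, as are the function name and the 
whitespace-separated record layout.

```python
def choose_record(records, preferred):
    """Pick the first record whose protocol the client prefers.

    records   -- list of TXT strings such as 'FTP  host  /path'
    preferred -- protocol names in the client's order of preference
    Returns the fields of the chosen record, or None if nothing matches.
    """
    parsed = [r.split() for r in records if r.split()]
    for protocol in preferred:
        for fields in parsed:
            if fields[0] == protocol:
                return fields
    return None
```

When several records name the same protocol (the two-server case), this 
sketch takes the first one listed; a real client might instead try each 
server in turn until one answers.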

The CSTR and FTP, etc., protocols in the above discussion are merely 
stand-ins to explain the idea.  In proposal 4, below, we suggest a 
particular protocol for use in the project.

----------------------------------------------------------------------

Proposal 3.  CNRI should, initially, run the distribution site name 
service.  We propose that CNRI set up and manage a name server (the one 
prospectively named "find.distpoints.lib" in the example above) for this 
purpose, for the duration of the CS-TR project.  In the future, some other 
name-managing entity associated with library service might take this job 
over, but it will be easier to convince someone to do that after we have 
demonstrated its utility.  The name can be anything convenient (e.g., we 
could use "find.distpoints.cnri.reston.va.us"), because this string should 
not be embedded in clients; it is better handled as a configuration 
parameter associated with name type CSTR93, to be given to any client 
program that proposes to automatically interpret bibliographic records.  
However, we should begin planning now for future takeover, so lobbying to 
obtain a name like "find.distpoints.lib" would be appropriate.

----------------------------------------------------------------------

Proposal 4.  Finding the document.  Proposals 1 through 3 get the client 
the information needed to contact a service that knows how to interpret 
document identifiers.   A second step is needed to locate the document 
itself.  There are two subproposals here.

Proposal 4A.  Find the document by using DNS again.  The idea is that the 
distribution site would run (or subcontract to someone else to run) a 
domain name server that is prepared to offer location information about 
specific documents.  The response to the initial request to resolve the 
name "STAN" would thus be:

       DNS     stan.trs.compsci.library.stanford.edu

and the client would respond by preparing another request to the domain 
name system.  This second request would concatenate the remainder of the 
document ID field with the information returned by the first inquiry, as 
follows:

resolve (name = "CSL-TR-91-004.stan.trs.compsci.library.stanford.edu",
                         type = "TXT")

and Stanford's name server would respond with a standard document 
identification record containing, again, a protocol name, and a string to 
be given to that protocol to retrieve that document.  As before, the 
string would contain the name of the site to be contacted and other 
information needed to obtain the specific item wanted. Thus a possible 
response might be,

     WWW: http://docstore.CS.stanford.edu/pub/documents/TRs/TR-91-004

which provides a document locator in the World-Wide-Web standard form.  
The client would interpret this string to mean: invoke the World-Wide-Web 
system to obtain the document, and give it everything after "WWW:" as its 
argument.  Note that placing the WWW document locator in a DNS record 
addresses one of the main concerns about those locators:  they may change 
with time.  Since in this proposal the organization responsible for 
distribution also manages the domain name server that contains the 
document locator, that distributor can change the location or add a new 
protocol without the need to notify customers or distribute new 
bibliographic records.

Again, this second name server response may contain multiple records, 
indicating that the document is available via several different protocols, 
or from several different servers with the same protocol, or both.
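
The two steps of Proposal 4A can be sketched as follows:  form the second 
name from the document ID and the first response, then split the final 
record into a protocol and its argument.  The function names are 
hypothetical.  The sketch also assumes that the first response uses 
whitespace-separated fields and that the final record uses the 
"PROTOCOL: argument" form of the WWW example.

```python
def second_stage_name(document_id, first_stage_record):
    """Prefix the local document ID to the zone named in a DNS record."""
    fields = first_stage_record.split()
    if len(fields) != 2 or fields[0] != "DNS":
        raise ValueError("not a DNS referral: %r" % first_stage_record)
    _distributor, sep, local_id = document_id.partition("//")
    if not sep or not local_id:
        raise ValueError("malformed document identifier: %r" % document_id)
    return local_id + "." + fields[1]


def split_locator(record):
    """Split 'WWW: http://...' at the first colon only."""
    protocol, sep, argument = record.partition(":")
    if not sep:
        raise ValueError("no protocol label in %r" % record)
    return protocol.strip(), argument.strip()
```

Splitting at the first colon only matters here, because the argument 
given to the protocol may itself contain colons.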

In addition, one can now see more clearly how the response to the first 
name service request (to resolve the name "STAN") might be formed in the 
case that there is more than one distribution site.  Suppose that Berkeley 
maintains a duplicate copy of the Stanford TR's but it uses only one name 
server to handle both Stanford's and its own TR's.  In that case, the 
response to resolve "STAN" might be

       DNS     stan.trs.compsci.library.stanford.edu
       DNS     stan.trs.cstr.library.berkeley.edu

The client who receives these two records can choose whichever service 
seems most appropriate.  On the other hand, the response to resolve "UCB" 
might be

       DNS     ucb.trs.cstr.library.berkeley.edu

and the name tables in the Berkeley name server will have no trouble 
distinguishing requests for UCB TR's from requests for Stanford TR's. 

There is one minor inconvenience in the proposal to use DNS:  TR names are 
constrained by the DNS naming conventions.  The primary constraint that 
DNS imposes is that it treats upper- and lower-case letters as 
equivalent.  That constraint probably does not present a problem for TR 
identifiers.  One might imagine that TR identifiers that contain embedded 
periods might be a problem, but it happens that DNS can handle such 
names, as long as the TR name is the leftmost component of the name being 
resolved.
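
A distributor who wants to confirm that the case constraint is harmless 
for its own collection could run a check along the following lines.  This 
is a sketch only:  the function name is hypothetical, and lower-casing 
ASCII strings is used as a stand-in for the way DNS compares names.

```python
def case_collisions(local_ids):
    """Return pairs of TR identifiers that differ only in letter case."""
    seen = {}
    collisions = []
    for tr in local_ids:
        key = tr.lower()
        if key in seen and seen[key] != tr:
            # DNS would not be able to tell these two identifiers apart.
            collisions.append((seen[key], tr))
        else:
            seen.setdefault(key, tr)
    return collisions
```

An empty result means that no two identifiers in the list would be 
confused by a name server.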
 
Proposal 4B.  Find the document by invoking local already-existing lookup 
protocols.   Proposal 4A would be used by sites that do not want to 
provide their own lookup system.  A site that has a lookup system, such as 
PostGres, might arrange that the initial request to resolve, e.g., 
"UCB.find.distpoints.lib" return a record that says, e.g.,

        POSTGRES    pgserver.berkeley.edu    cs-tr-data

and a properly prepared client would know that one of its options is to 
open a PostGres connection to pgserver.berkeley.edu, ask to connect to the 
database named "cs-tr-data", and use PostGres queries to look up the 
particular technical report named in the ID field of the original 
bibliographic record.  (In that example, I am using free-form imagination 
to fill in the starting sequence of PostGres.)

-------------------------------------------------------------------------

Proposal 5.  Adopt a temporary FTP standard.  As a temporary starting 
point to allow experimentation within the CS-TR project, we propose that 
all sites make their document and image data available by FTP, with a 
standard directory-naming structure so that a distant client program can 
find its way around.  The key point is simply to agree on the directory 
naming structure.  Here we describe two strategies; take your pick.

STRATEGY 5-A
----------

We assume that all information about a single technical report is grouped 
under one directory, whose name is discovered by a scheme such as 
described above. Within this directory we will find a file named BIB, 
which is a copy of the report's bibliographic record.  Note that this file 
contains useful information for retrieval, such as PAGES (the number of 
pages in the report). 

The report may be stored in a variety of formats. For each format there 
will be a sub-directory, with a name for that format. For example, if the 
directory POSTSCRIPT is found, then the report is available in postscript. 

We will agree to use the following directory names, each of which
corresponds to one format type.  (This list can be extended by agreement 
of the participants, but should be kept to a minimum; what follows is just 
a first cut, and it may already be too long.) 

OCR
TEXT
IMAGE-300       (scanned image at 300dpi, tiff format)
IMAGE-600       (scanned image at 600dpi, tiff format)
POSTSCRIPT
LATEX
TROFF
TEX

The idea is not that all formats must be available for every TR, but 
rather that if a format is available, it will be found with the standard 
name for that format.  What is found within each sub-directory depends on 
the format: 

In image sub-directories we will find one file for each page of the 
report. Page 1 will be in file P1, page 2 in file P2 and so on. 

In all other types of sub-directories there will be a single file called 
DATA.

The appropriate DNS record to return for this case is

CSTR5A.FTP  < server site name >  < path name of the top-level directory >
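
To make the Strategy 5-A layout concrete, the following sketch computes 
the path names a client would request once it knows the top-level 
directory of a report.  The function names are hypothetical, and the "/" 
separator assumes a Unix-style FTP server.

```python
# Format directory names agreed in Strategy 5-A.
IMAGE_FORMATS = ("IMAGE-300", "IMAGE-600")
SINGLE_FILE_FORMATS = ("OCR", "TEXT", "POSTSCRIPT", "LATEX", "TROFF", "TEX")


def bib_path(report_dir):
    """Path of the bibliographic record inside a report directory."""
    return report_dir.rstrip("/") + "/BIB"


def format_paths(report_dir, fmt, pages=None):
    """List the files that hold one format of one report.

    pages is the page count (for example, from the PAGES field of BIB);
    it is needed only for the image formats.
    """
    base = report_dir.rstrip("/") + "/" + fmt
    if fmt in IMAGE_FORMATS:
        if pages is None or pages < 1:
            raise ValueError("image formats need a page count")
        return ["%s/P%d" % (base, n) for n in range(1, pages + 1)]
    if fmt in SINGLE_FILE_FORMATS:
        return [base + "/DATA"]
    raise ValueError("unknown format directory: %r" % fmt)
```

A client would typically fetch BIB first, read the page count from it, and 
only then ask for the page files of an image format.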


STRATEGY 5-B
-----------

Under this strategy, there is again a single directory for each technical 
report, but there are no sub-directories. Instead, in the report directory 
we can find files with the following names: 

bib             The bibliographic record (the bib file).
NNNPPP.tif.z    A page image, where NNN is the report number
                and PPP is the page number (001, 002, ...).
all.ocr.Z       The OCR version of the report, compressed.
xxx.ps.Z        The postscript version, compressed; xxx can be
                any string.

The appropriate DNS record to return for this case is

CSTR5B.FTP  < server site name >  < path name of the top-level directory >
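
The Strategy 5-B file names can be generated and recognized with a few 
lines of code, sketched below.  The function names are hypothetical, and 
the report number is passed as a string because this note does not say 
how many digits NNN has.

```python
def page_image_name(report_number, page):
    """File name of one page image under Strategy 5-B (NNNPPP.tif.z)."""
    if not 1 <= page <= 999:
        raise ValueError("page number must fit in three digits")
    return "%s%03d.tif.z" % (report_number, page)


def find_postscript(filenames):
    """Return the names in a directory listing that match xxx.ps.Z."""
    return [name for name in filenames if name.endswith(".ps.Z")]
```

Because the postscript file may have any prefix, the client has to list 
the directory to find it, whereas page images can be requested directly by 
name.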

Both of these strategies encode format availability information implicitly 
in file names, which is a weak and inflexible approach.  We really should, 
as soon as possible, undertake the next step in defining a standard (like 
the bibliographic record) way of describing available formats.  We have 
not included such a proposal here because it risks bogging us down in 
elaborate detail at a time when agreement on high-level strategies is the 
primary requirement.

------------------------------------------------------------------

The future:  The proposal to use FTP is short-term only.  It has the 
defect that all participants must use a single storage format, not 
necessarily the one that they would choose for their own system, so they 
may end up storing two or more copies of their image files.  What is 
really needed is to develop a protocol and call interface that provide the 
ability to retrieve an identified document more gracefully, with 
negotiation about which end (or a third party) should do format 
conversion,  what kind of compression is appropriate, fine-grained access 
control,  ability to follow links, flow-control and look-ahead, etc.  But 
there are still many unresolved research issues regarding this type of 
interface.  While these issues are being debated, the simple FTP interface 
will let us start exchanging documents.  When a call interface is ready, 
it can be implemented using the 
directory structure developed for FTP, and both kinds of access can be 
thus used interchangeably during the transition from FTP to a better 
system.


Bibliography


RFC1035  P. Mockapetris, "Domain names - implementation and 
specification", 11/01/1987. (Pages=55) (Format=.txt) (Obsoletes RFC973) 
(Obsoleted/Updated by RFC1348)

RFC1034  P. Mockapetris, "Domain names - concepts and facilities", 
11/01/1987. (Pages=55) (Format=.txt) (Obsoletes RFC973) (Obsoleted/Updated 
by RFC1348) (STD 13)

RFC1357  D. Cohen, "A Format for E-mailing Bibliographic Records", July, 
1992.  (Pages=12) (Format=.txt)

[RFC1348, cited in the above citations, is a minor update that adds 
features unrelated to this proposal.]