Yoav's Replication Notes, July 6

Date: Wed, 6 Jul 1994 15:49:19 -0400
From: yoav@reading-room-4.lcs.mit.edu
Message-Id: <9407061949.AA15247@reading-room-4.lcs.mit.edu>
Subject: replication

I'm going to attempt to explain what Mitchell and I have devised as a possible concept for a replication system. Please forgive me if I mis-explain things or leave stuff out. I think the best way to explain and develop this would be via continuous email, so here is the starting point:

By the way, neither Mitchell nor I sees this as a final draft or agrees with the complete concept as written here; it is just intended to start up discussion.


The replication system will consist of the basic replication server system, upon which many third-party services will be layered. The replication system itself will support only a very minimal protocol for file deposit and retrieval, as well as some informational queries about the server.

In our design of the system, we solved several problems by imposing constraints. First, we chose not to assume that a system would remain secure over the large time frame we were considering. We therefore chose to make all items stored in the system immutable (although we came up with several schemes to support mutation when needed). Second, we chose to design a scheme which would allow for a consistent namespace across all replicas without the need for centralized name servers or concurrency control. We therefore chose to use something unique about the file itself as its identifier.

Every object in this replicated system will be identified by a special signature. Two basic assumptions are needed for this scheme to work. First, the scheme must be designed so that the probability of an identifier collision is extremely low (the universe is more likely to end before a collision occurs). Second, the scheme must be non-reversible at the time of the file's deposit. To achieve this, the signature will consist of two parts: a signature scheme identifier, and the actual signature. The identifier should refer to a standardized scheme, while the signature should be the cryptographic signature generated by that scheme. In short, then, every object is a pair of a signature and its associated bits.
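To make the two-part identifier concrete, here is a minimal sketch in Python. The choice of MD5 as the standardized scheme and the function name are assumptions for illustration only; the actual scheme would be whatever the identifier field names.

```python
# Illustrative sketch: an object identifier is a (scheme, signature) pair,
# where the signature is the cryptographic digest of the object's bits.
import hashlib

def make_identifier(bits: bytes, scheme: str = "MD5"):
    """Return the (scheme, signature) pair identifying an immutable object."""
    h = hashlib.new(scheme.lower())  # look up the named digest scheme
    h.update(bits)
    return (scheme, h.hexdigest())

# The same bits always yield the same identifier, so the namespace is
# consistent across replicas with no central name server.
ident = make_identifier(b"some archived document")
```

Since the identifier is derived from the bits alone, any two replicas that hold the same file agree on its name automatically.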

Finally, before delving into the implementation, one more issue must be discussed in order to explain the operations of the system. We tried to understand the concept of "trust" and its implications. We defined trust to be a replica server's model of the outside world (so that it may decide which replicas' votes it will consider, and possibly with what weights). Using this, we envisioned the possibility of selective replication, and of realms of replicas. There are, however, many complex issues that arise with these, and we probably have not analyzed them fully yet.

Now, I shall delve into the implementation and operation of the replication system:

Firstly, the static scenario:

In the static model, no file is intentionally added or deleted; all changes are errors. The following errors, therefore, can occur:

  1. Data corruption in the bits. Although the signature remains the same, the associated bits have become corrupted. This is easily detected, and both the bits and the identifier should be discarded (ignored) by the system. The update cycle should then install a new, correct copy of the file.
  2. Data corruption in the signature. Since this error is usually indistinguishable from the first, the same solution is applied: both the bits and the signature are lost. ("Lost" is a rather inaccurate word, since the system would like to preserve the immutability of files even in extreme cases. Correct handling might be to spool the data off to a backup tape before erasing it, just in case.)
  3. Data corruption in both. This splits into two distinct cases. The much more likely one is that the new signature fails to match the new bits, so, as above, they are lost. The second is an almost inconceivable scenario: the new bits match the new signature. In our scheme this failure is undetectable, and the way we deal with it is simply to leave it there. (It could be detected using logs of changes and old records, but given the need for simplicity, and the very low likelihood of this both occurring AND interfering with the operation of the system, we do not.)
  4. Loss of data. Using the polling method described previously (modified to support the concept of trust sets), the server finds out which files it is missing and obtains correct copies of them from other sites. I will describe this in more detail later.
All other errors are just modifications of these, possibly on larger scales.
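Error cases 1-3 above all reduce to the same local check: do the stored bits still hash to the stored signature? A minimal sketch of such an integrity sweep, assuming a store keyed by (scheme, signature) pairs (names and the spool representation are illustrative, not from the original):

```python
# Hypothetical integrity sweep for the static scenario.  Any entry whose
# bits no longer match its signature is spooled off (the "backup tape")
# and then dropped, to be re-installed by the normal update cycle.
import hashlib

def sweep(store: dict, spool: list) -> None:
    for (scheme, sig), bits in list(store.items()):
        if hashlib.new(scheme.lower(), bits).hexdigest() != sig:
            spool.append(((scheme, sig), bits))  # preserve, just in case
            del store[(scheme, sig)]             # ignore until re-replicated
```

Note that case 3's "bits and signature both changed, yet still match" failure is exactly the one this sweep cannot see, which is why the scheme's collision resistance matters.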

Now, for the dynamic scenario:

In the dynamic scenario, files are added to the system (though not deleted) by users and third parties, as well as by the servers themselves. For this scenario, I will analyze the possible situations (from a single server's point of view) and the remedial action to be taken:

  1. A file is listed on a "majority" of servers, but not on this one. The correctional software should then simply attempt to write the same file to this server, as if the originator of the file had done so.
  2. A file is not listed on a majority of servers, but is listed on this one. The correct behavior is to leave it there. The likelihood of another file colliding with that signature should be low, AND there is a possibility that the other writes have simply not completed yet. This also allows people to use the replicas as an alternative to local disks that are not replicated.
  3. A deposit is attempted, but the signature does not match the bits. The server should refuse the file with an error.
  4. A deposit is attempted, but the signature scheme is not recognized. The server should refuse with an error code.
  5. A deposit is attempted, and the signature scheme is recognized, but has been declared "old". If the submitter is authorized (via IP address or some other mechanism) the file is accepted. Otherwise, it is rejected, with an appropriate error.
  6. A deposit is attempted, but the signature is already used. Although this is EXTREMELY rare, an appropriate error should be returned, and the file refused.
To add a file, then, the user is still required to make sure it has been added to at least a majority of servers, though now the definition of "majority" is not as simple.
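The deposit-side checks (cases 3-6 above) can be sketched as a single handler. The set names, error strings, and authorization flag below are placeholders, not part of the actual protocol:

```python
# Hedged sketch of the deposit checks from cases 3-6.  SUPPORTED and OLD
# are stand-in sets of scheme names; error codes are illustrative only.
import hashlib

SUPPORTED = {"MD5"}
OLD = set()  # schemes declared "old": deposits require authorization

def deposit(store, scheme, sig, bits, authorized=False):
    if scheme not in SUPPORTED:
        return "ERR_UNKNOWN_SCHEME"                        # case 4
    if scheme in OLD and not authorized:
        return "ERR_OLD_SCHEME"                            # case 5
    if hashlib.new(scheme.lower(), bits).hexdigest() != sig:
        return "ERR_BAD_SIGNATURE"                         # case 3
    if (scheme, sig) in store:
        return "ERR_SIGNATURE_IN_USE"                      # case 6
    store[(scheme, sig)] = bits
    return "OK"
```

Because the signature is recomputed from the bits on every deposit, a server never has to trust the submitter's claim that the pair is valid.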

There are of course other things needed to ensure correct operation. The server needs to regularly traverse its entire filesystem to detect possible errors. It also needs to check other trusted servers' lists of files to discover files it may be missing. Also, we would like to place a denial-of-service watchdog somewhere in the server's access path (most likely at the presentation end, though this could change).
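The "check other trusted servers' lists" step is essentially a set comparison followed by a quorum count. A minimal sketch, assuming each trusted peer can be asked for its full identifier list (the peer map, quorum rule, and function name are assumptions for illustration):

```python
# Sketch of the repair pass: compare this replica's identifier set with
# those of its trusted peers, and report anything a majority of them hold
# that this server is missing.  Fetching the bits themselves would then
# use the normal retrieval protocol.
from collections import Counter

def find_missing(mine: set, peers: dict) -> set:
    """peers maps a trusted peer's name to its set of identifiers."""
    counts = Counter(i for ids in peers.values() for i in ids)
    quorum = len(peers) // 2 + 1   # simple majority of trusted peers
    return {i for i, n in counts.items() if n >= quorum and i not in mine}
```

With trust sets and weights, the quorum rule here would be replaced by whatever weighted-vote policy the server's trust model dictates.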

Theoretically, this is enough to ensure reliable replication (though it should of course be tested first). It has been proposed that every {long time} a complete dump might be used to ensure correctness (mostly of the implementations of the algorithms). This might be better supported through third parties which can inform sites of possible bugs, though if chosen, I could add it to a possible implementation.

Third Party Services:

To make this replication system act more like systems that are familiar to the world, several third-party services have been proposed to improve functionality at the cost of greater complexity and possibility of failure.

The first and most obvious is a Unix Filesystem to Replicated File System mapper. Ultimately, I envision this service operating just like NFS, where a request for a file is sent out and the file is returned, all without the user being aware of the mappings, signatures, etc. This is also one of the first third-party services I would like to implement, as it is necessary for our use of this system. The extra file information can be stored as part of the "bits," and some simple encryption mechanism might be used to ensure modest security (though not reliable over the large time frame).
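One way such a mapper could work is to make a directory itself an immutable object whose bits serialize a name-to-identifier table, so the mapping rides on the same replication machinery as the files. The JSON encoding and function names below are assumptions for illustration, not the actual design:

```python
# Hypothetical sketch: a directory is stored as an ordinary immutable
# object whose bits encode a {name: identifier} table.
import hashlib
import json

def store_directory(store: dict, entries: dict):
    """entries maps file names to [scheme, signature] identifiers."""
    bits = json.dumps(entries, sort_keys=True).encode()
    ident = ("MD5", hashlib.md5(bits).hexdigest())
    store[ident] = bits
    return ident  # the directory's own content-derived identifier

def read_directory(store: dict, ident) -> dict:
    return json.loads(store[ident].decode())
```

A nice consequence is that a directory's identifier changes whenever its contents change, so "mutating" a directory means depositing a new directory object and publishing its new identifier.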

Another possibly useful service is a "latest revision" service. This helps sites that would like to provide reliable services but depend on some database file they store and could lose. This service receives regular reports of which id's each subscribing service would like associated with its latest files. (There are some security issues to address, but solutions for those already exist.) The likelihood of both the user and this service going down at about the same time is low, and even then, the worst that could happen would be a delay while the server does an exhaustive search of the files. Since this is VERY rare, the tradeoff is worth it.

Many other services could be based on this system, but for now, I think I'd like to build the base system first.

Finally, the protocol design:

Although we haven't designed the protocol fully, the following must be supported:

	1) Queries about the server:
		Media type (?)
		servers it trusts.
		signature schemes it supports. (deposits / reads)
		available storage space (?)
	2) Queries from the server:
		list of all objects
		an object by signature
		existence of object by signature (?)
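Since the protocol is not yet designed, here is only a sketch of the request vocabulary above as tagged messages. All names and the message shape are placeholders; the real wire format remains open:

```python
# Placeholder vocabulary for the two query groups listed above.
QUERIES_ABOUT_SERVER = ("MEDIA_TYPE", "TRUSTED_SERVERS",
                        "SIGNATURE_SCHEMES", "FREE_SPACE")
QUERIES_FROM_SERVER = ("LIST_OBJECTS", "GET_OBJECT", "HAS_OBJECT")

def make_request(op: str, *args) -> dict:
    """Build a tagged request; GET_OBJECT / HAS_OBJECT take an identifier."""
    if op not in QUERIES_ABOUT_SERVER + QUERIES_FROM_SERVER:
        raise ValueError("unknown operation: " + op)
    return {"op": op, "args": args}
```

Keeping the operation set this small is what lets the base system stay simple while third-party services add everything else on top.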
I think this is enough information to start a conversation about the functioning of the system. Please ask about any parts I have not made clear or that might have some conceptual flaw.



Last Update: 13 Sept 1994