Yoav's Replication Notes, July 22

Date: Fri, 22 Jul 1994 19:08:58 -0400
From: yoav@ltt2.lcs.mit.edu
Message-Id: <9407222308.AA14747@ltt2.lcs.mit.edu>
Subject: replication 2nd installment
Okay... the first message was rather short, so here I will try to explain what was misunderstood in the first, as well as better explain what was left out.

First, for the questions:

you suggest that instead of checking files as they are transmitted, the server should observe a read error and use that to detect failures. I think this leads into a question of how prudent the server should be. It is conceivable that an operator decided to remove the file, or modify it somehow from within the server. This would not lead to a read error. However, it would still be detected by the (to be discussed) update cycle. In practice, the likelihood of such a mistake is probably low, so I agree that a disk failure on a sector or track is much more likely, and therefore the server should only bother testing the read. (If we notice later that this is not true, we can always add extra checking code.)

You also suggest dividing the system into storage, failure discovery, and failure recovery. Although the discovery and recovery systems are heavily interconnected, I agree that separating them is conceptually a good idea (as well as enforcing better structure for the actual implementation).

In the letter I did not expand on the security concerns we discussed here, and so although I mentioned the assumption of lack of security (which leads to the decision for immutability of objects in the system), I never explained what was on our minds. Here it is:

Basically, we were concerned that over a network, it is hard to authenticate a person with an algorithm that lasts on our timeframe. This leaves us with two options. The first is a regular update of the algorithm. This increases the probability of bugs, and a weakness or bug in the algorithm here would allow people to forge delete requests. An angry person should never be allowed to erase the Constitution or the Satanic Verses. The alternative is to simply not allow any deletions to occur. This seems sensible, but poses two distinct problems.

The first is that the new restriction disallows authentic updates to a file. However, this can be overcome by the use of a third party, which can always point to a new version (thereby moving the security concerns to it; anybody who wishes not to trust it can attempt to devise their own update scheme). As for size, delta encoding could be used to minimize space costs.

The other effect of this requirement is the inability of a replica to 'throw away' data that the other replicas do not have. This, however, is also a feature, since the data is guaranteed correct (due to the later-discussed signature scheme) and is just not in the other replicas. This can be used to allow people to deposit files they do not want replicated, or to allow people to take time in updating a majority. If you know a file has been deposited in a replica, you know it will stay there for as long as the replica can hold it. It is still your job to make sure a majority of the replicas receive this file before you can assume it will be preserved.

the signature:

we chose the signature as the way to ensure a unique namespace (so that nobody has to worry about names mutating or conflicts between two parties for the same name). We assume that an algorithm could be designed such that engineering a file from the signature is impossible (at the time of deposit), and add the version identifier so that algorithms can be changed with time. Since we no longer have to worry about people intentionally overwriting files, the algorithm does not have to be extremely strong; it is more of a deterrent against attempting to cause collisions and faking files.

a signature is created from the contents of the file, and is used as a name in a flat namespace (since the signature is very large, the implementation will make a mapping from signature to internal name, but that is not relevant for design). A problem with this is new versions of files: a file can no longer be named "foo.2" to imply a newer version. We chose not to add another field to the signature implying version, but instead use a third party service that helps many different services and clients decide on the latest version of their files. This third party collects updates from clients and stores those updates locally. The clients can make requests for their latest files, which it can immediately inform them of. In turn, this update service also uses the replica system to write out its own data (checkpoints). This data is not referred to unless a combination of loss of data AND a request for a version is made. Then, the service has to request EVERY file from the server and test it (using some secure algorithm) to see if it is the latest version of its data. (A very unlikely scenario, but one that may lead to loss of some data by the clients.)
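To make the naming idea concrete, here is a rough C++ sketch. Everything in it is hypothetical: the struct names, the scheme number, and especially the hash, which uses non-cryptographic FNV-1a purely as a stand-in for whatever real, much stronger signature algorithm gets chosen.

```cpp
#include <cstdint>
#include <map>
#include <string>

// A signature is the scheme's version identifier plus a content-derived value.
struct Signature {
    int scheme;               // version identifier, so schemes can change with time
    std::uint64_t digest;     // content-derived value: the flat-namespace name
    bool operator<(const Signature& o) const {
        return scheme != o.scheme ? scheme < o.scheme : digest < o.digest;
    }
    bool operator==(const Signature& o) const {
        return scheme == o.scheme && digest == o.digest;
    }
};

// Stand-in hash: FNV-1a over the file's bytes (NOT collision-resistant).
Signature sign(int scheme, const std::string& contents) {
    std::uint64_t h = 1469598103934665603ull;     // FNV-1a offset basis
    for (unsigned char c : contents) {
        h ^= c;
        h *= 1099511628211ull;                    // FNV-1a prime
    }
    return Signature{scheme, h};
}

// The implementation-side mapping from signature to internal file name.
using NameMap = std::map<Signature, std::string>;
```

Note that because the scheme identifier is part of the signature, the same bits deposited under two different schemes get two different names, which matches the deposit rules discussed later.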


since we basically assign a signature to every file, it is functionally a flat namespace. The protocol has not been designed yet, but there are two options for requesting a complete list of names. One is single request, single answer (a complete dump of names). The other is like a dbm approach: give me the first id, give me the id following this id... so that for every new name, a new request is made. I favor the first, but depending on usage and network load, the latter might be a better bet (perhaps using blocks of ids instead of one by one).
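The two listing styles can be sketched as interfaces; the function names and the use of integers as stand-in signatures are illustration only, since the protocol is undesigned.

```cpp
#include <optional>
#include <set>
#include <vector>

using Id = unsigned long;   // stand-in for a (much larger) signature

// Option 1: single request, single answer - a complete dump of all names.
std::vector<Id> list_all(const std::set<Id>& names) {
    return std::vector<Id>(names.begin(), names.end());
}

// Option 2: dbm-style cursor - "give me the id following this id",
// returning nothing once the namespace is exhausted.
std::optional<Id> next_after(const std::set<Id>& names, Id id) {
    auto it = names.upper_bound(id);
    if (it == names.end())
        return std::nullopt;
    return *it;
}
```

The blocked variant mentioned above would simply return the next k ids after the cursor instead of one.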


in relation to the concept of "trust" you said "Concept not found in sequel." were you referring to the database language? I didn't follow the question, so am unable to explain any better. I suspect you might want to discuss this concept with mitchell, as I think he understands it better than me.

> Absolutely not true.  By reference to another replica you could figure out
> which is right.  What you mean to say is that you CHOOSE not to
> distinguish, for simplicity.
I stand corrected.. yes, I choose to not distinguish, since the method of repair is simplest when the two are dealt with similarly.

replication of files:

In the letter I sent I explained that a single file that does not exist on other replicas should be left there (presumably forever). you noted that this would add unneeded complexity, and claim that the file should either be copied into every replica or blown away. Since blowing the file away conflicts with the previous design (no object is to be removed if the signature and object match), we would need to either modify the design to allow for that exception, or force the file into every replica. I suspect leaving it there and not forcing it into the others is still a viable concept, but I strongly suggest mitchell step in here and comment, since he made the strong argument for not deleting the file.

>>        6) A deposit is attempted, but the signature is already
>>used. Although this is EXTREMELY rare, an appropriate error should be
>>returned, and the file refused.

>If the repair system copies files that the user hasn't gotten around to
>depositing on all the replicas, this event will be common.
true... I misstated. An error should not be reported if the signature AND the data match; only if the two do not match should the deposit be refused. This leads to a possible problem: if two distinct files with the same signature are deposited, neither can win the majority (since neither can force the deletion of the other). The only real answer to this is that the probability of such an event occurring (two identical IDs generated in less than the propagation time) is SO low that ten million universes are more likely to end etc...
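A sketch of the corrected rule, with hypothetical names: re-depositing identical bytes under an existing signature succeeds quietly (which is exactly what the repair system does when copying between replicas), and only a genuine mismatch is refused.

```cpp
#include <map>
#include <string>

// Three possible outcomes of a deposit; only a signature collision with
// different bytes is an error.
enum class DepositResult { Stored, AlreadyPresent, Refused };

DepositResult deposit(std::map<std::string, std::string>& store,
                      const std::string& sig, const std::string& data) {
    auto it = store.find(sig);
    if (it == store.end()) {
        store[sig] = data;                    // new object: accept it
        return DepositResult::Stored;
    }
    if (it->second == data)
        return DepositResult::AlreadyPresent; // repair copy, not an error
    return DepositResult::Refused;            // real collision: refuse the file
}

// Demonstration of the three outcomes against one store.
bool deposit_rule_holds() {
    std::map<std::string, std::string> store;
    return deposit(store, "sig", "abc") == DepositResult::Stored
        && deposit(store, "sig", "abc") == DepositResult::AlreadyPresent
        && deposit(store, "sig", "xyz") == DepositResult::Refused;
}
```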

>>        to add a file, then, it is still required that the user make
>>sure it has been added to at least a majority of servers,

>That contradicts your previous statement that a file, if found only one
>place, stays there.  I prefer this one.
again, I should be careful when I make statements such as that. what I meant was: to add a file so that it may be replicated, a person must add it to a majority within the trust set.

finally, concerning complete dumps: they are not really necessary. What would be involved is a random file being chosen (in such a way that all files will be covered by the algorithm) and read, so that its contents may be compared to the local contents. Considering we assume the signature guarantees correctness of the file, I assume that level of guarantee is not necessary (especially since most failures are not related to bits changing, but to entire data being corrupted).

Now, onto a slightly better discussion of how things work (this is going to be slightly general, and not oriented like an analysis paper, so please forgive me if it rambles or is unclear).

several machines are set up to be replicas within distinct trust sets (this setup is VERY important - it can set up mirrors and real replicas, but a mistake here can lead to bizarre behavior). For example:

  /---\      /---\
  | A |------| B |
  \---/      \---/
    |          |
    |          |
  /---\      /---\    /---\
  | C |------| D |----| E |
  \---/      \---/    \---/
is a setup, with the following trust sets:
  • A: B, C, D
  • B: A, C, D
  • C: A, B, D
  • D: A, B, C
  • E: D
    if the servers are queried for their trust set, each returns the respective values. This translates to a true replica formed by A, B, C, D, plus a mirror of the replica (in truth a mirror of D). So an addition to any one of these would stay there, plus an addition to D would get copied to E. (E might refuse any additions from other addresses so it could be a true mirror site.)

    an addition to two of A, B, C, or D, would result in ALL the sites having the object. These sites are internally something like this:
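One hypothetical reading of that majority rule: a replica decides to fetch an object once a strict majority of the servers in its own trust set report holding it. The function name and single-letter server names are illustration only.

```cpp
#include <set>

// Returns true when a strict majority of the replica's trust set holds
// the object, i.e. when this replica should fetch a copy for itself.
bool should_fetch(const std::set<char>& trust_set,
                  const std::set<char>& holders) {
    std::size_t n = 0;
    for (char server : trust_set)
        if (holders.count(server))
            ++n;                          // trusted peers holding the object
    return 2 * n > trust_set.size();      // strict majority
}
```

With the figure's sets, an object deposited on B and D is fetched by A (2 of the 3 servers A trusts hold it), and anything held by D is fetched by E, whose trust set is just {D} - which is why E behaves as a mirror.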

    connections are made by an external person. there is no authentication necessary, but service can be denied at this stage (IP address based for example).

    once a connection is made, the protocol requests proceed. Again, deposits and possibly reads can be denied based on IP address or such.

    objects can be deposited, retrieved, and lists of all objects requested.

    information about the server may also be requested: the trust set, the uptime/downtime ratio or disk drive MTTF or some other quality method (we're still discussing this). currently accepted signature schemes.

    then, a file is deposited. It must be deposited with an acceptable signature scheme, and must be deposited with the same scheme on all other replicas in the set. (Schemes should rarely change, and changeover times should be designed so that nobody would be unable to deposit a file for replication.) Although nothing enforces this, it is to the user's advantage, since two files with the same bits but different signature schemes are considered different.

    on a regular basis (unspecified at the moment), the replica traverses its namespace, looking for errors (either read failures or signature mismatches - I suspect both is our best bet). when it finds them, it takes the action specified in the previous mail (usually spooling the file offsite).
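That integrity walk might look something like the following, under the assumption that the store maps signature to contents and that sign() is whatever scheme the replica accepts. A read failure would surface at the same point; this sketch shows only the signature-mismatch half.

```cpp
#include <functional>
#include <map>
#include <string>
#include <vector>

// Walk the namespace, recompute each stored object's signature, and return
// everything whose stored name no longer matches its bits - candidates for
// offsite spooling and later repair from the trust set.
std::vector<std::string> find_damaged(
    const std::map<std::string, std::string>& store,   // signature -> contents
    const std::function<std::string(const std::string&)>& sign) {
    std::vector<std::string> damaged;
    for (const auto& [sig, contents] : store)
        if (sign(contents) != sig)
            damaged.push_back(sig);
    return damaged;
}
```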

    also on a regular basis, the replica requests a listing of files from its trust set. using the voting algorithm described previously, it decides which files it is missing and should have, and proceeds to obtain those files. Since these files may have signatures from old schemes the replica no longer accepts, a special case for writes has to be made for this update process. the replica will refuse additions using old schemes except when it itself is adding a file from another replica in its trust set.
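One pass of that voting step can be sketched as follows. The function and its majority threshold are my assumptions about the "voting algorithm described previously"; signatures are plain strings for illustration.

```cpp
#include <map>
#include <set>
#include <string>
#include <vector>

// Given listings gathered from the trust set, return the signatures this
// replica lacks but a strict majority of its trusted peers hold - the files
// the update cycle should go fetch.
std::set<std::string> missing_majority(
    const std::set<std::string>& local,
    const std::vector<std::set<std::string>>& peer_listings) {
    std::map<std::string, std::size_t> holders;
    for (const auto& listing : peer_listings)
        for (const auto& sig : listing)
            ++holders[sig];
    std::set<std::string> wanted;
    for (const auto& [sig, count] : holders)
        if (2 * count > peer_listings.size() && !local.count(sig))
            wanted.insert(sig);
    return wanted;
}
```

Note that a file held by only one peer never crosses the threshold, which is consistent with the earlier decision that such lone files simply stay where they are.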

    Third party services:

    NFS to RepFS mapper:

    This is probably the most useful service, and currently the one I fear most to write... Basically, it creates another interaction layer between clients and the replicas, so that any NFS capable system (such as UNIX) could interact with RepFS without editing any code. This system makes a map between the original filename submitted and the resulting bits it creates (these bits include a lot of information about the file). It then adds this to the internal mapping, so that a request for the file can be processed and dealt with as if it were an NFS mounted file system.

    revision server:

    This was described before.

    Client services:

    I distinguish this from third party services merely in the fact that it would be preferable for a client to do this, although it could be set up as a third party service.

    some service that attempts a write to a majority in a trust set, and keeps trying until it is convinced a majority have the files.
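A sketch of that client-side helper, under assumed names: try_deposit stands in for the undesigned protocol call, and max_rounds bounds the retrying (a real client might instead back off and retry indefinitely).

```cpp
#include <functional>
#include <set>
#include <string>

// Keep attempting deposits until a strict majority of the trust set has
// acknowledged the file, skipping servers that already acked.
bool write_to_majority(const std::set<std::string>& trust_set,
                       const std::function<bool(const std::string&)>& try_deposit,
                       int max_rounds = 3) {
    std::set<std::string> acked;              // servers known to hold the file
    for (int round = 0; round < max_rounds; ++round) {
        for (const auto& server : trust_set)
            if (!acked.count(server) && try_deposit(server))
                acked.insert(server);
        if (2 * acked.size() > trust_set.size())
            return true;                      // a majority now has the file
    }
    return false;                             // caller should try again later
}
```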

    Issues to consider:

    should replicas be able to change their trust sets based on previous histories? I currently believe this should be done rarely, and not automatically, for fear of possible trust set failures.

    how often should a replica traverse its trees? should a replica make a guarantee of how long it would take the file to propagate to all other servers in the set?

    protocol: not much time has been spent on exactly how the protocol should function.

    many more issues that I probably forgot...

    anyway, I hope you and mitchell both get this (I forwarded your reply to him so that he could see it too)

    oh, and I've started thinking and trying out several approaches for coding design. I'm currently favoring an object oriented approach (C++ at the moment) and am testing out some concepts. This, I believe, will make it easier to modify stuff: a signature is a class, a fitem contains the local filename and a signature, a replica is composed of fitems, etc. I hope this will make it a lot easier to modify and play around with the code, as well as help generate profiling and debugging data and logs.
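A rough rendering of that class layout. Only the names signature/fitem/replica come from the text; the fields, methods, and example path are guesses at what each might hold.

```cpp
#include <string>
#include <vector>

struct Signature {
    int scheme;                // which signature scheme produced it
    std::string value;         // the content-derived bits
};

struct FItem {
    std::string local_name;    // where the bits live on this replica
    Signature signature;       // the object's flat-namespace name
};

struct Replica {
    std::vector<FItem> items;
    void add(const FItem& item) { items.push_back(item); }
};

// Tiny demonstration: a replica with one deposited item.
std::size_t demo_replica_size() {
    Replica r;
    r.add(FItem{"/spool/0001", Signature{1, "abcd"}});
    return r.items.size();
}
```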

    well, anyway, the above is just me trying to set more concepts straight in my mind (code design helps me notice logic flaws sometimes). but I would really like to have a design to start working on soon. And I suspect mitchell would like to have some paper on this topic soon, too.

    anyway, comments and ideas appreciated..


    Last Update: 13 Sept 1994