Storage Service Replication
Design Thoughts

Ideas from Mitchell Charity, Win Treese, and Jerry Saltzer
Library 2000

(Draft of 1/7/93)

------------------------------------------------------------

Criteria for design:

1.  The storage service should be able to work on top of any standard
operating system, file system, and disk storage system without
requiring change or special features.

2.  In this application, instant availability of updates to reading
clients is not critical.  Specifically, if the updating mechanism is
out of service, prospective updates can be queued by the updating
clients, as long as the time of queuing is bounded.

3.  Similarly, it is not essential that all replicas return identical
answers with respect to recently changed data.  Specifically, it is
acceptable that updates are visible at one replica before they are 
visible at another, as long as the time of disparity is bounded.

4.  The intended scale and consequent configuration look like this:

    Design center:  1000 maximum-size disk drives.
                    1992:  1000 X 2 Gbytes.  20 Million page images
                                             (50,000 books)
                    2000:  1000 X 100 Gbytes.  1 Billion page images
                                             (2.5 Million books)
    Design limit:   At least 10 times larger.

At the design center, a 1992 server site configuration might consist
of about 20 server processors, each with 50 5-inch Winchester disks
and 128 Mbytes of memory, interconnected with an Ethernet.  A year
2000 server site would probably have the same number of (faster)
processors and (bigger) disks, with more memory and a
higher-performance network.  The currently planned L2000 prototype is
about one tenth the size of the design center.  Both in 1992 and in
2000 a server site consists of about $2.5M worth of equipment.

5.  Simplicity:  Because it is difficult to be confident that a large
scale system is actually operating the way one intended, the
replication scheme should comprise a small number of relatively
orthogonal, independent mechanisms that obviously do the right thing
and that can easily be verified to be working when in the field.  Thus
simplicity is strongly preferred over a more efficient but more
complex mechanism, or even a very elegant, but potentially fragile,
design.

6.  All interactions among replicas should be accomplished via
(extensions to) the storage service protocol, rather than by
introducing a new protocol or introducing dependence on some other
network protocol.

7.  The data is assumed to have a lifetime far exceeding the expected
lifetime of any particular storage medium and also far exceeding the
mean time between major disasters (hurricane, earthquake, civil
unrest) that may completely destroy a storage site.

8.  All expected events are to be tolerated without loss of data and
also logged for analysis of their occurrence rate and pattern.  To the
extent physically possible, all intolerable events should be detected
and reported.

------------------------------------------------------------

Expected events:

1.  A disk returns data values different from those written on it.
The erroneous data may or may not be accompanied by an error alert.

2.  A disk returns correct data.  The correct data is accompanied by
an error alert.

3.  A disk goes up in smoke; the data on it is gone forever.

4.  A disk that had previously been handled as irreparable is
repaired, and some or all of the data it was storing is now
potentially available.

5.  A site becomes disconnected from the network temporarily.

6.  A site (or its network attachment) is damaged so badly that it
cannot be repaired soon enough to be useful as a participant in the
replication strategy.

7.  A badly damaged site is eventually repaired and some (perhaps all)
of the data that was stored there again becomes potentially available.

8.  A new site is added to the replication system.

9.  The disk storage devices at a site are completely replaced by a
new generation of storage devices.

Intolerable events:

1.  Expected events at a majority of sites cause the same data to be
inaccessible (or destroyed) in the same 24-hour period.

---------------------------------------------------------------

Derivation of plan.

1.  There are N replicants.  Each replicant is a complete, independent
storage site, separated geographically from the other replicants.
Each replicant maintains a complete copy of the data.  The
geographical separation and number of replicants are chosen to ensure
that any single large-scale disaster does not affect a majority of the
replicants.

2.  A configuration record, with a well-known UID and available from
any of the replicants, lists the names of the replicants.

3.  Assume first that the data never changes.  Each replicant
maintains, in addition to the data, three values that it makes
publicly available:  the checksum of the data, the timestamp of the
last time it tried to verify the checksum, and a verification
success/failure indicator.

4.  Once every check period (assume 24 hours), each replicant does the
following:

   -  Try to calculate a consensus value for the checksum by reading the
      posted checksums, timestamps, and success/failure indicators of all
      replicants.  
         IF the posted timestamp is less than 2 check periods old AND the
            posted indicator says success, then accept the checksum.
         IF the posted checksums of more than half the replicants are
            acceptable AND agree in value, then take that value
            as the consensus value.
         OTHERWISE call for help (intolerable event detected).

   -  Read all the data at this replicant and compute a new tentative
      value for the checksum.  (N.B., ignores read errors!)

   -  Compare this new tentative value with the consensus value.
         IF identical, force success = true,
         OTHERWISE force success = false.
         IF the consensus value differs from the local posted checksum,
            replace the local posted checksum with the consensus value.
         Post the new timestamp.

   -  If success = false, get data from at least two of the sites that
      agreed in the consensus, compare, and if identical replace the
      local copy with the new data.  If they are not identical, call
      for help (intolerable event detected).

   -  After the repair is complete, immediately repeat the
      verification procedure.

Goal:  as long as fewer than half the replicants have a data failure
in a given 24-hour period, the algorithm restores the data to the
correct value at all replicants.
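
In Python-like pseudocode, the daily check described above might look
roughly like the sketch below.  All names here (consensus_checksum,
verify_replicant, the shape of the posted triples) are illustrative
assumptions, not part of any defined protocol.

    import time
    from collections import Counter

    CHECK_PERIOD = 24 * 60 * 60        # assumed check period, in seconds

    def consensus_checksum(posted, n_replicants, now=None):
        # posted: list of (checksum, timestamp, ok) triples, one per
        # replicant whose status could be read.  Returns the consensus
        # checksum, or None if no acceptable majority exists.
        now = time.time() if now is None else now
        acceptable = [c for (c, t, ok) in posted
                      if ok and now - t < 2 * CHECK_PERIOD]
        if not acceptable:
            return None
        value, count = Counter(acceptable).most_common(1)[0]
        return value if count > n_replicants // 2 else None

    def verify_replicant(local_checksum, posted, n_replicants):
        # Returns (success, checksum_to_post); the caller posts the
        # result, repairs on failure, and re-runs the verification.
        consensus = consensus_checksum(posted, n_replicants)
        if consensus is None:
            raise RuntimeError("no consensus -- intolerable event")
        return local_checksum == consensus, consensus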

5.  In the course of ordinary operations, an attempt to read may return an
error.  Each such error is posted to a log, and following a
verification sweep each item added to the log since the previous
verification is read to see if it still returns an error.  If it does,
call for help.
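
A small sketch of that re-check step, assuming the log is simply a
list of item names and that a read_item callable returns True on a
clean read (both are assumptions, not defined here):

    def recheck_read_errors(logged_items, read_item):
        # Re-read every item logged since the previous verification;
        # anything that still fails is treated as an intolerable event.
        still_failing = [name for name in logged_items if not read_item(name)]
        if still_failing:
            raise RuntimeError("persistent read errors: %r" % still_failing)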

----------------

Elaboration, with more files:

6.  One checksum per file.

----------------

Elaboration, with update:

7.  One of the replicants is designated Master; the remaining
replicants are Slaves.  This designation is added to the file that
lists the names of the replicants.  When a client creates, modifies,
or deletes a data record, it does so by contacting the Master.  When a
client reads a data record, it does so by contacting any replicant.

8.  A record is stored as a named directory containing files in the
local file system.  For each record, the system stores {UID, Version,
contents}.  UID and Version are encoded into the directory name.
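
One possible encoding of {UID, Version} into a directory name is
sketched below; the exact format (a ".v<version>" suffix) is an
assumption, since this memo does not fix one.

    import os

    def record_directory(root, uid, version):
        # e.g. record_directory("/store", "0123456789abcdef", 3)
        #        -> "/store/0123456789abcdef.v3"
        return os.path.join(root, "%s.v%d" % (uid, version))

    def parse_record_directory(path):
        # Inverse of record_directory: recover (uid, version) from a name.
        uid, _, version = os.path.basename(path).rpartition(".v")
        return uid, int(version)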

9.  When any replicant, whether Master or Slave, creates a new (or
deletes an old) directory or a file within a directory, it also makes
a note in that replicant's Add/Delete log consisting of the
directory's (or file's) name and a check value on its
contents (for example an 8-byte CRC).

10.  Replicants perform modifications, either to a directory or to the
files stored therein, by creating a new version of the directory or
file, deleting the old version, and placing two entries in the
Add/Delete log.
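
The sketch below illustrates the intended log discipline: one entry
per add or delete, two entries per modification.  The entry layout and
the use of a 32-bit CRC (zlib.crc32) in place of the suggested 8-byte
CRC are assumptions made only for illustration.

    import time, zlib

    def check_value(data):
        # 32-bit CRC used here as a stand-in for the 8-byte check value.
        return zlib.crc32(data) & 0xffffffff

    def log_add(log, name, data):
        log.append(("add", time.time(), name, check_value(data)))

    def log_delete(log, name, data):
        log.append(("delete", time.time(), name, check_value(data)))

    def log_modify(log, old_name, old_data, new_name, new_data):
        # A modification is recorded as a delete of the old version
        # plus an add of the new one -- two log entries.
        log_delete(log, old_name, old_data)
        log_add(log, new_name, new_data)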

11.  At each replicant an array of status summary demons runs
continuously at low priority.  They are launched with three
parameters:  a node in a storage hierarchy, the name of an output file
outside that storage hierarchy, and the name of the Add/Delete log.
Each demon scans the given hierarchy and for each file found it reads
the file and recalculates the check value.  Its output consists of a
dated file containing {tree name, check value} pairs.  The status
summary demon logs to the Add/Delete list whenever it begins or ends a
status scan.  The number of demons is chosen to assure that all data
under protection of the service is scanned every N time units.  For
discussion, assume that N is 24 hours.
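
A sketch of one status-summary pass follows.  The output format (one
{relative path, check value} pair per line), the reuse of crc32 as the
check value, and the exact log-entry shapes are assumptions; the
scan-start and scan-end notes follow the description above.

    import os, time, zlib

    def status_scan(root, output_path, add_delete_log):
        # output_path should live outside the hierarchy being scanned.
        add_delete_log.append(("scan-start", time.time(), root))
        with open(output_path, "w") as out:
            for dirpath, _, filenames in os.walk(root):
                for fname in sorted(filenames):
                    path = os.path.join(dirpath, fname)
                    with open(path, "rb") as f:
                        cv = zlib.crc32(f.read()) & 0xffffffff
                    out.write("%s %08x\n" % (os.path.relpath(path, root), cv))
        add_delete_log.append(("scan-end", time.time(), root))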

12.  At each replicant a status review demon runs whenever a status
summary demon creates a new status summary file.  This second demon
compares the new status file with the next older status file (detailed
algorithm to be defined) and looks for differences.  If there are no
differences, it marks the new status file as confirmed and forwards
copies to each of the replicants.  If there are any differences, it
prepares a difference report log and a repair plan.  There can be
three kinds of differences:

     1.  Bit failure.  File appears in both new and old status lists,
         but with different check values.

     2.  Mysterious disappearance.  File appears in old status list
         but not new.

     3.  Mysterious appearance.  File appears in new status list but
         not old.

Because files are never modified, all bit failure differences are by
definition media failures that require repair, and go into a repair
list.
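
Representing a status summary as a {name: check value} mapping (an
assumption about its format), the three-way classification reduces to
a simple comparison:

    def classify_differences(old, new):
        # old, new: {file name: check value} for consecutive summaries.
        bit_failures   = [n for n in old if n in new and old[n] != new[n]]
        disappearances = [n for n in old if n not in new]
        appearances    = [n for n in new if n not in old]
        return bit_failures, disappearances, appearances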

13.  Reference to the add/delete log allows the status review demon to
separate mysterious appearances and disappearances into failures and
updates.  To count as an update, the file must appear in the add/delete
log after the previous status summary started.  For new files the
check value in the add/delete log should match the check value of the
mysteriously appearing file.  (The date-checking algorithm needs some
careful thought.)  Mysterious appearances and disappearances that
cannot be accounted for by review of the add/delete log, and
mysterious appearances whose check values do not match the check
values in the log, go into the repair list.
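
A sketch of that accounting step, using the log-entry layout assumed
in the earlier sketches and the naive date test that still needs
careful thought:

    def unexplained(appearances, disappearances, new_checks, log,
                    prev_scan_start):
        # Entries logged since the previous status summary started.
        recent  = [e for e in log if e[1] >= prev_scan_start]
        added   = {name: cv for (op, t, name, cv) in recent if op == "add"}
        deleted = {name for (op, t, name, cv) in recent if op == "delete"}
        repairs = []
        for name in appearances:
            if added.get(name) != new_checks.get(name):
                repairs.append(name)    # unexplained, or check value mismatch
        for name in disappearances:
            if name not in deleted:
                repairs.append(name)    # unexplained disappearance
        return repairs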

14.  At each Slave replicant a what's-new demon takes the initiative to
ask the Master replicant for a list of changes to its database.  It
then updates that Slave replicant's database to include the changes
listed in the what's-new list.  This is the mechanism by which copies
are propagated to the replicants.  Each Slave replicant maintains both
a database and an add/delete log that is identical to, though perhaps
lagging, the corresponding database and add/delete log of the Master.
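
A minimal sketch of the Slave-side pull, assuming the change list is a
transaction-id-tagged sequence of add/delete entries and that an
apply_change callable performs the actual update (both assumptions;
the only requirement here is that the exchange ride on the storage
service protocol):

    def pull_whats_new(master_changes, slave_log, slave_db, apply_change):
        # master_changes: list of (txid, op, name, check_value) entries in
        #   transaction order, as returned by the Master's what's-new call.
        # slave_log: this Slave's copy of the log, here assumed to use the
        #   same transaction-id-tagged entries (possibly lagging).
        last_seen = slave_log[-1][0] if slave_log else 0
        for entry in master_changes:
            if entry[0] > last_seen:
                apply_change(slave_db, entry)   # update the database...
                slave_log.append(entry)         # ...and the add/delete log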

15.  A client may ask what's-new of either the Master or a Slave,
interchangeably; the only difference is that a Slave may respond with
an older last-transaction-id than would the corresponding Master.  The
one slightly odd case is when a client has asked the Master what's-new,
and soon thereafter asks a Slave; the client must be prepared for
the possibility that the last-transaction-id returned by the Slave is
older than one already received from the Master.
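
A sketch of the client-side behavior: keep a high-water mark and
ignore older replies (names are illustrative):

    class WhatsNewClient:
        # Tracks the newest last-transaction-id seen from any replicant.
        def __init__(self):
            self.high_water = 0

        def note_reply(self, last_transaction_id):
            # A lagging Slave may legitimately return an older id than
            # the Master did; simply keep the maximum seen so far.
            self.high_water = max(self.high_water, last_transaction_id)
            return self.high_water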

16.  A repair demon services the repair list.  Bit failures and
mysterious disappearance failures are corrected by getting a good copy
from another replicant.  Mysterious appearance failures are corrected
by deleting them.  All corrections are logged to a correction log
rather than the add/delete log.

17.  A reconciliation demon reviews the status summary files of the
other replicants and compares them with this replicant's own, looking for
differences that cannot be accounted for by recent update transactions
that have not yet been propagated.  Whenever a status summary arrives
from a sibling replicant the reconciliation demon compares it with the
most recent local status summary.  In this comparison three kinds of
difference may appear:

     1.  File present locally but not in sibling.  Ignore.  (The sibling
         will pick this up and repair it as a difference of type 2.)

     2.  File present in sibling but not locally.  Post to the repair
         list, with the intent that the repair demon will get a copy
         from the sibling.

     3.  File present both places, but with different check values.
         Post for human intervention.  (Possible automatic N-way vote
         to decide which copy is good.)
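
With status summaries again represented as {name: check value}
mappings, the reconciliation comparison and its three actions might be
sketched as:

    def reconcile(local, sibling, repair_list, attention_list):
        # local, sibling: {file name: check value} status summaries.
        for name, cv in sibling.items():
            if name not in local:
                repair_list.append(name)        # case 2: copy from the sibling
            elif local[name] != cv:
                attention_list.append(name)     # case 3: human intervention
        # Case 1 (present locally but not in the sibling) is deliberately
        # ignored; the sibling will repair it from its own side.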

---------------

To do:

The reconciliation demon's repair procedure is redundant with the
what's-new mechanism.

Newly-created and newly-modified files should start in a "fragile"
state and later enter a "robust" state.  The property of the robust
state is that a file in that state is guaranteed to be preserved in
the face of all failures that can reasonably be contemplated.  (The
definition needs to be sharpened up.)

The client that creates or modifies a file is responsible for checking
back a day or so later to verify that the file has evolved to the robust
state.


A replicant is probably an array of servers.  Coordination of the
members of the array hasn't been addressed.

Need a plan for handling a completely dead master (redesignation of
one of the slaves as master, creation of a new slave site, and what to
do if later the old master comes back up at least far enough to look
at its log.)

Doesn't handle corrupted logs.

