Storage-Replication Service in Library 2000
Komkit Tukovinit
kom@mit.edu
This proposal is to fulfill the first part of the requirement of the
6.199 Advanced Undergraduate Project. The proposed project is to
build a scaled-down, working version of the storage-replication
service that will be used in the Library 2000 (L2000) project. The
completed system will be able to receive and store data reliably.
Moreover, a mechanism will be built to simulate faults that can occur
within the storage system.
1. Introduction
The technology of on-line storage, display, and communications will,
by the year 2000, make it economically possible to place the entire
contents of a library on-line, accessible from computer workstations
located anywhere. [Saltzer1] Many engineering issues arise when
building such a library system. One of them is how to preserve the
digital data stored in the system for several decades: a period of
time longer than the lifetime of the storage media.
What could go wrong with the data in a storage system? Something goes
wrong (a fault occurs) when the stored data cannot be accessed or when
the data returned by the system is incorrect. There are many causes
of faults, but they can be broadly categorized as follows [Gray1]:
environments, operations, maintenance, hardware, software, and
process. Examples for each category are listed in Table 1 for
concreteness.
Category Examples
Environments Power failure, weather, earthquakes, fires.
Operations Crucial processes killed, system configured
improperly.
Maintenance Good disks with data taken out to be repaired instead
of bad disks.
Hardware Disks fail, Processor fails.
Software Programs maliciously mutate or destroy data.
Process Shutdown due to administrative decision.
Table 1: Categories of Fault Causes And Their Examples
How can faults be tolerated? Data are replicated so that damage
caused by faults can be repaired. Data can be replicated using
different mechanisms including tape backup and RAID. However, tape
backup does not scale and is prone to operational faults, while RAID
cannot tolerate environmental faults such as a hurricane or
earthquake. Most faults, including environmental faults, can be
tolerated if the failure modes of the replicas are independent.
Independence is achieved by minimizing commonalties among the replicas
so that a fault on a replica cannot affect other replicas. For
example, an earthquake cannot affect more than one replica if all the
replicas are located on different tectonic plates. Verifying the data
on the replicas, repairing the bad data, and propagating changes of
data to the entire system are the responsibilities of
storage-replication service.
2. Storage-Replication Service
Why do we have to design a new storage-replication service when a few
designs [Liskov1, Satyanarayanan1, Page1, Hisgen1, Golding1, Golding2]
have already been done? This is because the demands on the
replication service in L2000 are different from other systems that
researchers have considered. Three points are discussed, the first
being the most important and most detailed.
First, the frequencies of expected events in L2000 system are
different, given the volume and the lifetime of data. Table 2
contrasts the relative frequencies of expected events in most other
file systems with those in L2000. As shown in the table, in the L2000
system, the rate of data decay is expected to equal or exceed data
addition. Therefore, we must emphasize discovering and repairing data
decay, while other systems emphasize, for example, high performance in
obtaining the consensus on data update and/or addition.
Events Other File Systems L2000
Reading data Most frequent Most frequent
Updating data Second most frequent Rare
Adding Data Less frequent Less frequent
Data spontaneously decayed Rare Less frequent
Table 2: Relative Frequencies of Expected Events
Second, the data on L2000 system are expected to last for decades and
to survive the upgrade in storage technologies. The data on the old
storage must be copiedcorrectly and automaticallyto the new storage
when the new storage is installed. In contrast, other file systems
usually require manual copy when the new storage is installed. Third,
the L2000 design should be simple so that the correctness of design is
obvious or can be verified. This property ensures that the data on
the system will survive unaltered through decades.
In the next subsection, the criteria for design are presented. The
possible scope of the 6.199 project then follows. The next two
subsections then present the issues that will remain unaddressed in
the project, and sketch the design of the storage-replication service.
The last subsection details the schedule by which parts of the project
will be completed, and the resources needed to complete the project.
2.1 Criteria for Design [Library2000-1]
2.2 Proposed 6.199 Project
The proposal is to build a replication-storage system that adheres to
the above design criteria as closely as possible. However, because
the scope of an undergraduate project is limited, there are issues
that will remain unexamined. Moreover, the design space may not be
explored thoroughly and the proof of correctness for design and
implementation may not be done.
The system will have the following features. The user will be able to
put data on the system, but the data is unreliable until a unit of
time has passed. After the period, the user can check if the data
remains on the system. If it does, it is highly probable that the
data has been replicated and is reliable; otherwise, the user should
resubmit the data. Once the data has become reliable, the user can
retrieve the data from the system. Also, The system will simulate
faults so that the behavior of the system in reaction to faults can be
observed within any length of time.
2.3 Issues That Will Not be Addressed in The Project
2.4 Design Sketch
There are roughly four parts in the project: verification, repair,
update propagation, and fault simulations. Each of them except fault
simulations is sketched briefly as follows.
2.4.1 Verification
There are two levels of data verification: local verification and
remote verification. Local verification is done by having a daemon
compute checksums of all data every PLOCAL period and compares the
computation of two successive periods. Local verification will be
able to detect most faults excluding the following: some operational
faults (e.g., operator kills the verification daemon), catastrophic
faults (e.g., fire or earthquake), and faults caused by data updates
on replicas that receive updates from other replicas instead of
clients. Remote verification is done by having a replica verify the
data of another replica every PREMOTE period. The replica randomly
picks another replica to verify so that central authority is not
needed. Remote verification will detect most faults that local
verification detects and other faults that local verification cannot
detect. These include when the remote replica has not gotten the
update, when it is unresponsive to requests, and when a large number
of files are corrupted (implying that replication service of the
remote replica malfunctions).
2.4.2 Repair
There are two types of repair: repair that needs synchronous
communication with another replica, and repair that needs synchronous
communications with a quorum of replicas. Only one other replica is
needed when the checksum of the data is known but the data is
corrupted or unavailable. A quorum of replicas is needed when the
checksum of the file is unknown or when another replica requests the
repair. From the quorum, the correct version of data is established
by majority voting using Paxos consensus algorithm [6.826-2]. The
algorithm must be modified to work with quorum consensus.
2.4.3 Update
Updates are propagated to the replicas through verification and repair
mechanisms. The verification mechanism detects inconsistencies on
replicas and the repair mechanism removes the inconsistencies. Hence,
more complicated update propagation, such as rumor mongery [Shroeder1]
and anti-atrophy sessions [Golding1, Golding2], are not needed. For
updates to be effective, the client must send updates to at least a
majority of the replicas. After PUPDATE period, if the updates are
reflected on a majority of the replicas, it is highly probable that
the updates have become reliable. The scheme, however, may require
that the assignment of the identifier for the data, such as a file
name, be centrally controlled.
3. Epilogue
The objective of the design is to come up with a system in which data
persists through time and technological progress in spite of faults.
The proposed design does not address many issues because of the time
constraints, and may not work because there may be some over-looked
issues. However, I hope that through simplicity, it will be obvious
in the implementation if the design works.
Bibliography
[6.826-1] 6.826 Class Handout #47. Replication Techniques. MIT
Laboratory of Computer Science. November 22, 1993.
[6.826-2] 6.826 Class Handout #42. Consensus. MIT Laboratory of
Computer Science. November 15, 1993.
[Golding1] Richard A. Golding. A weak-consistency architecture for
distributed information services. Technical Report UCSC-CRL-92-31 (6
July 1992). Concurrent Systems Laboratory, University of California
at Santa Cruz.
[Golding2] Richard A. Golding and Darrell D. E. Long. The Performance
of Weak-consistency Replication Protocols. Technical Report
UCSC-CRL-92-30 (6 July 1992). Concurrent Systems Laboratory,
University of California at Santa Cruz.
[Gray1] Jim Gray and Andreas Reuter, Fault Tolerance, from Transaction
Processing: Concepts and Techniques, Morgan Kaufmann, 1993, pp.
93-156.
[Hisgen1] Hisgen, Andy, et al. Granularity and Semantic Level of
Replication in the Echo Distributed File System, Proceedings of the
Workshop on Management of replicated Data, Houston, Texas (November 8,
1990), pp. 2-4.
[Library2000-1] Storage Service Replication Design Thoughts, Ideas
from Mitchell Charity, Win Treese, and Jerry Saltzer, Library 2000,
Draft of 1/7/93.
[Library2000-2] Semantics of the Library 2000 Network Storage
Service, Version of March 9, 1993, Notes from meetings and discussion
attended by Mitchell Charity, Quinton Zondervan, Manish Mazumdar, Rob
Miller, Thomas Lee, Win Treese, Ron Weiss, and J. H. Saltzer.
[Liskov1] Barbara Liskov, et al. Replication in the Harp File System,
MIT LCS Technical Report MIT/LCS/TM-456. August 1991.
[Page1] Page, Thomas W., Jr., et al. Architecture of the Ficus
Scaleable Replicated File System. UCLA Computer Science Department
Technical Report CSD-910005, March 1991.
[Saltzer1] Jerome H. Saltzer. LIBRARY 2000, A research
prototype of the on-line electronic library of tomorrow. October
31, 1991.
[Satyanarayanan1] Mahadev Satyanarayanan, et al. Coda: A Highly
Available File System for a Distributed Workstation Environment, IEEE
Transactions on Computers 39, 4 (April 1990), pp. 447-459.
[Schroeder1] Michael D. Schroeder, Andrew D. Birrell, and Roger M.
Needham. Experience with Grapevine: the growth of a distributed
system. ACM Transactions on Computer systems, 2(1):3-23 (February
1984).
Links updated 11/23/97 by jhs
Return to Library 2000 home page.