Storage-Replication Service in Library 2000

Komkit Tukovinit
kom@mit.edu

This proposal fulfills the first part of the requirements of the 6.199 Advanced Undergraduate Project. The proposed project is to build a scaled-down, working version of the storage-replication service that will be used in the Library 2000 (L2000) project. The completed system will be able to receive and store data reliably. Moreover, a mechanism will be built to simulate faults that can occur within the storage system.

1. Introduction

The technology of on-line storage, display, and communications will, by the year 2000, make it economically possible to place the entire contents of a library on-line, accessible from computer workstations located anywhere. [Saltzer1] Many engineering issues arise when building such a library system. One of them is how to preserve the digital data stored in the system for several decades: a period of time longer than the lifetime of the storage media.

What could go wrong with the data in a storage system? Something goes wrong (a fault occurs) when the stored data cannot be accessed or when the data returned by the system is incorrect. There are many causes of faults, but they can be broadly categorized as follows [Gray1]: environments, operations, maintenance, hardware, software, and process. Examples for each category are listed in Table 1 for concreteness.

Category	Examples
Environments	Power failure, weather, earthquakes, fires.
Operations	Crucial processes killed, system configured improperly.
Maintenance	Good disks with data taken out to be repaired instead of bad disks.
Hardware	Disks fail, processors fail.
Software	Programs maliciously mutate or destroy data.
Process		Shutdown due to administrative decision.

Table 1: Categories of Fault Causes and Their Examples

How can faults be tolerated? Data are replicated so that damage caused by faults can be repaired. Data can be replicated using different mechanisms, including tape backup and RAID. However, tape backup does not scale and is prone to operational faults, while RAID cannot tolerate environmental faults such as a hurricane or an earthquake. Most faults, including environmental faults, can be tolerated if the failure modes of the replicas are independent. Independence is achieved by minimizing commonalities among the replicas, so that a fault in one replica cannot affect the others. For example, an earthquake cannot affect more than one replica if all the replicas are located on different tectonic plates. Verifying the data on the replicas, repairing bad data, and propagating changes of data to the entire system are the responsibilities of the storage-replication service.

2. Storage-Replication Service

Why do we have to design a new storage-replication service when a few designs [Liskov1, Satyanarayanan1, Page1, Hisgen1, Golding1, Golding2] have already been done? This is because the demands on the replication service in L2000 are different from those on the systems that other researchers have considered. Three points are discussed below, the first being the most important and the most detailed.

First, the frequencies of expected events in the L2000 system are different, given the volume and the lifetime of the data. Table 2 contrasts the relative frequencies of expected events in most other file systems with those in L2000. As shown in the table, in the L2000 system the rate of data decay is expected to equal or exceed the rate of data addition. Therefore, we must emphasize discovering and repairing data decay, while other systems emphasize, for example, high performance in reaching consensus on data updates and/or additions.

Events				Other File Systems	L2000
Reading data			Most frequent		Most frequent
Updating data			Second most frequent	Rare
Adding data			Less frequent		Less frequent
Data spontaneously decayed	Rare			Less frequent

Table 2: Relative Frequencies of Expected Events

Second, the data in the L2000 system are expected to last for decades and to survive upgrades in storage technology. The data on the old storage must be copied, correctly and automatically, to the new storage when the new storage is installed. In contrast, other file systems usually require a manual copy when new storage is installed. Third, the L2000 design should be simple, so that the correctness of the design is obvious or can be verified. This property helps ensure that the data on the system will survive unaltered through decades.

In the next subsection, the criteria for design are presented. The possible scope of the 6.199 project then follows. The next two subsections then present the issues that will remain unaddressed in the project, and sketch the design of the storage-replication service. The last subsection details the schedule by which parts of the project will be completed, and the resources needed to complete the project.

2.1 Criteria for Design [Library2000-1]

  • Works on standard operating systems, file systems, disk storage systems, and networking protocols.
  • Replica data may be weakly consistent [Golding1, Golding2, 6.826-1], and updates may be unavailable for a bounded time.
  • Works with systems holding at least 1,000,000 Gbytes of data.
  • Should be kept simple by comprising a small number of independent mechanisms whose correctness can be verified.
  • Lifetime of data exceeds the lifetime of any storage medium and the mean time between major disasters.
  • The data on the system survive if fewer than N/2 replicas suffer simultaneous faults, where N is the total number of replicas.

2.2 Proposed 6.199 Project

The proposal is to build a storage-replication system that adheres to the above design criteria as closely as possible. However, because the scope of an undergraduate project is limited, some issues will remain unexamined. Moreover, the design space may not be explored thoroughly, and proofs of correctness for the design and the implementation may not be completed.

The system will have the following features. The user will be able to put data on the system, but the data is unreliable until a unit of time has passed. After that period, the user can check whether the data remains on the system. If it does, it is highly probable that the data has been replicated and is reliable; otherwise, the user should resubmit the data. Once the data has become reliable, the user can retrieve it from the system. Also, the system will simulate faults so that the behavior of the system in reaction to faults can be observed within a practical length of time.
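To make this concrete, here is a minimal Python sketch of the client-side workflow. The Replica class, the function names, and the settling period are hypothetical stand-ins chosen for illustration; they are not part of the proposed design.

    import time

    SETTLE_PERIOD = 1  # the "unit of time" after which data may be trusted; illustrative value

    class Replica:
        """Toy in-memory stand-in for the storage-replication service."""
        def __init__(self):
            self.store = {}

        def put(self, name, data):
            self.store[name] = data

        def get(self, name):
            return self.store.get(name)

    def submit_and_confirm(service, name, data):
        """Put data on the service, wait the settling period, then check
        whether the data is still there; if not, the caller should resubmit."""
        service.put(name, data)
        time.sleep(SETTLE_PERIOD)
        return service.get(name) == data

    if __name__ == "__main__":
        service = Replica()
        if submit_and_confirm(service, "doc-0001", b"scanned page of a document"):
            print("data is probably replicated and reliable")
        else:
            print("data was lost during the settling period; resubmit")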

2.3 Issues That Will Not Be Addressed in the Project

  • The "what's new" mechanism [Library2000-1].
  • The mapping from identifiers of documents to the data [Library2000-2]. The proposed system will use file names as identifiers for the data.
  • An algorithm that calculates good checksums of objects that contain large amounts of data.
  • Group membership protocol. The proposed system will have fixed group membership.
  • The property that updates are either ignored or become reliable in a bounded time. The design has the property that if, after a bounded time, the updates are reflected on a majority of the replicas, it is highly probable that the updates are reliable.
  • Comparison to or use of update propagation schemes that rely on rumor mongering [Schroeder1] or anti-entropy sessions [Golding1, Golding2], which may be more effective than the design sketched below.

2.4 Design Sketch

The project has roughly four parts: verification, repair, update propagation, and fault simulation. Each of these except fault simulation is sketched briefly below.

2.4.1 Verification

There are two levels of data verification: local verification and remote verification. Local verification is done by having a daemon compute checksums of all data every P_LOCAL period and compare the checksums of two successive periods. Local verification will be able to detect most faults, excluding the following: some operational faults (e.g., an operator kills the verification daemon), catastrophic faults (e.g., fire or earthquake), and faults caused by data updates on replicas that receive updates from other replicas instead of from clients. Remote verification is done by having a replica verify the data of another replica every P_REMOTE period. The replica randomly picks another replica to verify, so that a central authority is not needed. Remote verification will detect most faults that local verification detects, as well as faults that local verification cannot detect. These include cases in which the remote replica has not received an update, is unresponsive to requests, or has a large number of corrupted files (implying that the replication service of the remote replica is malfunctioning).
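As an illustration, the following Python sketch shows one way the local-verification daemon might be structured. The directory layout, the state file, the MD5 checksum, and the value of P_LOCAL are assumptions made for the example, not decisions of the design; note that a legitimate update would also change a checksum, so the real update mechanism would have to record the new checksum before the next verification pass.

    import hashlib
    import json
    import os
    import time

    P_LOCAL = 24 * 60 * 60                 # verification period in seconds; illustrative value
    DATA_DIR = "/srv/l2000/data"           # hypothetical location of this replica's files
    STATE_FILE = "/srv/l2000/checksums"    # checksums recorded by the previous pass

    def checksum(path):
        """Compute a checksum of one file (MD5 is used only as an example)."""
        digest = hashlib.md5()
        with open(path, "rb") as f:
            for block in iter(lambda: f.read(65536), b""):
                digest.update(block)
        return digest.hexdigest()

    def verify_once():
        """One local-verification pass: recompute every checksum and compare
        it with the checksum recorded on the previous pass."""
        try:
            with open(STATE_FILE) as f:
                previous = json.load(f)
        except FileNotFoundError:
            previous = {}

        current = {name: checksum(os.path.join(DATA_DIR, name))
                   for name in os.listdir(DATA_DIR)}

        for name, old_sum in previous.items():
            if name not in current:
                print("suspected fault: %s has disappeared" % name)
            elif current[name] != old_sum:
                print("suspected fault: %s has decayed" % name)

        with open(STATE_FILE, "w") as f:
            json.dump(current, f)

    if __name__ == "__main__":
        while True:                        # daemon loop
            verify_once()
            time.sleep(P_LOCAL)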

2.4.2 Repair

There are two types of repair: repair that needs synchronous communication with one other replica, and repair that needs synchronous communication with a quorum of replicas. Only one other replica is needed when the checksum of the data is known but the data is corrupted or unavailable. A quorum of replicas is needed when the checksum of the file is unknown or when another replica requests the repair. From the quorum, the correct version of the data is established by majority voting using the Paxos consensus algorithm [6.826-2]. The algorithm must be modified to work with quorum consensus.
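A minimal Python sketch of the quorum step follows. It simply takes the checksum reported by a majority of all N replicas as the correct version; it stands in for the modified Paxos-style consensus mentioned above rather than implementing it, and the replica names and checksums are made up for the example.

    from collections import Counter

    def majority_version(reports, n_replicas):
        """Given {replica name: checksum} reports from a quorum, return the
        checksum held by more than half of all n_replicas, or None if no
        version has a majority."""
        if not reports:
            return None
        value, votes = Counter(reports.values()).most_common(1)[0]
        return value if votes > n_replicas // 2 else None

    # Example: five replicas, one of which holds a decayed copy of the file.
    reports = {
        "replica-a": "c0ffee",
        "replica-b": "c0ffee",
        "replica-c": "c0ffee",
        "replica-d": "deadbe",   # corrupted copy
        "replica-e": "c0ffee",
    }
    print(majority_version(reports, n_replicas=5))
    # prints "c0ffee"; any replica holding that version can supply the repair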

2.4.3 Update

Updates are propagated to the replicas through the verification and repair mechanisms. The verification mechanism detects inconsistencies among replicas, and the repair mechanism removes them. Hence, more complicated update propagation schemes, such as rumor mongering [Schroeder1] and anti-entropy sessions [Golding1, Golding2], are not needed. For updates to become effective, the client must send the updates to at least a majority of the replicas. After a P_UPDATE period, if the updates are reflected on a majority of the replicas, it is highly probable that the updates have become reliable. The scheme, however, may require that the assignment of the identifier for the data, such as a file name, be centrally controlled.
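The client side of this rule can be sketched as follows; the Replica class is the same in-memory stand-in used in the earlier sketches, and P_UPDATE, the function names, and the failure handling are assumptions for illustration only.

    P_UPDATE = 60 * 60   # settling period in seconds; illustrative value

    class Replica:
        """Toy in-memory stand-in for one storage replica."""
        def __init__(self):
            self.store = {}
        def put(self, name, data):
            self.store[name] = data
        def get(self, name):
            return self.store.get(name)

    def send_update(replicas, name, data):
        """Send an update to every replica and return how many accepted it.
        Unreachable replicas are skipped; verification and repair will
        propagate the update to them later."""
        accepted = 0
        for replica in replicas:
            try:
                replica.put(name, data)
                accepted += 1
            except OSError:
                pass
        return accepted

    def probably_reliable(replicas, name, data):
        """After P_UPDATE has elapsed, the update is probably reliable if a
        majority of the replicas hold it; otherwise the client should resend."""
        holding = sum(1 for replica in replicas if replica.get(name) == data)
        return holding > len(replicas) // 2

    if __name__ == "__main__":
        replicas = [Replica(), Replica(), Replica()]
        # Suppose the update reaches only the first two replicas (the third is down).
        send_update(replicas[:2], "doc-0001", b"corrected catalogue entry")
        # ... P_UPDATE elapses ...
        print(probably_reliable(replicas, "doc-0001", b"corrected catalogue entry"))  # True: 2 of 3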

3. Epilogue

The objective of the design is to produce a system in which data persist through time and technological progress in spite of faults. The proposed design leaves many issues unaddressed because of time constraints, and it may not work because some issues may have been overlooked. However, I hope that, through simplicity, it will be obvious from the implementation whether the design works.

Bibliography

[6.826-1] 6.826 Class Handout #47. Replication Techniques. MIT Laboratory for Computer Science. November 22, 1993.

[6.826-2] 6.826 Class Handout #42. Consensus. MIT Laboratory for Computer Science. November 15, 1993.

[Golding1] Richard A. Golding. A weak-consistency architecture for distributed information services. Technical Report UCSC-CRL-92-31 (6 July 1992). Concurrent Systems Laboratory, University of California at Santa Cruz.

[Golding2] Richard A. Golding and Darrell D. E. Long. The Performance of Weak-consistency Replication Protocols. Technical Report UCSC-CRL-92-30 (6 July 1992). Concurrent Systems Laboratory, University of California at Santa Cruz.

[Gray1] Jim Gray and Andreas Reuter. Fault Tolerance, from Transaction Processing: Concepts and Techniques, Morgan Kaufmann, 1993, pp. 93-156.

[Hisgen1] Andy Hisgen, et al. Granularity and Semantic Level of Replication in the Echo Distributed File System. Proceedings of the Workshop on Management of Replicated Data, Houston, Texas (November 8, 1990), pp. 2-4.

[Library2000-1] Storage Service Replication Design Thoughts. Ideas from Mitchell Charity, Win Treese, and Jerry Saltzer, Library 2000, Draft of 1/7/93.

[Library2000-2] Semantics of the Library 2000 Network Storage Service. Version of March 9, 1993. Notes from meetings and discussions attended by Mitchell Charity, Quinton Zondervan, Manish Mazumdar, Rob Miller, Thomas Lee, Win Treese, Ron Weiss, and J. H. Saltzer.

[Liskov1] Barbara Liskov, et al. Replication in the Harp File System. MIT LCS Technical Report MIT/LCS/TM-456, August 1991.

[Page1] Thomas W. Page, Jr., et al. Architecture of the Ficus Scaleable Replicated File System. UCLA Computer Science Department Technical Report CSD-910005, March 1991.

[Saltzer1] Jerome H. Saltzer. LIBRARY 2000, A research prototype of the on-line electronic library of tomorrow. October 31, 1991.

[Satyanarayanan1] Mahadev Satyanarayanan, et al. Coda: A Highly Available File System for a Distributed Workstation Environment. IEEE Transactions on Computers 39, 4 (April 1990), pp. 447-459.

[Schroeder1] Michael D. Schroeder, Andrew D. Birrell, and Roger M. Needham. Experience with Grapevine: the growth of a distributed system. ACM Transactions on Computer Systems 2(1):3-23 (February 1984).

