Reliable Fault-Recovery for Asynchronous Systems - LCA2 - EPFL

Background:

Fault-tolerant architectures rely on reliable fault-recovery mechanisms for correctness. A reliable fault-recovery mechanism guarantees two things: a) a faulty replica is rebooted within a bounded delay, and b) a non-faulty replica is never wrongly rebooted.

To ensure this, existing fault-recovery mechanisms require that the replicas share a consistent view of some common state, such as globally synchronized time or knowledge of whether each replica is faulty. However, maintaining this kind of information across all replicas at all times is not possible because of unexpected events such as network partitions and replica skew.

To address this issue, we present a reliable fault-recovery mechanism that does not require replica synchronization. Instead of relying on state obtained through synchronization to mark a specific recovery event, our mechanism creates a short-lived session using a 4-way handshake between the recoverer and the recoveree. Each session uniquely marks a recovery event, thereby ensuring that a recoveree is recovered by a particular recoverer only once.
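
The session idea can be illustrated with a minimal sketch. The description above does not specify the handshake messages, so everything below is an assumption made for illustration: the message order (request, reply, confirm, done), the use of random nonces as session identifiers, and the names Recoveree, Recoverer, on_request and on_confirm are all hypothetical. The sketch also ignores message loss, timeouts and the network itself; it only shows how a one-time session prevents the same recovery event from rebooting a replica twice.

```python
import secrets


class Recoveree:
    """Hypothetical replica that can be rebooted by a recoverer."""

    def __init__(self):
        self.pending = {}  # recoverer nonce -> own nonce, one entry per open session
        self.reboots = 0   # counts reboots, standing in for a real reboot

    def on_request(self, recoverer_nonce):
        # Message 2 (assumed): reply with a fresh nonce. The pair of nonces
        # identifies the short-lived session.
        own_nonce = secrets.token_hex(8)
        self.pending[recoverer_nonce] = own_nonce
        return own_nonce

    def on_confirm(self, recoverer_nonce, own_nonce):
        # Message 4 (assumed): reboot only if this exact session is still open,
        # then close it so that a replayed confirmation is rejected.
        if self.pending.get(recoverer_nonce, None) is None:
            return False
        if self.pending[recoverer_nonce] != own_nonce:
            return False
        del self.pending[recoverer_nonce]
        self.reboots += 1
        return True


class Recoverer:
    """Hypothetical replica that triggers the recovery of another replica."""

    def recover(self, recoveree):
        nonce = secrets.token_hex(8)               # message 1: request
        reply = recoveree.on_request(nonce)        # message 2: reply
        done = recoveree.on_confirm(nonce, reply)  # messages 3 and 4: confirm, done
        return (nonce, reply), done


if __name__ == "__main__":
    target = Recoveree()
    session, done = Recoverer().recover(target)
    print("rebooted:", done, "reboots:", target.reboots)
    # Replaying the confirmation of a closed session does not reboot again.
    print("replay accepted:", target.on_confirm(*session))
```

In this sketch no clock or shared fault information is consulted: the decision to reboot depends only on state created by the handshake itself, which is the property the mechanism aims for.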

We compare the correctness properties of this mechanism, namely availability, safety, and liveness, with those of other fault-recovery mechanisms. We also evaluate its average-case and worst-case performance against existing fault-recovery mechanisms.

Project Goals:

Implementing the proposed fault-recovery mechanism.
Deriving its correctness properties: safety, availability, and liveness.
Surveying state-of-the-art fault-recovery mechanisms.
Evaluating the performance of the proposed mechanism with respect to existing mechanisms.

Required skills:

The fault-recovery mechanism can be implemented in the programming language of the student's choice. However, strong programming skills are required.

Supervisors: Wajeb Saab, Maaz Mashood Mohiuddin