Proactive Recovery in Real-Time Mission-Critical Systems

Contact: Maaz Mashood Mohiuddin

Background:

Fault-tolerant architectures run several replicas of a potentially faulty application so that, given enough replicas and a profile of fault arrivals, at least one replica is non-faulty at any instant. Furthermore, previous studies have shown that the likelihood of a replica turning faulty depends on how long it has been running. Hence, proactive recovery techniques have been developed to predict when a replica will turn faulty and to reset it beforehand. While the idea is simple, its execution requires:

  • Reliable prediction of when a replica will turn faulty
  • Reliable estimation of recovery time
  • Ensuring that availability remains unaffected during a recovery

Failing to meet any of these conditions makes proactive recovery counter-productive. The issue is amplified in real-time mission-critical systems, where the margin for error in estimation and prediction is small and the consequences of unavailability are severe. In this project, we aim to design an effective proactive recovery strategy for such systems.
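
To make the three requirements concrete, the following is a minimal C++17 sketch of one possible strategy, a staggered periodic reset; it is an illustration only and not the design this project will produce. It assumes that fault prediction has been reduced to a single safe uptime bound and recovery-time estimation to a single upper bound, both identical for all replicas. The names RecoveryPlan, plan_staggered_recovery, safe_uptime and recovery_time are hypothetical.

```cpp
#include <chrono>
#include <cstddef>
#include <optional>
#include <vector>

using Millis = std::chrono::milliseconds;

// Hypothetical plan: replica i is reset at offsets[i] + k * period, k = 0, 1, 2, ...
struct RecoveryPlan {
    Millis period;                // time between two resets of the same replica
    std::vector<Millis> offsets;  // start time of each replica's first reset
};

// safe_uptime:   assumed bound on how long a replica may run before a reset is due
// recovery_time: assumed upper bound on the duration of one reset
// Returns a plan in which at most one replica is recovering at any instant,
// or std::nullopt if no such staggered plan exists under these assumptions.
std::optional<RecoveryPlan> plan_staggered_recovery(std::size_t replicas,
                                                    Millis safe_uptime,
                                                    Millis recovery_time) {
    if (replicas < 2 || safe_uptime <= Millis::zero() ||
        recovery_time <= Millis::zero()) {
        return std::nullopt;
    }
    const auto n = static_cast<Millis::rep>(replicas);

    // Longest period that still resets each replica before its uptime
    // exceeds safe_uptime: uptime between resets is period - recovery_time.
    const Millis period = safe_uptime + recovery_time;

    // All n recovery windows must fit into one period without overlapping.
    if (recovery_time * n > period) {
        return std::nullopt;
    }

    // Spread the resets evenly; each slot is at least recovery_time long.
    const Millis slot = period / n;
    RecoveryPlan plan{period, {}};
    plan.offsets.reserve(replicas);
    for (Millis::rep i = 0; i < n; ++i) {
        plan.offsets.push_back(slot * i);
    }
    return plan;
}
```

For example, with 3 replicas, a safe uptime of 10 s and a recovery time of 2 s, the sketch yields a period of 12 s with resets starting at 0 s, 4 s and 8 s, so two replicas are always running. With 4 replicas, a safe uptime of 5 s and a recovery time of 2 s, no plan is returned, since four 2 s recoveries do not fit into a 7 s period. A real-time design would additionally need safety margins on both bounds, which this sketch leaves out.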

Project Goals:

  • Review of existing proactive recovery mechanisms
  • Design of a proactive recovery strategy for real-time systems
  • Implementation of the design

Required Skills:

  • C/C++
  • TCP/IP
  • Fault-tolerance

Supervisors: Maaz Mashood Mohiuddin, Wajeb Saab