Fault-Tolerant Systems

General Description

Computers and networks are increasingly used in critical applications, where system failures can be expensive or even catastrophic. Example applications include aircraft fly-by-wire control, automobile control, medical systems, spacecraft, and databases in a wide variety of financial and enterprise applications. The overall reliability expected of a computer system in these applications far exceeds that of any individual computer. This course is about how to build a highly reliable system that continues to function acceptably even after a number of its components (hardware or software) have failed.
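
As a rough illustration of how a redundant system can be more reliable than any one of its computers, the sketch below computes the reliability of a k-of-n arrangement such as triple modular redundancy (2-of-3 with majority voting). It is not taken from the course slides. The function names are hypothetical, and the model assumes that modules fail independently and that the voter never fails.

```python
from math import comb


def k_of_n_reliability(r: float, k: int, n: int) -> float:
    """Probability that at least k of n modules work.

    Assumes each module works with probability r, independently of the
    others, and that whatever combines their outputs is perfect.
    """
    return sum(comb(n, i) * r**i * (1 - r) ** (n - i) for i in range(k, n + 1))


def tmr_reliability(r: float) -> float:
    """Triple modular redundancy: the system works if 2 of 3 modules work."""
    return k_of_n_reliability(r, k=2, n=3)


if __name__ == "__main__":
    for r in (0.99, 0.9, 0.5, 0.4):
        print(f"module reliability {r:.2f} -> TMR reliability {tmr_reliability(r):.4f}")
```

Under these assumptions a module reliability of 0.9 gives a system reliability of 0.972. The same formula shows that redundancy is not automatically a gain: at 0.5 the triplicated system is no better than a single module, and below 0.5 it is worse.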

Main Topics
  • Introduction to fault tolerance.
  • Measures of fault-tolerance.
  • Exploiting and managing redundancy in:
    • Hardware.
    • Software.
    • Time.
    • Data.
  • Network fault tolerance.
  • Issues in distributed systems.
    • Byzantine generals algorithm.
    • Fault-tolerant clock synchronization.
    • Reliable remote procedure calls.
  • Reliability evaluation techniques.
Slides: HWFT Part 1
Slides: HWFT Part 2
Slides: HWFT Part 3
Slides: Networks Part 1 
Slides: Networks Part 2 
Slides: Networks Part 3 
Slides: Data Replication
Slides: Checkpointing Part 1 
Slides: Checkpointing Part 2 
Slides: Checkpointing Part 3 
Slides: Coding
Slides: Coding Part 2
Slides: Software Fault Tolerance Part 1
Slides: Software Fault Tolerance Part 2
Byzantine Generals Algorithm
Slides: Byzantine Generals algorithm