Byzantine Fault Tolerance

In distributed computer systems, Byzantine Fault Tolerance is a characteristic of a system that tolerates the class of failures known as the Byzantine Generals' Problem; for which there is an unsolvability proof.

The Byzantine Generals' Problem

On July 5th 1982, Leslie Lamport (initial LaTeX developer, Microsoft Researcher and winner of the 2013 Turing Award), Robert Shostak and Marshall Pease published a paper named The Byzantine Generals' Problem.

The group devised a thought experiment for an abstract agreement problem.

They imagined that several divisions of the Byzantine army are camped outside an enemy city, each division commanded by its own general. The generals can communicate with one another only by messenger. After observing the enemy, they must decide upon a common plan of action.

In its simplest form, the generals must only decide whether to attack or retreat. Some generals may prefer to attack, while others prefer to retreat. The important thing is that every general agrees on a common decision, for a half hearted attack by a few generals would be worse than a coordinated attack or a coordinated retreat.

The Byzantine Generals' Problem Illustration

Since it's impossible to know which generals are traitors trying to prevent the loyal generals from reaching agreement, the generals must have an algorithm to guarantee that all loyal generals decide upon the same plan of action and that a small number of traitors cannot cause the loyal generals to adopt a bad plan.

The Traitorous General

If nine generals are voting, four of whom support attacking while four others are in favor of retreat, the ninth general (traitorous general) may send a vote of retreat to those generals in favor of retreat, and a vote of attack to the rest. Those who received a retreat vote from the ninth general will retreat, while the rest will attack.

The Traitorous General

The Traitorous Messenger

To make matters worse, the generals are physically separated and have to send their votes via messengers who may fail to deliver votes or may forge false votes (traitorous messenger).

The Traitorous Messenger

What is Byzantine Failure?

The typical mapping of this story onto computer systems is that the computers are the generals and their digital communication system links are the messengers.

Simply put, a Byzantine Fault is a fault that presents different symptoms to different observers. Similarly, a Byzantine Failure is the loss of a system component due to a Byzantine Fault in a distributed system that requires consensus.

So it stands to reason that the objective of a Byzantine Fault Tolerant system is to be able to defend against Byzantine failures.

A correctly implemented Byzantine Fault Tolerant system should be able to still provide service, assuming that the majority of the components are still healthy.

Achieving Byzantine Fault Tolerance

Several system architectures were designed that implement Byzantine Fault Tolerance. Implementations are very specific to their use case. Nonetheless, there are 2 prominent solutions that these systems may end up implementing:

  1. Unforgeable message signatures. This may be achieved by using Public-Key Cryptography (see Introduction to the Diffie-Hellman Key Exchange).
  2. Atomic Broadcasts. If the message system is such that the command is transmitted
    simultaneously to all participants, then A cannot send a different message to C and B.

These solutions are not mutually exclusive, so systems that need to be highly byzantine fault tolerant usually end up implementing a variation which includes both.

Real World Examples of Byzantine Failure

On NASA's DASHlink, there is a collection of real system failures that they encountered. These web pages also describe some phenomenology that can cause Byzantine faults.
NASA Observed Failures

Byzantine Fault Tolerance in Practice

Bitcoin

Bitcoin, the peer-to-peer cryptocurrency that's based on a blockchain, is Byzantine Fault Tolerant. The bitcoin network works in parallel to generate a chain of Hashcash style proof-of-work (also known as CPU Cost Function, Client Puzzle, Computational Puzzle or CPU Pricing Function). This proof-of-work chain is the key to overcome Byzantine failures and to reach a coherent global view of the system state.

Aircraft

The Boeing 777 Aircraft Information Management System and the flight control system use Byzantine Fault Tolerance. Because these are real-time systems, their Byzantine fault tolerance solutions must have very low latency. In fact, Boeing states that SAFEbus, a standard backplane
bus for commercial avionics, can achieve Byzantine fault tolerance on the order of a microsecond of added latency.

In an Embedded Linux Conference in 2013, SpaceX Engineer Robert Rose, briefly described the SpaceX Dragon flight system, though he couldn't give too many details. He stated that it is a fault-tolerant system in order to satisfy NASA requirements for when it gets close to the ISS.
There are rules about how many faults a craft needs to be able to tolerate and still be allowed to approach the station. It uses triply redundant computers to achieve the required level of fault tolerance. The Byzantine Generals' algorithm is used to handle situations where the computers do not agree.
That situation could come about because of a radiation event changing memory or register values, for example.