Survive Protocol

This document describes the election protocol survive. It is intended to select a single active instance from a cluster of instances with fast convergence if the active instance fails. All members of a cluster are well-known and must be able to reach each other bidirectionally to prevent the cluster from splitting up.

Terms and Definitions

Term	Definition
FSM	finite-state machine
check	a condition that is continuously tested on a node
cluster	a group of instances
instance	a survive FSM running on a node
node	a host with instances of independant clusters

Design

Survive is designed to ensure that a single active instance is selected from a cluster of instances. As there is no quorum-like mechanism used it is acceptable that a cluster of instances may partition into multiple clusters, each cluster electing an active instance.

In the worst case a cluster of n instances can partition into n clusters, each with a single active instance. An application which relies on the active instance selection must be able to handle such situations gracefully or implement additional locking.

Overview

A cluster of instances acting under a survice protocol instance are identified by:

an unique instance ID
(uint64)
a PSK used for HMAC-based authentication
reachability information of instances in a cluster
(IP addresses and port)

The instances needs to comply the following assumptions:

they may vanish at any time (i.e. due to node failures)
they will not make Byzantine failures
they can communicate with each other over IP, although the communication may not be always reliable

State Engine

A survice instance implements the following FSM:

stateDiagram-v2 [*] --> INIT INIT --> STANDBY INIT --> FAILED STANDBY --> ACTIVE STANDBY --> FAILED ACTIVE --> STANDBY ACTIVE --> FAILED FAILED --> INIT FAILED --> [*]

INIT

This is the initial state when a survive protocol instance starts.

Possible transitions to:

FAILED: when any critical check fails
STANDBY: when all critical check passes

STANDBY

In this state the instance will participate in cluster communitcation and is ready to take over if the active instance fails.

Possible transitions to:

FAILED: when any critical check fails
ACTIVE: when this instance is elected

ACTIVE

In this state the instance is the active instance of the cluster.

Possible transitions to:

FAILED: when any critical check fails
STANDBY: when this instance is preempted by another instance.

FAILED

This state is reached when a critical check does not pass or the instance is going to shutdown. The instances on other nodes get a notification that the instance has left the cluster for faster convergence time.

Possible transitions to:

INIT: when all critical checks have been passed
termination: during instance shutdown