simple-cluster-heartbeat

Python script to have a singleton service running in a cluster


Keywords
cluster, heartbeat, singleton
License
Apache-2.0
Install
pip install simple-cluster-heartbeat

Documentation

Abstract

Goal is to have a service running on exactly one machine in a two-member cluster. I did not find something like this so I just created it. It is certainly not full-blown (only two cluster members, for example).

Implementation

Generally, the two cluster members are peers and make out the supervisor and worker roles on their own.

Process detection is done via TCP connections. The script allows for the configuration of the command to run, the intervals where peers check their counterpart, and a fallback mode that pulls the command to a preferred machine.

An instance of the script is started on each host. One is elected as the worker, executing a command in a loop. The other instance of the script loops and checks if the worker is still available (by connecting to a network port). Once it detects that the worker is down, it assumes the worker role, starts its own network listener and executes the command in a loop.

Supervisor and Worker election is based on multiple connection attempts to the peer, in random intervals.

This should work on Python 2.6.3+, no other dependecies required. Tested on Windows, Linux, Solaris.

Usage

Usage: ha_heartbeat.py [options] command

Options:
  -h, --help            show this help message and exit
  -l LOCAL_HOST_PORT, --local=LOCAL_HOST_PORT
                        local host and port in <host>:<port> format (default:
                        localhost:22221)
  -r REMOTE_HOST_PORT, --remote=REMOTE_HOST_PORT
                        remote host and port in <host>:<port> format (default:
                        localhost:22222)
  -t INTERVAL_TEST, --interval-test=INTERVAL_TEST
                        Maximum interval between peer checks (default: 10)
  --check-count=CHECK_COUNT
                        Number of times a supervisor will check a dead peer
                        before failing over. (default: 3)
  -i INTERVAL_COMMAND, --interval-commands=INTERVAL_COMMAND
                        Interval between command executions in seconds
                        (default: 30)
  -w WAIT_FOR_SERVERSOCKET, --wait-for-serversocket=WAIT_FOR_SERVERSOCKET
                        Wait seconds for the serversocket to become available
                        (default: 300)
  -m MODE, --mode=MODE  Start process in WORKER or SUPERVISOR mode (default:
                        SUPERVISOR)
  -f, --fallback        Fallback to peer once it is up again (default: False)
  -v, --verbose         Verbose output

Examples

Here are two usage examples:

  • Heartbeat command ("dir c:") executed every 30 seconds, pinned to one machine (m2), i.e., m1 takes over if m2 is down, as m2 comes back, m1 becomes supervisor again:
   h@m1 $ python ha_heartbeat.py -l m1:22221 -r m2:22222 -f "dir c:"
   h@m2 $ python ha_heartbeat.py -l m2:22222 -r m1:22221 -m WORKER "dir c:"
  • Heartbeat command ("ls -l") executed every 10 seconds, peer status checked every 5 seconds, no fallback once the supervisor takes over from a failed worker:
   h@m1 $ python ha_heartbeat.py -i 10 -t 5 -l m1:22221 -r m2:22222 "ls -l"
   h@m2 $ python ha_heartbeat.py -i 10 -t 5 -l m2:22222 -r m1:22221 "ls -l"