Supervisor Health Checks
Framework to build health checks for Supervisor-based services.
Health check programs are supposed to run as event listeners in Supervisor environment. On check failure Supervisor will attempt to restart monitored process.
Here's typical configuration example:
[eventlistener:example_check]
command=python <path_to_supervisor_check_program>
stderr_logfile = /var/log/supervisor/supervisor_example_check-stderr.log
stdout_logfile = /var/log/supervisor/supervisor_example_check-stdout.log
events=TICK_60
Here's the list of check programs package provides out-of-box:
- supervisor_http_check: process check based on HTTP query.
- supervisor_tcp_check: process check based on TCP connection status.
- supervisor_xmlrpc_check: process check based on call to XML RPC server.
- supervisor_memory_check: process check based on amount of memory consumed by process.
- supervisor_cpu_check: process check based on CPU percent usage within time interval.
- supervisor_complex_check: complex check(run multiple checks at once).
For now, it is developed and supposed to work primarily with Python 3 and Supervisor 4 branch. There's nominal Python 2.x support but it's not tested.
Developing Custom Check Modules
While framework provides the good set of ready-for-use health check classes, it can be easily extended by adding application-specific custom health checks.
To implement custom check class, check_modules.base.BaseCheck class must be inherited:
class BaseCheck(object):
"""Base class for checks.
"""
NAME = None
def __call__(self, process_spec):
"""Run single check.
:param dict process_spec: process specification dictionary as returned
by SupervisorD API.
:return: True is check succeeded, otherwise False. If check failed -
monitored process will be automatically restarted.
:rtype: bool
"""
def _validate_config(self):
"""Method may be implemented in subclasses. Should return None or
raise InvalidCheckConfig in case if configuration is invalid.
Here's typical example of parameter check:
if 'url' not in self._config:
raise errors.InvalidCheckConfig(
'Required `url` parameter is missing in %s check config.' % (
self.NAME,))
"""
Here's the example of adding custom check:
from supervisor_checks.check_modules import base
from supervisor_checks import check_runner
class ExampleCheck(base.BaseCheck):
NAME = 'example'
def __call__(self, process_spec):
# Always return True
return True
if __name__ == '__main__':
check_runner.CheckRunner(
'example_check', 'some_process_group', [(ExampleCheck, {})]).run()
Out-of-box checks
HTTP Check
Process check based on HTTP query.
CLI
$ /usr/local/bin/supervisor_http_check -h
usage: supervisor_http_check [-h] -n CHECK_NAME -g PROCESS_GROUP -u URL -p
PORT [-t TIMEOUT] [-r NUM_RETRIES]
Run HTTP check program.
optional arguments:
-h, --help show this help message and exit
-n CHECK_NAME, --check-name CHECK_NAME
Health check name.
-g PROCESS_GROUP, --process-group PROCESS_GROUP
Supervisor process group name.
-u URL, --url URL HTTP check url
-p PORT, --port PORT HTTP port to query. Can be integer or regular
expression which will be used to extract port from a
process name.
-t TIMEOUT, --timeout TIMEOUT
Connection timeout. Default: 15
-r NUM_RETRIES, --num-retries NUM_RETRIES
Connection retries. Default: 2
Configuration Examples
Query process running on port 8080 using URL /ping:
[eventlistener:example_check]
command=/usr/local/bin/supervisor_http_check -g example_service -n example_check -u /ping -t 30 -r 3 -p 8080
events=TICK_60
Query process group using URL /ping. Each process is listening on it's own port. Each process name is formed as some-process-name_port so particular port number can be extracted using regular expression:
[eventlistener:example_check]
command=/usr/local/bin/supervisor_http_check -g example_service -n example_check -u /ping -t 30 -r 3 -p ".+_(\\d+)"
events=TICK_60
TCP Check
Process check based on TCP connection status.
CLI
$ /usr/local/bin/supervisor_tcp_check -h
usage: supervisor_tcp_check [-h] -n CHECK_NAME -g PROCESS_GROUP -p PORT
[-t TIMEOUT] [-r NUM_RETRIES]
Run TCP check program.
optional arguments:
-h, --help show this help message and exit
-n CHECK_NAME, --check-name CHECK_NAME
Check name.
-g PROCESS_GROUP, --process-group PROCESS_GROUP
Supervisor process group name.
-p PORT, --port PORT TCP port to query. Can be integer or regular
expression which will be used to extract port from a
process name.
-t TIMEOUT, --timeout TIMEOUT
Connection timeout. Default: 15
-r NUM_RETRIES, --num-retries NUM_RETRIES
Connection retries. Default: 2
Configuration Examples
Connect to process running on port 8080:
[eventlistener:example_check]
command=/usr/local/bin/supervisor_tcp_check -g example_service -n example_check -t 30 -r 3 -p 8080
events=TICK_60
Query process group when each process is listening on it's own port. Each process name is formed as some-process-name_port so particular port number can be extracted using regular expression:
[eventlistener:example_check]
command=/usr/local/bin/supervisor_tcp_check -g example_service -n example_check -t 30 -r 3 -p ".+_(\\d+)"
events=TICK_60
XMLRPC Check
Process check based on call to XML RPC server.
CLI
$ /usr/local/bin/supervisor_xmlrpc_check -h
usage: supervisor_xmlrpc_check [-h] -n CHECK_NAME -g PROCESS_GROUP [-u URL]
[-s SOCK_PATH] [-S SOCK_DIR] [-p PORT]
[-r NUM_RETRIES]
Run XML RPC check program.
optional arguments:
-h, --help show this help message and exit
-n CHECK_NAME, --check-name CHECK_NAME
Health check name.
-g PROCESS_GROUP, --process-group PROCESS_GROUP
Supervisor process group name.
-u URL, --url URL XML RPC check url
-s SOCK_PATH, --socket-path SOCK_PATH
Full path to XML RPC server local socket
-S SOCK_DIR, --socket-dir SOCK_DIR
Path to XML RPC server socket directory. Socket name
will be constructed using process name:
<process_name>.sock.
-m METHOD, --method METHOD
XML RPC method name. Default is status
-p PORT, --port PORT Port to query. Can be integer or regular
expression which will be used to extract port from a
process name.
-r NUM_RETRIES, --num-retries NUM_RETRIES
Connection retries. Default: 2
Configuration Examples
Call to process' XML RPC server listening on port 8080, URL /status, RPC method get_status:
[eventlistener:example_check]
command=/usr/local/bin/supervisor_xmlrpc_check -g example_service -n example_check -r 3 -p 8080 -u /status -m get_status
events=TICK_60
Call to process' XML RPC server listening on UNIX socket:
[eventlistener:example_check]
command=/usr/local/bin/supervisor_xmlrpc_check -g example_service -n example_check -r 3 -s /var/run/example.sock -m get_status
events=TICK_60
Call to process group XML RPC servers, listening on different UNIX socket. In such case socket directory must be specified, process socket name will be formed as <process_name>.sock:
[eventlistener:example_check]
command=/usr/local/bin/supervisor_xmlrpc_check -g example_service -n example_check -r 3 -S /var/run/ -m get_status
events=TICK_60
Memory Check
Process check based on amount of memory consumed by process.
CLI
$ /usr/local/bin/supervisor_memory_check -h
usage: supervisor_memory_check [-h] -n CHECK_NAME -g PROCESS_GROUP -m MAX_RSS
[-c CUMULATIVE]
Run memory check program.
optional arguments:
-h, --help show this help message and exit
-n CHECK_NAME, --check-name CHECK_NAME
Health check name.
-g PROCESS_GROUP, --process-group PROCESS_GROUP
Supervisor process group name.
-m MAX_RSS, --msx-rss MAX_RSS
Maximum memory allowed to use by process, KB.
-c CUMULATIVE, --cumulative CUMULATIVE
Recursively calculate memory used by all process
children.
Configuration Examples
Restart process if the total amount of memory consumed by process and all its children is greater than 100M:
[eventlistener:example_check]
command=/usr/local/bin/supervisor_memory_check -n example_check -m 102400 -c -g example_service
events=TICK_60
CPU Check
Process check based on CPU percent usage within specified time interval.
CLI
$ /usr/local/bin/supervisor_cpu_check -h
usage: supervisor_cpu_check [-h] -n CHECK_NAME -g PROCESS_GROUP -p MAX_CPU -i INTERVAL
Run memory check program.
optional arguments:
-h, --help show this help message and exit
-n CHECK_NAME, --check-name CHECK_NAME
Health check name.
-g PROCESS_GROUP, --process-group PROCESS_GROUP
Supervisor process group name.
-p MAX_CPU, --max-cpu-percent MAX_CPU
Maximum CPU percent usage allowed to use by process
within time interval.
-i INTERVAL, --interval INTERVAL
How long process is allowed to use CPU over threshold,
seconds.
Configuration Examples
Restart process when it consumes more than 100% CPU within 30 minutes:
[eventlistener:example_check]
command=/usr/local/bin/supervisor_cpu_check -n example_check -p 100 -i 1800 -g example_service
events=TICK_60
Complex Check
Complex check(run multiple checks at once).
CLI
$ /usr/local/bin/supervisor_complex_check -h
usage: supervisor_complex_check [-h] -n CHECK_NAME -g PROCESS_GROUP -c
CHECK_CONFIG
Run SupervisorD check program.
optional arguments:
-h, --help show this help message and exit
-n CHECK_NAME, --check-name CHECK_NAME
Health check name.
-g PROCESS_GROUP, --process-group PROCESS_GROUP
Supervisor process group name.
-c CHECK_CONFIG, --check-config CHECK_CONFIG
Check config in JSON format
Example configuration
Here's example configuration using memory and http checks:
[eventlistener:example_check]
command=/usr/local/bin/supervisor_complex_check -n example_check -g example_service -c '{"memory":{"cumulative":true,"max_rss":4194304},"http":{"timeout":15,"port":8090,"url":"\/ping","num_retries":3}}'
events=TICK_60
Acknowledgement
This is inspired by Superlance plugin package.
Though, while Superlance is basically the set
of feature-rich health check programs, supervisor_checks
package is mostly focused on providing
the framework to easily implement application-specific health checks of any complexity.
Bug reports
Please file here: https://github.com/vovanec/supervisor_checks/issues
Or contact me directly: vovanec@gmail.com