github.com/jamiealquiza/polymur/output

A fast carbon-relay with live routing controls + https Graphite forwarder


Keywords
carbon-relay, graphite, metrics, monitoring
License
Other
Install
go get github.com/jamiealquiza/polymur/output

Documentation

polymur

Under active testing and development; may have undiscovered bugs. Build this with Go 1.5.x.

Overview

Polymur is a service that accepts Graphite plaintext protocol metrics (LF-delimited messages) from tools like Collectd or Statsd and replicates, round-robins or hash-routes the output to one or more destinations. Polymur is efficient at terminating many thousands of connections and provides in-line buffering (should a destination become temporarily unavailable), runtime destination manipulation ("Let's mirror all our production metrics to x.x.x.x"), and failover redistribution (in round-robin mode: if node C fails, in-flight metrics for that destination are redistributed to nodes A and B).

Polymur was created to introduce more flexibility into the way metrics streams are managed and to reduce the total number of components needed to operate Graphite deployments. It's built in a highly concurrent fashion, so when it's used as a Carbon relay upstream from your Graphite servers it doesn't need multiple instances per node behind a local load balancer. When it's used as a Carbon relay on your Graphite server to distribute metrics to Carbon-cache daemons, the daemons can register themselves on start using Polymur's simple API.

Polymur replacing upstream relays

Terminating connections from all sending hosts and distributing to downstream Graphite servers:

[Image: diagram of Polymur replacing upstream relays]

Polymur replacing local relays

Polymur running on a Graphite server in hash-routing mode, distributing metrics from upstream relays to local carbon-cache daemons:

[Image: diagram of Polymur replacing local relays]

Installation

  • go get github.com/jamiealquiza/polymur
  • go install github.com/jamiealquiza/polymur
  • Binary will be found at $GOPATH/bin/polymur

Usage

Usage of ./polymur:
  -api-addr string
        API listen address (default "localhost")
  -api-port string
        API listen port (default "2030")
  -console-out
        Dump output to console
  -destinations string
        Comma-delimited list of ip:port destinations
  -distribution string
        Destination distribution methods: broadcast, balance-rr, balance-hr (default "broadcast")
  -listen-addr string
        Polymur listen address (default "0.0.0.0")
  -listen-port string
        Polymur listen port (default "2003")
  -metrics-flush int
        Graphite flush interval for runtime metrics (0 is disabled)
  -queue-cap int
        In-flight message queue capacity to any single destination (default 4096)
  -stat-addr string
        runstats listen address (default "localhost")
  -stat-port string
        runstats listen port (default "2020")

Examples

Running

Listening for incoming metrics on 0.0.0.0:2003 and mirroring to 10.0.5.20:2003 and 10.0.5.30:2003:

./polymur -destinations=10.0.5.20:2003,10.0.5.30:2003 -metrics-flush=30 -listen-port=2003 -listen-addr=0.0.0.0 -distribution="broadcast"
2015/05/14 15:19:08 Registered destination 10.0.5.20:2003
2015/05/14 15:19:08 Registered destination 10.0.5.30:2003
2015/05/14 15:19:08 Adding destination to connection pool: 10.0.5.30:2003
2015/05/14 15:19:08 Adding destination to connection pool: 10.0.5.20:2003
2015/05/14 15:19:09 API started: localhost:2030
2015/05/14 15:19:09 Runstats started: localhost:2020
2015/05/14 15:19:09 Metrics listener started: 0.0.0.0:2003
2015/05/14 15:19:14 Last 5s: Received 7276 data points | Avg: 1455.20/sec. | Inbound queue length: 0
2015/05/14 15:19:19 Last 5s: Received 6471 data points | Avg: 1294.20/sec. | Inbound queue length: 0
2015/05/14 15:19:24 Last 5s: Received 6744 data points | Avg: 1348.80/sec. | Inbound queue length: 0
2015/05/14 15:19:29 Last 5s: Received 5806 data points | Avg: 1161.20/sec. | Inbound queue length: 0
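
For testing a running instance, a sender can be sketched in a few lines of Go. This is only an illustration, assuming Polymur is listening on its default localhost:2003; the metric name and the helper formatMetric are made up for the example and are not part of Polymur.

```go
package main

import (
	"fmt"
	"net"
	"os"
	"strconv"
	"time"
)

// formatMetric renders one Graphite plaintext message:
// "<metric path> <value> <unix timestamp>" terminated by LF.
// (Hypothetical helper, not part of Polymur.)
func formatMetric(path string, value float64, unixTime int64) string {
	v := strconv.FormatFloat(value, 'f', -1, 64)
	return path + " " + v + " " + strconv.FormatInt(unixTime, 10) + "\n"
}

// send opens a TCP connection to addr and writes a single message.
func send(addr, msg string) error {
	conn, err := net.DialTimeout("tcp", addr, 5*time.Second)
	if err != nil {
		return err
	}
	defer conn.Close()
	_, err = conn.Write([]byte(msg))
	return err
}

func main() {
	// Assumed address: the default -listen-port on the local host.
	msg := formatMetric("servers.web01.load.shortterm", 0.42, time.Now().Unix())
	if err := send("localhost:2003", msg); err != nil {
		fmt.Fprintln(os.Stderr, "send failed:", err)
		os.Exit(1)
	}
}
```

Each message sent this way should be counted in the "Received N data points" log lines.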

Changing destinations

Polymur started with no initial destinations:

% echo getdest | nc localhost 2030
{
 "active": [],
 "registered": {}
}

Add destination at runtime:

% echo putdest localhost:6020 | nc localhost 2030
Registered destination: localhost:6020

% echo getdest | nc localhost 2030
{
 "active": [
  "localhost:6020"
 ],
 "registered": {
  "localhost:6020": "2015-05-14T09:09:23.620410265-06:00"
 }
}

Polymur output:

./polymur -distribution="balance-rr" -listen-port="2003"
2015/05/14 09:09:15 Metrics listener started: 0.0.0.0:2003
2015/05/14 09:09:15 API started: localhost:2030
2015/05/14 09:09:15 Runstats started: localhost:2020
2015/05/14 09:09:23 Registered destination localhost:6020
2015/05/14 09:09:23 Adding destination to connection pool: localhost:6020
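
The same API calls can be made from a program, for example by a daemon registering itself on start. The Go sketch below mirrors the nc examples above: it sends a newline-terminated command to the API port and decodes the getdest reply. It assumes the API answers after a single command line and that the reply has the JSON shape shown above; the type and function names are hypothetical.

```go
package main

import (
	"bytes"
	"encoding/json"
	"fmt"
	"net"
	"os"
	"time"
)

// destinations mirrors the getdest JSON shown above.
type destinations struct {
	Active     []string          `json:"active"`
	Registered map[string]string `json:"registered"`
}

// parseDestinations decodes a getdest reply.
func parseDestinations(data []byte) (destinations, error) {
	var d destinations
	err := json.Unmarshal(data, &d)
	return d, err
}

// apiCommand sends one command line and returns whatever the API writes
// back before closing the connection or before the 5s deadline expires.
func apiCommand(addr, cmd string) ([]byte, error) {
	conn, err := net.DialTimeout("tcp", addr, 5*time.Second)
	if err != nil {
		return nil, err
	}
	defer conn.Close()
	if err := conn.SetDeadline(time.Now().Add(5 * time.Second)); err != nil {
		return nil, err
	}
	if _, err := fmt.Fprintf(conn, "%s\n", cmd); err != nil {
		return nil, err
	}
	var buf bytes.Buffer
	_, err = buf.ReadFrom(conn)
	// A timeout after receiving data is treated as the end of the reply.
	if ne, ok := err.(net.Error); ok && ne.Timeout() && buf.Len() > 0 {
		err = nil
	}
	return buf.Bytes(), err
}

func main() {
	const api = "localhost:2030" // default -api-addr and -api-port
	if _, err := apiCommand(api, "putdest localhost:6020"); err != nil {
		fmt.Fprintln(os.Stderr, "putdest:", err)
		os.Exit(1)
	}
	raw, err := apiCommand(api, "getdest")
	if err != nil {
		fmt.Fprintln(os.Stderr, "getdest:", err)
		os.Exit(1)
	}
	d, err := parseDestinations(raw)
	if err != nil {
		fmt.Fprintln(os.Stderr, "decode:", err)
		os.Exit(1)
	}
	fmt.Println("active destinations:", d.Active)
}
```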

Internals

Terminology:

  • Destination: where metrics will be forwarded using the Graphite plaintext protocol
  • Destination queue: metrics in-flight for a given destination
  • Registered: a candidate destination loaded into Polymur, but not necessarily active
  • Connection: a registered destination with an active connection
  • Connection pool: global list of all active connections and their respective destination queue
  • Distribution mode: how metrics are distributed to destinations (broadcast, round-robin, hash-routed)
  • Retry queue: messages that couldn't be sent to their destination are loaded into the retry queue (e.g. a round-robin node is removed from the connection pool)

Polymur listens on the configured addr:port for incoming connections, handling each connection in a dedicated goroutine. A connection goroutine reads the inbound stream and allocates a message string at each LF boundary. To reduce channel operations, messages are accumulated as slices of string pointers and flushed to a shared inbound queue either after 5 seconds or when the slice reaches 30 elements.
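
The size-or-time flush described above can be sketched as follows. This is not Polymur's actual source, only a minimal illustration of the pattern; the names batcher and batchSizes are hypothetical, and the 30-element and 5-second limits are taken from the description.

```go
package main

import (
	"fmt"
	"time"
)

// batcher collects messages from in and sends them to out in batches,
// flushing when a batch reaches size elements or when interval elapses.
func batcher(in <-chan string, out chan<- []*string, size int, interval time.Duration) {
	defer close(out)
	ticker := time.NewTicker(interval)
	defer ticker.Stop()

	batch := make([]*string, 0, size)
	flush := func() {
		if len(batch) == 0 {
			return
		}
		out <- batch
		batch = make([]*string, 0, size)
	}

	for {
		select {
		case msg, ok := <-in:
			if !ok { // input closed: flush what is left and stop
				flush()
				return
			}
			batch = append(batch, &msg)
			if len(batch) >= size {
				flush()
			}
		case <-ticker.C:
			flush()
		}
	}
}

// batchSizes feeds n messages through a batcher and reports the size of
// each flushed batch.
func batchSizes(n int) []int {
	in := make(chan string)
	out := make(chan []*string)
	go batcher(in, out, 30, 5*time.Second)
	go func() {
		for i := 0; i < n; i++ {
			in <- fmt.Sprintf("test.metric.%d 1 1431638348", i)
		}
		close(in)
	}()
	var sizes []int
	for b := range out {
		sizes = append(sizes, len(b))
	}
	return sizes
}

func main() {
	for _, s := range batchSizes(65) {
		fmt.Printf("flushed batch of %d messages\n", s)
	}
}
```

Batching means one channel send per group of up to 30 messages rather than one per message, which is the reduction in channel operations mentioned above.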

Message batches from the inbound queue are distributed (broadcast, round-robin or hash-routed) to a dedicated queue for each output destination. Destination output is also handled by dedicated goroutines, so temporary latency or a full disconnect on one destination does not impact write performance to another. If a destination becomes unreachable, the endpoint is retried at 15-second intervals while the respective destination queue buffers incoming messages. Per-destination queue capacity is set by the -queue-cap directive. Any destination queue with an outstanding length greater than 0 is logged to stdout. Any destination queue that exceeds the configured -queue-cap will not receive new messages until the queue is cleared. If the distribution mode is round-robin, three consecutive failed reconnect attempts result in the connection being removed from the connection pool and any in-flight messages being redistributed to the retry queue.
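
The three distribution modes and the bounded per-destination queues can be illustrated with a small Go sketch. Everything here is an assumption made for illustration: the distributor type is hypothetical, the hash-routing uses FNV-1a modulo the destination count (the text does not say which hash Polymur uses), and a full queue accepts messages again as soon as any space frees up, which simplifies the "until the queue is cleared" behavior described above. Retry and reconnect handling are left out.

```go
package main

import (
	"fmt"
	"hash/fnv"
	"strings"
)

// distributor routes messages to one bounded queue per destination.
type distributor struct {
	mode   string // "broadcast", "balance-rr" or "balance-hr"
	dests  []string
	queues map[string]chan string
	next   int // round-robin cursor
}

func newDistributor(mode string, dests []string, queueCap int) *distributor {
	d := &distributor{mode: mode, dests: dests, queues: make(map[string]chan string)}
	for _, dest := range dests {
		d.queues[dest] = make(chan string, queueCap)
	}
	return d
}

// pick returns the destinations that should receive msg.
func (d *distributor) pick(msg string) []string {
	if len(d.dests) == 0 {
		return nil
	}
	switch d.mode {
	case "balance-rr":
		dest := d.dests[d.next%len(d.dests)]
		d.next++
		return []string{dest}
	case "balance-hr":
		// Hash only the metric name so a series always maps to one destination.
		name := strings.SplitN(msg, " ", 2)[0]
		h := fnv.New32a()
		h.Write([]byte(name))
		return []string{d.dests[int(h.Sum32()%uint32(len(d.dests)))]}
	default: // broadcast
		return d.dests
	}
}

// send enqueues msg without blocking and returns how many destination
// queues were full and therefore did not receive it.
func (d *distributor) send(msg string) (rejected int) {
	for _, dest := range d.pick(msg) {
		select {
		case d.queues[dest] <- msg:
		default:
			rejected++
		}
	}
	return rejected
}

func main() {
	d := newDistributor("balance-rr", []string{"10.0.5.20:2003", "10.0.5.30:2003"}, 4096)
	for i := 0; i < 10; i++ {
		d.send(fmt.Sprintf("test.metric.%d 1 1431638348", i))
	}
	for _, dest := range d.dests {
		fmt.Printf("%s queue length: %d\n", dest, len(d.queues[dest]))
	}
}
```

Using a non-blocking send keeps a slow or unreachable destination from stalling delivery to the others, which is the isolation property the paragraph above describes.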

FAQ

Why no pickle protocol?

Pickling is a native Python construct; Polymur is written in Go (acknowledging that 3rd party libraries exist). More importantly, I have yet to encounter a situation where the network was starved before a Carbon daemon became CPU bound, and data serialization certainly doesn't improve that situation. That said, I will be adding protobuf for Polymur-to-Polymur communication.