Shoehorn helps you handle OTP application failures


Keywords
nerves
License
Apache-2.0

Shoehorn

Motivation

By default, the Erlang VM exits when OTP applications unexpectedly stop. This can happen if an application's Application.start/2 callback crashes or if a GenServer crashes repeatedly and takes down the application's supervision tree. Either way, recovery needs to happen outside of the Erlang VM.

Shoehorn provides a way of handling this inside the Erlang VM to allow you to debug, restart an application, switch to a recovery mode, or something else of your choosing. It does this by creating a custom release start script (shoehorn.boot) and exposing the Shoehorn.Handler behaviour for your code to decide what to do. The custom release start script turns off the default OTP application mode that exits the VM on unexpected errors and orders application starts to make sure that the handler is available.

Shoehorn has another benefit of letting you influence the OTP application start order. Dependencies still determine the overall ordering, but it's possible to sort applications earlier via Shoehorn's :init option. This can let you improve the apparent release startup time on slow platforms.

Usage

Run mix release.init on your project and then add shoehorn to your mix releases configuration in the mix.exs (replace :simple_app):

defmodule SimpleApp.MixProject do
  use Mix.Project

  def project do
    [
      ...
      releases: releases()
    ]
  end

  def releases do
    [
      simple_app: [
        steps: [&Shoehorn.Release.init/1, :assemble]
      ]
    ]
  end

  defp deps do
    [
      {:shoehorn, "~> 0.9.2"}
    ]
  end
end

Create a release:

mix release

Next, take a look at the start script to see how your application will now be started and how that compares to the default start script. Open _build/dev/rel/simple_app/releases/0.1.0/shoehorn.script and go to the end. You should see something like the following:

     {progress,applications_loaded},
     {apply,{application,start_boot,[kernel,permanent]}},
     {apply,{application,start_boot,[stdlib,permanent]}},
     {apply,{application,start_boot,[compiler,permanent]}},
     {apply,{application,start_boot,[elixir,permanent]}},
     {apply,{application,start_boot,[logger,permanent]}},
     {apply,{application,start_boot,[crypto,permanent]}},
     {apply,{application,start_boot,[shoehorn,permanent]}},
     {apply,{application,start_boot,[sasl,permanent]}},
     {apply,{application,start_boot,[simple_app,temporary]}},
     {progress,started}

This shows the order in which applications will be started and their mode. Applications marked permanent will cause the VM to exit if they stop unexpectedly. Shoehorn changes as many applications as it can to temporary so that it (and by extension, you) can control what happens.

To start your release using the shoehorn boot script, run:

RELEASE_BOOT_SCRIPT=shoehorn _build/dev/rel/simple_app/bin/simple_app start_iex

It should work as expected with the possible exception that the Erlang VM won't exit for any of the OTP applications marked temporary.

Now let's configure shoehorn to do something more interesting by adding some minimal configuration. This is hypothetical unless you're using Nerves:

# config/config.exs

config :shoehorn,
  init: [:nerves_runtime, :nerves_pack]

Shoehorn will generate a release script that starts :nerves_runtime and its dependencies as soon as it can. Then it will start :nerves_pack and its dependencies. Then it will start the remainder of the applications in the project. Inspect the shoehorn.script file in the release directory to verify this.

Use the init application list to prioritize OTP applications that are needed early on or for error recovery. In the example above, we initialize the runtime, bring up the network (in :nerves_pack), and ensure that we can receive new firmware updates. Now, if simple_app fails to start, the device would still be in a state where it can receive new firmware over the network.

Handling application failures

The Erlang VM will respond to application failures differently, depending on the mode specified when the application started. The modes are:

  • :permanent - if the application terminates, all other applications and the entire node are also terminated.
  • :transient - if the application terminates with :normal reason, it is reported but no other applications are terminated. However, if the application terminates abnormally, all other applications and the entire node are also terminated.
  • :temporary - if the application terminates, it is reported but no other applications are terminated (the default behaviour).

Unless overridden in the Mix release using the :applications option, Shoehorn starts most applications as :temporary and monitors application events by registering with the Erlang error_logger.
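As a rough illustration of overriding a mode, the fragment below extends the releases configuration from the Usage section with Mix's :applications release option. Marking :simple_app as :permanent is only an assumption for the sake of the example; pick the mode that matches how you want the node to react.

```elixir
# mix.exs (fragment; belongs inside your project's Mix module)
def releases do
  [
    simple_app: [
      steps: [&Shoehorn.Release.init/1, :assemble],
      # Illustrative choice: keep :simple_app permanent so that the VM
      # exits if it stops, instead of handing the event to Shoehorn.
      applications: [simple_app: :permanent]
    ]
  ]
end
```

After rebuilding the release, the last lines of shoehorn.script should show the mode that was applied to each application.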

On application start and exit events, Shoehorn attempts to invoke a callback on the configured Shoehorn.Handler module. By default, the module Shoehorn.DefaultHandler is called. This module is configured to keep the Erlang VM running if any OTP application exits, for any reason. In production, you may want to customize the action on failure so you can gather forensics or perform updates to the node. You can do this by overriding the handler in the prod env of your application config.

# config/prod.exs

config :shoehorn,
  handler: SimpleApp.ShoehornHandler
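A minimal sketch of what the handler named in this config might look like is shown here. It logs every application exit and then halts the node. The state shape (a map with an :exits list) and the log message are assumptions made for illustration; the callback names and return tuples follow the Shoehorn.Handler callbacks described in this README, so check them against the Shoehorn version you depend on.

```elixir
defmodule SimpleApp.ShoehornHandler do
  # Needs the :shoehorn dependency so that the behaviour module is available.
  @behaviour Shoehorn.Handler

  require Logger

  # Hypothetical state: remember which applications have exited.
  def init(_opts), do: {:ok, %{exits: []}}

  # Nothing to do on start; let the boot continue.
  def application_started(_app, state), do: {:continue, state}

  # Log the failure, record it, and ask Shoehorn to halt the node.
  def application_exited(app, reason, state) do
    Logger.error("#{inspect(app)} exited: #{inspect(reason)}")
    {:halt, %{state | exits: [app | state.exits]}}
  end
end
```

Returning {:continue, state} from application_exited/3 instead would keep the node running, which is what the default handler does.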

More advanced failure cases can be handled by providing your own module that implements the Shoehorn.Handler behaviour. For example, the Erlang :ssh application used to exit when subjected to a brute force attack (this seems like it has been fixed). Instead of the default production behaviour of forcing the node to restart, we can restart the application.

defmodule Example.RestartHandler do
  @behaviour Shoehorn.Handler

  # Logger.error/1 is a macro, so Logger must be required first.
  require Logger

  def init(_opts) do
    {:ok, :no_state}
  end

  def application_started(_app, state) do
    {:continue, state}
  end

  def application_exited(:ssh, _reason, state) do
    Logger.error("Stop bothering ssh!")
    Process.sleep(1000)
    Application.ensure_all_started(:ssh)
    {:continue, state}
  end

  def application_exited(app, _reason, state) do
    Logger.error("Application stopped! #{inspect(app)} #{inspect(state)}")
    {:halt, state}
  end
end

The application_exited/3 callback has a limited amount of time to execute, enforced by a shutdown timer. If the callback does not return within the shutdown time, the node is instructed to halt. The default shutdown time is 30 seconds, but this value can be changed in the application config:

# config/config.exs

config :shoehorn,
  shutdown_timer: 50_000 # 50 Seconds
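To make the idea of a time-limited callback concrete, here is a small standalone sketch that bounds how long a function may run using Elixir's Task module. This is not Shoehorn's implementation; the TimeLimit module and its run/2 function are hypothetical names used only to show the general pattern of "run, wait up to a deadline, give up otherwise".

```elixir
defmodule TimeLimit do
  # Runs `fun` in a separate task and waits at most `timeout_ms` for it.
  # Returns {:ok, result} if it finished in time, or :timeout otherwise.
  def run(fun, timeout_ms) when is_function(fun, 0) do
    task = Task.async(fun)

    # Task.yield/2 returns nil on timeout; in that case the task is killed.
    case Task.yield(task, timeout_ms) || Task.shutdown(task, :brutal_kill) do
      {:ok, result} -> {:ok, result}
      {:exit, reason} -> {:exit, reason}
      nil -> :timeout
    end
  end
end
```

A handler that may block (for example, while uploading logs) can use the same pattern internally to stay well inside the configured shutdown_timer.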

Have a look at the example application for more info on implementing custom strategies.