SaltStack: Automated CloudWatch alarm management for AWS resources

For the Salt 2014.7 release, we (Lyft) upstreamed a number of Salt execution and state modules for AWS. These modules manage various AWS resources. For most of the resources you create, you'll probably want to add CloudWatch alarms to go along with them. It's not really difficult to do:
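As a sketch of what that can look like, here is a hypothetical alarm on an ELB's unhealthy host count. The resource names and ARN are illustrative, and the exact attribute names should be checked against the boto_cloudwatch_alarm state docs:

```yaml
my-elb-unhealthy-hosts:
  boto_cloudwatch_alarm.present:
    - name: my-elb-unhealthy-hosts
    - attributes:
        metric: UnHealthyHostCount
        namespace: AWS/ELB
        statistic: Average
        comparison: '>='
        threshold: 1.0
        period: 60
        evaluation_periods: 5
        description: 'Alarm when any host behind my-elb is unhealthy'
        dimensions:
          LoadBalancerName:
            - my-elb
        alarm_actions:
          - 'arn:aws:sns:us-east-1:123456789012:my-alerts'
    - region: us-east-1
</imports>
```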

Truly ordered execution using SaltStack (Part 2)

A while back I wrote a post about sequentially ordered SaltStack execution. 2014.7 (Helium) has been released and the listen/listen_in feature I described is now generally available. It’s been about 6 months since I’ve been using Salt in a sequentially ordered manner and there’s some other patterns I’ve picked up here. Particularly, there’s a couple gotchas to watch out for: includes and Jinja.
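As a minimal sketch of the feature (file, package, and service names are illustrative): listen lets a service restart at the very end of the run when its config changes, without the ordering side effects that watch's implied require introduces:

```yaml
/etc/myapp/myapp.conf:
  file.managed:
    - source: salt://myapp/myapp.conf

myapp:
  service.running:
    - listen:
      - file: /etc/myapp/myapp.conf
```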

Includes imply a requirement between modules. Requirements can modify ordering, so it’s important to be strict about how you handle them. For example, when reading the following, remember that include implies require:


Using Lua in Nginx for unique request IDs and millisecond times in logs

Nginx is awesome, but it’s missing some common features. For instance, a common thing to add to access logs is a unique ID per request, so that you can track the flow of a single request through multiple services. Another thing it’s missing is the ability to log request_time in milliseconds, rather than seconds with a millisecond granularity. Using Lua, we can add these features ourselves.

I’ll show the whole solution, then I’ll break it down into parts:
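As a rough sketch of the shape of the solution (variable names are my own, and this assumes nginx is built with the ngx_lua module): generate the ID in the rewrite phase with Lua, compute the millisecond request time in the log phase, and reference both in the log_format:

```nginx
log_format combined_id '$remote_addr - $remote_user [$time_local] '
                       '"$request" $status $body_bytes_sent '
                       '$request_time_ms $request_uuid';

server {
    set $request_uuid '';
    set $request_time_ms '';

    rewrite_by_lua '
        -- a sketch only; production use wants a stronger randomness
        -- source than math.random
        ngx.var.request_uuid = string.format("%08x%08x",
            math.random(0, 0x7fffffff), math.random(0, 0x7fffffff))
    ';

    log_by_lua '
        -- ngx_lua log handlers run before the access log is written,
        -- so the variable is populated in time
        ngx.var.request_time_ms = math.floor(
            (ngx.now() - ngx.req.start_time()) * 1000)
    ';

    access_log /var/log/nginx/access.log combined_id;
}
```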

Reloading grains and pillars during a SaltStack run

If you use the grain/state pattern a lot, or if you use external pillars, you've probably stumbled upon a limitation with grains and pillars.

During a Salt run, if you set a grain or update an external pillar, the change won't be reflected in the grains and pillar dictionaries. This is because you've updated the data, but it hasn't been reloaded into the in-memory data structures that Salt creates at the beginning of the run. From a performance point of view this is good, since reloading grains, and especially loading external pillars, is quite slow.
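One way to force a reload mid-run, sketched below, is to call into saltutil from a state; whether the refreshed data is visible to the states that follow depends on your Salt version, so treat this as a starting point rather than a guarantee:

```yaml
refresh-pillar:
  module.run:
    - name: saltutil.refresh_pillar
```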

Config Management Code Reuse isn’t Always the Right Approach

Have you ever looked at an upstream DSL module and thought: “this is exactly what I need!”? If you're using multiple distros, multiple releases of those distros, and/or multiple operating systems, you may say this occasionally. Maybe you also say this if you have a single ops group that handles all of your infrastructure.

I’ve rarely been happy with upstream modules, even in shops with a single ops group. They are more complex than I want and are always missing something I need. Why is this?

  1. They need to be abstract enough to support multiple distros, multiple releases of those distros and often multiple operating systems.

SaltStack Development: Behavior of Exceptions in Modules

The SaltStack developer docs are missing information about exceptions that can be thrown and how the state system and the CLI behave when they are thrown.

Thankfully this is easy to test and is actually a pretty good development exercise. So, let’s write an execution module, a state module, and an sls file, then run them to determine the behavior.

A simple example execution module


from salt.exceptions import CommandExecutionError

def example(name):
    if name == 'succeed':
        return True
    elif name == 'fail':
        return False
    raise CommandExecutionError('Example function failed due to unexpected input.')

A simple example state module
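Here is a sketch of what the matching state module might look like. The stubbed pieces at the top (CommandExecutionError and __salt__) stand in for what Salt provides at runtime — in a real state module you'd import CommandExecutionError from salt.exceptions and the loader would inject __salt__ — they're inlined here only so the example runs standalone:

```python
# Stand-ins for what Salt injects; see the lead-in above.
class CommandExecutionError(Exception):
    pass


def _example_execution(name):
    # The execution module function from the previous example.
    if name == 'succeed':
        return True
    elif name == 'fail':
        return False
    raise CommandExecutionError('Example function failed due to unexpected input.')


__salt__ = {'example.example': _example_execution}


def example(name):
    '''
    Example state: succeed, fail, or let an exception propagate.
    '''
    ret = {'name': name, 'changes': {}, 'result': False, 'comment': ''}
    # Deliberately not catching CommandExecutionError here, so we can
    # observe how the state system and CLI behave when it propagates.
    result = __salt__['example.example'](name)
    if result:
        ret['result'] = True
        ret['comment'] = 'Example succeeded.'
    else:
        ret['comment'] = 'Example failed.'
    return ret
```

An sls file to exercise it can then be three states calling example.example with succeed, fail, and anything else, to trigger each of the three code paths.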


Dealing with splunkforwarder via Config Management

The splunkforwarder package is very poorly written, at least for Debian/Ubuntu. There’s a number of things it does that make it difficult to use:

  1. It installs a splunk user and group, but doesn’t install them as system users/groups, so they’ll conflict with your uids/gids.
  2. It requires manual interaction the first time you start the daemon, on every single system it’s installed on.
  3. It modifies its configuration files when the daemon restarts.

The first is an honest mistake, but the last two put me into a blind rage. There’s not great documentation about how to work around this, so to avoid other opsen going into rages here’s how you can handle this shitty package:
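A hedged sketch of the Salt side of this — the paths, the first-run guard, and the license flags are from memory and worth verifying against your splunkforwarder version:

```yaml
# Create the user and group as system accounts before the package
# gets a chance to create them with normal uids/gids.
splunk:
  group.present:
    - system: True
  user.present:
    - system: True
    - gid_from_name: True
    - require:
      - group: splunk

splunkforwarder:
  pkg.installed:
    - require:
      - user: splunk

# Accept the license non-interactively instead of the manual
# first-start prompt on every system.
accept-splunk-license:
  cmd.run:
    - name: /opt/splunkforwarder/bin/splunk start --accept-license --answer-yes --no-prompt
    - onlyif: test -f /opt/splunkforwarder/ftr  # illustrative first-run guard
    - require:
      - pkg: splunkforwarder
```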

Concurrent and Queued SaltStack State Runs

The function "state.highstate" is running as PID 17587 and was started at 2014, Aug 29 23:21:46.540749 with jid 20140829232146540749

Ever get an error like that? Salt doesn’t allow more than a single state run to occur at a time, to ensure that multiple state runs can’t interfere with each other. This is really important, for instance, if you run highstate on a schedule, since a second highstate might be called before the previous one has finished.

What if you use states as simple function calls, for frequent actions, though? Or what if you’re using Salt for orchestration and want to run multiple salt-calls for a number of different salt state files concurrently?
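Two options worth knowing about here (assuming 2014.7 or newer): queue=True makes a state run wait for the in-progress run to finish, and concurrent=True opts out of the interference check entirely, so use it with care:

```shell
# Queue behind any in-progress run instead of erroring out:
salt-call state.highstate queue=True

# Allow this run to proceed alongside others (no interference protection):
salt-call state.sls mystates concurrent=True
```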

SaltStack Patterns: Grain/State

It’s occasionally necessary to do actions in configuration management that aren’t easy to define in an idempotent way. For instance, sometimes you need to do an action only the first time your configuration management runs, or you need to fetch some static information from an external source, or you want to put instances in a specific state for a temporary period of time.

In SaltStack (Salt) a common pattern for handling this is what I call the Grain/State pattern. Salt’s grains are relatively static, but it’s possible to add, update, or delete custom grains during a state run, or outside of a state run either by salt-call locally or through remote execution. Grains can be used for conditionals inside of state runs to control the state of the system dynamically.
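A minimal sketch of the pattern (the grain, command, and state names are illustrative): run an action only when a grain is unset, then set the grain so the action never runs again:

```yaml
{% if not salt['grains.get']('myapp_bootstrapped', False) %}
run-once-bootstrap:
  cmd.run:
    - name: /usr/local/bin/bootstrap-myapp

mark-bootstrapped:
  grains.present:
    - name: myapp_bootstrapped
    - value: True
    - require:
      - cmd: run-once-bootstrap
{% endif %}
```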

A SaltStack Highstate Killswitch

On rare occasion it’s necessary to debug a condition on a system by making temporary changes to the running system. If you’re using config management, especially as part of your deployment process, it’s necessary to disable it so that your temporary changes won’t be reset. salt-call doesn’t natively have a mechanism for this like Puppet does (puppet agent --disable; puppet agent --enable). It’s possible to do this yourself, though.

This requires that you’re using the failhard option in your configuration, that you’re using the 2014.7 (Helium) or above release, and also assumes you have some base state that is always included and is always included first.
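A sketch of the idea (the killswitch path is illustrative): put a state at the top of that always-included base state that fails whenever a killswitch file exists, so with failhard enabled the entire run aborts immediately:

```yaml
# At the top of the base sls that everything includes first.
highstate-killswitch:
  cmd.run:
    - name: 'echo "highstate disabled on this host"; exit 1'
    - onlyif: test -f /etc/salt/highstate_disabled
```

Disabling is then `touch /etc/salt/highstate_disabled`, and re-enabling is removing the file.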