SaltStack: Automated cloudwatch alarm management for AWS resources

For the Salt 2014.7 release, we (Lyft) upstreamed a number of Salt execution and state modules for AWS. These modules manage various AWS resources. For most of the resources you create, you'll probably want to add CloudWatch alarms to go along with them. It isn't difficult to do:

Ensure myelb exists:
  boto_elb.present:
    - name: myelb
    - subnets:
        - subnet-12345
        - subnet-12346
        - subnet-12347
    - security_groups:
        - elb
    - listeners:
        - elb_port: 443
          instance_port: 80
          elb_protocol: HTTPS
          instance_protocol: HTTP
          certificate: 'arn:aws:iam::1234:server-certificate/mycert'
    - health_check:
        target: 'HTTP:80/'
    - attributes:
        cross_zone_load_balancing:
          enabled: true
        connection_draining:
          enabled: true
          timeout: 30
    - profile: myprofile

Ensure myelb ELB UnHealthyHostCount alarm exists:
  boto_cloudwatch_alarm.present:
    - name: 'myelb ELB UnHealthyHostCount **MANAGED BY SALT**'
    - attributes:
        metric: UnHealthyHostCount
        namespace: AWS/ELB
        statistic: Average
        comparison: '>='
        threshold: 1.0
        period: 600
        evaluation_periods: 6
        unit: null
        description: myelb ELB UnHealthyHostCount >= 1
        dimensions:
          LoadBalancerName: [myelb]
        alarm_actions: ['arn:aws:sns:us-east-1:12345:alarm-myservice']
        ok_actions: ['arn:aws:sns:us-east-1:12345:alarm-myservice']
        insufficient_data_actions: []
    - profile: myprofile

It's not difficult, but there are a few problems here:

  1. It’s really verbose.
  2. You have to remember to do it for every resource.
  3. For most (or all) of your ELBs, you’ll probably want the exact same alarm.

We believe that it should be easy to do the right thing and difficult to do the wrong thing. In SaltStack 2015.2 (in feature freeze at the time of this writing) we’ve made it possible to have some resources manage their own alarms. Even better, we’ve also made it possible to define these alarms through pillars, and for each resource type we use a default pillar key, so that you can define an alarm once and have it apply automatically for all resources of that type.

For example, here’s how to define an ELB resource to manage its alarm:

Ensure myelb exists:
  boto_elb.present:
    - name: myelb
    - subnets:
        - subnet-12345
        - subnet-12346
        - subnet-12347
    - security_groups:
        - elb
    - listeners:
        - elb_port: 443
          instance_port: 80
          elb_protocol: HTTPS
          instance_protocol: HTTP
          certificate: 'arn:aws:iam::1234:server-certificate/mycert'
    - health_check:
        target: 'HTTP:80/'
    - attributes:
        cross_zone_load_balancing:
          enabled: true
        connection_draining:
          enabled: true
          timeout: 30
    - alarms:
        UnHealthyHostCount:
          name: 'ELB UnHealthyHostCount **MANAGED BY SALT**'
          attributes:
            metric: UnHealthyHostCount
            namespace: AWS/ELB
            statistic: Average
            comparison: '>='
            threshold: 1.0
            period: 600
            evaluation_periods: 6
            unit: null
            description: ELB UnHealthyHostCount >= 1
            alarm_actions: ['arn:aws:sns:us-east-1:12345:alarm-myservice']
            ok_actions: ['arn:aws:sns:us-east-1:12345:alarm-myservice']
            insufficient_data_actions: []
    - profile: myprofile

The name defined for the alarm will have the ELB name prepended, to ensure uniqueness. In this example, the resulting alarm name is 'myelb ELB UnHealthyHostCount **MANAGED BY SALT**'.

If we want the same alarm to apply to multiple ELBs, we can move it to a pillar:

my_elb_alarms:
  UnHealthyHostCount:
    name: 'ELB UnHealthyHostCount **MANAGED BY SALT**'
    attributes:
      metric: UnHealthyHostCount
      namespace: AWS/ELB
      statistic: Average
      comparison: '>='
      threshold: 1.0
      period: 600
      evaluation_periods: 6
      unit: null
      description: ELB UnHealthyHostCount >= 1
      alarm_actions: ['arn:aws:sns:us-east-1:12345:alarm-myservice']
      ok_actions: ['arn:aws:sns:us-east-1:12345:alarm-myservice']
      insufficient_data_actions: []

Then we can define the ELB resource like so:

Ensure myelb exists:
  boto_elb.present:
    - name: myelb
    - subnets:
        - subnet-12345
        - subnet-12346
        - subnet-12347
    - security_groups:
        - elb
    - listeners:
        - elb_port: 443
          instance_port: 80
          elb_protocol: HTTPS
          instance_protocol: HTTP
          certificate: 'arn:aws:iam::1234:server-certificate/mycert'
    - health_check:
        target: 'HTTP:80/'
    - attributes:
        cross_zone_load_balancing:
          enabled: true
        connection_draining:
          enabled: true
          timeout: 30
    - alarms_from_pillars: my_elb_alarms
    - profile: myprofile

boto_elb_alarms is a special pillar key: if it's defined, all boto_elb resources will use it by default. So if we rename the pillar key from my_elb_alarms to boto_elb_alarms, we can remove the 'alarms_from_pillars' line from the ELB resource and the alarm will still apply.
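
For example, the default pillar looks just like my_elb_alarms above, only under the boto_elb_alarms key; with it in place, every boto_elb.present state picks up the alarm with no extra arguments:

boto_elb_alarms:
  UnHealthyHostCount:
    name: 'ELB UnHealthyHostCount **MANAGED BY SALT**'
    attributes:
      metric: UnHealthyHostCount
      namespace: AWS/ELB
      statistic: Average
      comparison: '>='
      threshold: 1.0
      period: 600
      evaluation_periods: 6
      unit: null
      description: ELB UnHealthyHostCount >= 1
      alarm_actions: ['arn:aws:sns:us-east-1:12345:alarm-myservice']
      ok_actions: ['arn:aws:sns:us-east-1:12345:alarm-myservice']
      insufficient_data_actions: []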

Notice that alarm_actions and ok_actions in the above alarm specify ARNs for a specific service? If you're defining alarms for multiple services, you'll need to make those actions specific to each service.

Our orchestration is a bit custom: we call a set of states for each service, passing in environment variables, which set custom grains. We use those grains in the pillar-defined alarms, so that we can generically set the service name in the pillars. However, you don't need our custom implementation to make this work for you. It's possible to override values in the alarms:

Ensure myelb exists:
  boto_elb.present:
    - name: myelb
    - subnets:
        - subnet-12345
        - subnet-12346
        - subnet-12347
    - security_groups:
        - elb
    - listeners:
        - elb_port: 443
          instance_port: 80
          elb_protocol: HTTPS
          instance_protocol: HTTP
          certificate: 'arn:aws:iam::1234:server-certificate/mycert'
    - health_check:
        target: 'HTTP:80/'
    - attributes:
        cross_zone_load_balancing:
          enabled: true
        connection_draining:
          enabled: true
          timeout: 30
    - alarms:
        UnHealthyHostCount:
          attributes:
            alarm_actions: ['arn:aws:sns:us-east-1:12345:alarm-mycustomservice']
            ok_actions: ['arn:aws:sns:us-east-1:12345:alarm-mycustomservice']

In the above, only alarm_actions and ok_actions will be overridden, letting you override them on any ELB that needs a non-default value.

For the 2015.2 release we’ve only added support for boto_elb and boto_asg, but we plan on adding support for a number of the other boto_* modules in future releases.

Truly ordered execution using SaltStack (Part 2)

A while back I wrote a post about sequentially ordered SaltStack execution. 2014.7 (Helium) has been released, and the listen/listen_in feature I described is now generally available. I've been using Salt in a sequentially ordered manner for about six months now, and I've picked up some other patterns along the way. In particular, there are a couple of gotchas to watch out for: includes and Jinja.

Includes imply a requirement between modules. Requirements can modify ordering, so it’s important to be strict about how you handle them. For example, when reading the following, remember that include implies require:

top.sls:

base:
  '*':
    - base
    - service

service.sls:

include:
  - apache
  - redis

apache.sls:

include:
  - vhost

So, the order you're getting here is: base, vhost, apache, redis, service. If you keep in mind that this is the behavior, it isn't actually difficult to follow when you're reading the files. Because of this, I recommend always placing the include statement at the top of an sls file if you're going to mix it with other states. This is important, because this won't do what you assume:

Ensure apache is installed:
  pkg.installed:
    - name: apache2

include:
  - vhost

Ensure apache is started:
  service.running:
    - name: apache2

In the above, if the vhost sls file requires the apache2 package to be installed before it can run, it'll fail. Even though the include is placed after the package state, it's still evaluated before it, because the included vhost states are a requirement of the current sls file.

What can we do if we need this behavior, then? Jinja to the rescue:

Ensure apache is installed:
  pkg.installed:
    - name: apache2

{% include 'vhost.sls' %}

Ensure apache is started:
  service.running:
    - name: apache2

When including via Jinja, the file is simply inlined in the current sls file's context: the contents of vhost.sls are placed at that exact location. This happens before further Jinja evaluation as well, so it's a completely safe way to handle this situation. Of course, it would be nice for Salt to have a native way of handling this, so I have an open issue for it.

Note that the syntax for the Jinja include is different. Rather than using . for separation, Jinja includes use / and the .sls extension is required. Like state includes, the Jinja includes start from the file root, so apache.vhost would be “apache/vhost.sls”.
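
As a quick illustration, here are the two include styles side by side for the same hypothetical apache.vhost file:

include:
  - apache.vhost

{% include 'apache/vhost.sls' %}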

Jinja itself is something to pay attention to as well. Jinja is always evaluated before the states execute. This is useful in a number of ways (for instance, conditionally including or excluding large portions of code, or generating a bunch of states in a loop), but it's also a bit confusing when you're thinking about ordering. For instance, this won't work:

Ensure myelb exists:
  boto_elb.present:
    - name: myelb
    - availability_zones:
        - us-east-1a
    - listeners:
        - elb_port: 80
          instance_port: 80
          elb_protocol: HTTP
        - elb_port: 443
          instance_port: 80
          elb_protocol: HTTPS
          instance_protocol: HTTP
          certificate: 'arn:aws:iam::123456:server-certificate/mycert'
    - health_check:
        target: 'TCP:80'
    - profile: myprofile

{% set elb = salt['boto_elb.get_elb_config']('myelb', profile='myprofile') %}

Ensure myrecord.example.com cname points at ELB:
  boto_route53.present:
    - name: myrecord.example.com.
    - zone: example.com.
    - type: CNAME
    - value: {{ elb.dns_name }}
    - profile: myprofile

When you read this in order, it looks completely logical. However, the Jinja is always going to be executed before the states, so the elb variable is going to be set to None, then the ELB will be created, then the route53 record will fail to be created.

This is a contrived example, since the boto_elb.present function can handle the route53 record on your behalf, but it illustrates an issue you'll need to watch out for. Always remember that Jinja executes first, and protect against it.
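
For reference, here's roughly how you'd let boto_elb.present manage the record itself. This is just a sketch; I'm assuming the cnames argument here, so check the state module docs for the exact form:

Ensure myelb exists:
  boto_elb.present:
    - name: myelb
    - availability_zones:
        - us-east-1a
    - listeners:
        - elb_port: 80
          instance_port: 80
          elb_protocol: HTTP
    - health_check:
        target: 'TCP:80'
    - cnames:
        - name: myrecord.example.com.
          zone: example.com.
    - profile: myprofile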

Using Lua in Nginx for unique request IDs and millisecond times in logs

Nginx is awesome, but it's missing some common features. For instance, a common thing to add to access logs is a unique ID per request, so that you can track the flow of a single request through multiple services. Another thing it's missing is the ability to log request_time in milliseconds, rather than in seconds with millisecond granularity. Using Lua, we can add these features ourselves.

I’ll show the whole solution, then I’ll break it down into parts:

http {
...

        map $host $request_time_ms {
            default '';
        }
        map $host $uuid {
            default '';
        }

        lua_package_path '/etc/nginx/uuid4.lua';
        init_by_lua '
            uuid4 = require "uuid4"
            math = require "math"
        ';

        log_by_lua '
          ngx.var.request_time_ms = math.floor(tonumber(ngx.var.request_time) * 1000)
        ';

        log_format mycustomformat '[$time_local] "$request" $status $request_length $bytes_sent $request_time_ms $uuid';
        access_log /var/log/nginx/access.log mycustomformat;

...
}

server {
...

  set_by_lua $uuid '
    if ngx.var.http_x_request_id == nil then
        return uuid4.getUUID()
    else
        return ngx.var.http_x_request_id
    end
  ';

...
}

It's necessary to define variables before we use them in Lua. Using map is a trick to define variables in the http context (you can't use set $variable '' in http). In the case of $uuid, we're going to set it in the server section (during the rewrite context), but we want to avoid throwing errors in case it isn't set. Here's how we define these variables:

        map $host $request_time_ms {
            default '';
        }
        map $host $uuid {
            default '';
        }

Next we add the uuid4 library to our Lua package path, and load the libraries into our context:

        lua_package_path '/etc/nginx/uuid4.lua';
        init_by_lua '
            uuid4 = require "uuid4"
            math = require "math"
        ';

Using the log_by_lua directive, we set the request_time_ms variable that we'll use in the log_format config. This Lua code runs in the log context, before logs are written, which makes the variable available to the access log:

        log_by_lua '
            ngx.var.request_time_ms = math.floor(tonumber(ngx.var.request_time) * 1000)
        ';

Next we set the log format, and use it for the access log:

        log_format mycustomformat '[$time_local] "$request" $status $request_length $bytes_sent $request_time_ms $uuid';
        access_log /var/log/nginx/access.log mycustomformat;

Lastly, we set the uuid during the rewrite context in the server section, using set_by_lua. To facilitate following a request across services, we’ll reuse the header if it’s already set. If the header isn’t set, then this request didn’t come from another service, so we’ll generate a UUID:

server {
...

  set_by_lua $uuid '
    if ngx.var.http_x_request_id == nil then
        return uuid4.getUUID()
    else
        return ngx.var.http_x_request_id
    end
  ';

...
}

If you're trusting this header data in any way, be sure to filter or restrict the header appropriately so that clients can't set it themselves.
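
For example, if this nginx is also the tier proxying to your backends, one option is to unconditionally overwrite the inbound header when proxying (a sketch; the upstream name is hypothetical):

  location / {
    # Always send our own $uuid upstream, replacing any X-Request-Id the
    # client may have sent.
    proxy_set_header X-Request-Id $uuid;
    proxy_pass http://backend;
  }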

Update (Thursday December 11 2014): Edited the post to move the uuid generation into the server section using set_by_lua, so that the uuid can be read from or written to the header and flow through the stack properly. Shout out to Asher Feldman for working out a better solution with me.

Reloading grains and pillars during a SaltStack run

If you use the grain/state pattern a lot, or if you use external pillars, you've probably stumbled upon a limitation with grains and pillars.

During a Salt run, if you set a grain or update an external pillar, the change won't be reflected in the in-memory grains and pillar dictionaries. This is because you've updated the data, but it hasn't been reloaded into the in-memory structures that Salt creates at the beginning of the run. From a performance point of view this is good, since reloading grains, and especially loading external pillars, is quite slow.

To fix this limitation, I submitted a PR adding the global state arguments reload_grains and reload_pillar. These work similarly to reload_modules (and in fact, they imply reload_modules).

For example, if you’re using the etcd external pillar, the following will now work:

Ensure etcd key exists for host:
  module.run:
    - name: etcd.set
    - key: /myorg/servers/myhost
    - value: {{ grains['ip_interfaces']['eth0'][0] }}
    - profile: etcd_profile
    - reload_pillar: True

Ensure example file has pillar contents:
  file.managed:
    - name: /tmp/test.txt
    - contents_pillar: servers:myhost

Note, though, that Jinja in sls files is executed before the states are executed, so this will still not work:

Ensure etcd key exists for host:
  module.run:
    - name: etcd.set
    - key: /myorg/servers/myhost
    - value: {{ grains['ip_interfaces']['eth0'][0] }}
    - profile: etcd_profile
    - reload_pillar: True

Ensure example file has pillar contents:
  file.managed:
    - name: /tmp/test.txt
    - contents: {{ pillar.servers.myhost }}

Jinja used in template files, on the other hand, is rendered when the corresponding state executes, so that will work without issue.
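
For example, moving the pillar lookup into a managed template works (a sketch; the source path and template file are hypothetical):

Ensure example file has pillar contents:
  file.managed:
    - name: /tmp/test.txt
    - source: salt://example/files/test.txt.jinja
    - template: jinja

where example/files/test.txt.jinja contains {{ salt['pillar.get']('servers:myhost') }}.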

This change is merged into the develop branch of Salt, and will be included in the first stable release of 2015 (Lithium).

Config Management Code Reuse isn’t Always the Right Approach

Have you ever looked at an upstream DSL module and thought: "this is exactly what I need!"? If you're supporting multiple distros, multiple releases of those distros, and/or multiple operating systems, you might say this occasionally. You might also say it if you have a single ops group that handles all of your infrastructure.

I've rarely been happy with upstream modules, even in shops with a single ops group. They are more complex than I want and are always missing something I need. Why is this?

  1. They need to be abstract enough to support multiple distros, multiple releases of those distros and often multiple operating systems.
  2. Due to the above support they usually need to support multiple versions of the application being wrapped.
  3. They need to support every configuration option available and since they support multiple versions of the application they also need to support options that may or may not be valid.
  4. Thanks to all of the above, they need to be configured through variable data (hiera/pillars/variables), turning the variable data into yet another mini-DSL, one that is usually poorly constructed and completely undocumented.
  5. Their code is riddled with conditionals and numerous ugly templates.

Let's consider Apache, the eternal poster child for overcomplication and underwhelming feature support in DSL modules. I've never seen a usable upstream Apache DSL module. Here's a common set of tasks needed to configure Apache:

  1. Install Apache from a package.
  2. Enable some modules.
  3. Add some configuration to a vhost.
  4. Enable the vhost.
  5. Disable the default vhost.
  6. Restart (and manage) the Apache service.

This is simple, and handling just these tasks with an abstract module is pretty easy, assuming it just takes a template for the vhost. The abstract module starts getting difficult when you consider how configurable Apache is and how different Apache configs look between applications. The module is going to get complex really quickly.

Now consider that you have 10 teams working on 10 different services (a low estimate, assuming SOA or microservices), all of which use this module. If one of those teams needs to modify the module, they're also affecting every other team, so every change will need to be coordinated across teams. If the module is upstream, you also need to consider every other user of the module, or you may need to fork it. You may even need to fork it internally.

Why bother with the abstraction? It's simpler, quicker, and requires less coordination between teams to let each team manage its own basic Apache config. When the config needs to get more complex for one team, it doesn't introduce complexity for the other teams. It also means each team can go with the simplest possible implementation. You may end up with ten copies of roughly the same code, but they'll be ten simple, maintainable copies versus one complex, difficult-to-change copy.
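
For reference, here's roughly what one team's simple, non-abstract version might look like as a Salt state (a minimal sketch; the vhost name, module, and paths are illustrative and Debian-flavored):

Ensure apache is installed:
  pkg.installed:
    - name: apache2

Ensure rewrite module is enabled:
  cmd.run:
    - name: a2enmod rewrite
    - unless: test -e /etc/apache2/mods-enabled/rewrite.load
    - require:
      - pkg: apache2

Ensure myservice vhost is present:
  file.managed:
    - name: /etc/apache2/sites-available/myservice.conf
    - source: salt://myservice/files/vhost.conf
    - require:
      - pkg: apache2

Ensure myservice vhost is enabled:
  file.symlink:
    - name: /etc/apache2/sites-enabled/myservice.conf
    - target: /etc/apache2/sites-available/myservice.conf

Ensure default vhost is disabled:
  file.absent:
    - name: /etc/apache2/sites-enabled/000-default.conf

Ensure apache is running:
  service.running:
    - name: apache2
    - watch:
      - file: /etc/apache2/sites-available/myservice.conf
      - file: /etc/apache2/sites-enabled/myservice.conf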

The heart of this is that many DSL implementations aren’t well equipped for reuse. If you really want to write something reusable, find the complexities of your DSL implementation, and implement them as features of your configuration management system. It’s a lot easier to write reusable abstract code in a real language.

The title of this post says that reuse isn't always the right approach, but I don't want to imply it's never the right approach. It's certainly possible to write clean, abstract, reusable DSL code. At some point, when there are a bunch of copies of the same code, you'll notice patterns, and that's a good time to start looking at abstraction. For simple modules that don't have a lot of configuration, it's likely to be genuinely useful upstream. Salt Formulas, Puppet Forge, and Ansible Galaxy all have good examples of reusable modules and you should obviously consider them, but you should also consider whether you can maintain your own much simpler version instead.

SaltStack Development: Behavior of Exceptions in Modules

The SaltStack developer docs are missing information about the exceptions that can be thrown and how the state system and the CLI behave when they are thrown.

Thankfully this is easy to test and is actually a pretty good development exercise. So, let’s write an execution module, a state module, and an sls file, then run them to determine the behavior.

A simple example execution module

modules/modules/example.py:

from salt.exceptions import CommandExecutionError

def example(name):
    if name == 'succeed':
        return True
    elif name == 'fail':
        return False
    else:
        raise CommandExecutionError('Example function failed due to unexpected input.')

A simple example state module

modules/states/example.py:

def present(name):
    ret = {'name': name, 'result': True, 'comment': '', 'changes': {}}
    if not __salt__['example.example'](name):
        ret['result'] = False
    return ret

In the above we're calling the execution module via the __salt__ dunder dictionary, which is a convenient way to call execution modules without needing to know their import path.

Also in the above, we're using the return format that Salt expects from state modules: a dict with name, result, comment, and changes keys.

A simple example sls file

example.sls:

Test fail behavior:
  example.present:
    - name: fail

Test error behavior:
  example.present:
    - name: error

Test succeed behavior:
  example.present:
    - name: succeed

Testing the behavior

State run behavior

# salt-call --retcode-passthrough --file-root . -m modules state.sls example
local:
----------
          ID: Test fail behavior
    Function: example.present
        Name: fail
      Result: False
     Comment:
     Started: 05:22:16.220802
     Duration: 0 ms
     Changes:
----------
          ID: Test error behavior
    Function: example.present
        Name: error
      Result: False
     Comment: An exception occurred in this state: Traceback (most recent call last):
                File "/srv/salt/venv/src/salt/salt/state.py", line 1518, in call
                  **cdata['kwargs'])
                File "/root/modules/states/example.py", line 3, in present
                  if not __salt__['example.example'](name):
                File "/root/modules/modules/example.py", line 9, in example
                  raise CommandExecutionError('Example function failed due to unexpected input.')
              CommandExecutionError: Example function failed due to unexpected input.
     Started: 05:22:16.221730
     Duration: 1 ms
     Changes:
----------
          ID: Test succeed behavior
    Function: example.present
        Name: succeed
      Result: True
     Comment:
     Started: 05:22:16.223336
     Duration: 0 ms
     Changes:

Summary
------------
Succeeded: 1
Failed:    2
------------
Total states run:     3

# echo $?
2

In the above I'm executing salt-call with a couple of options: I'm including the modules and sls files explicitly from my relative path ('-m modules' and '--file-root .'). I do this for convenience, and to be completely sure that my code is being loaded from exactly where I expect.

The state run behavior isn't surprising. The exception is passed through to the output when there's a legitimate error; otherwise, the False return value indicates a normal state failure. The return code is non-zero when using --retcode-passthrough as well.

CLI behavior

# salt-call -m modules example.example 'succeed'
local:
    True

# echo $?
0

# salt-call -m modules example.example 'fail'
local:
    False

# echo $?
0

# salt-call -m modules example.example 'error'
Error running 'example.example': Example function failed due to unexpected input.

# echo $?
1

When the execution module's function returns successfully (with either True or False), Salt prints the result to stdout and returns a zero exit code. When the function throws an exception, Salt prints an error to stderr and returns a non-zero exit code.

Dealing with splunkforwarder via Config Management

The splunkforwarder package is very poorly written, at least for Debian/Ubuntu. There are a number of things it does that make it difficult to use:

  1. It installs a splunk user and group, but doesn’t install them as system users/groups, so they’ll conflict with your uids/gids.
  2. It requires manual interaction the first time you start the daemon, on every single system it’s installed on.
  3. It modifies its configuration files when the daemon restarts.

The first is an honest mistake, but the last two put me into a blind rage. There isn't great documentation on how to work around this, so to save other opsen from the same rage, here's how you can handle this shitty package:

  1. Add a splunk user/group as a system user before the package is installed.
  2. Install the package.
  3. Replace the init script (I have one you can use below).
  4. Use a consistent hash for the SSL password. This won’t completely avoid the rewriting of the config files, but it’ll avoid the more common case where a rewrite will occur.
  5. Start the daemon.
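
In Salt terms, those steps look roughly like this (a minimal sketch; the init script source path is hypothetical, and steps 3 and 4 are covered in more detail below):

Ensure splunk group exists:
  group.present:
    - name: splunk
    - system: True

Ensure splunk user exists:
  user.present:
    - name: splunk
    - gid: splunk
    - system: True
    - require:
      - group: splunk

Ensure splunkforwarder is installed:
  pkg.installed:
    - name: splunkforwarder
    - require:
      - user: splunk

Ensure splunk init script is replaced:
  file.managed:
    - name: /etc/init.d/splunk
    - source: salt://splunk/files/splunk.init
    - mode: 755
    - require:
      - pkg: splunkforwarder

Ensure splunk is running:
  service.running:
    - name: splunk
    - require:
      - file: /etc/init.d/splunk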

Replacing the init script

Any Splunk command you run for the first time will require that you accept a license. If you're using config management, it's likely going to run 'service status' before it does 'service start' or 'service restart'. This means it's necessary to add '--no-prompt --answer-yes --accept-license' to each relevant command. Just to be safe, let's always pass these flags.

#!/bin/sh
#
# /etc/init.d/splunk
# init script for Splunk.
# generated by 'splunk enable boot-start'.
#
### BEGIN INIT INFO
# Provides: splunkd
# Required-Start: $remote_fs
# Required-Stop: $remote_fs
# Default-Start: 2 3 4 5
# Default-Stop: 0 1 6
# Short-Description: Start splunk
# Description: Splunk indexer service
### END INIT INFO
#
RETVAL=0

splunk_start() {
  echo Starting Splunk...
  "/opt/splunkforwarder/bin/splunk" start --no-prompt --answer-yes --accept-license
  RETVAL=$?
}
splunk_stop() {
  echo Stopping Splunk...
  "/opt/splunkforwarder/bin/splunk" stop --no-prompt --answer-yes --accept-license
  RETVAL=$?
}
splunk_restart() {
  echo Restarting Splunk...
  "/opt/splunkforwarder/bin/splunk" restart --no-prompt --answer-yes --accept-license
  RETVAL=$?
}
splunk_status() {
  echo Splunk status:
  "/opt/splunkforwarder/bin/splunk" status --no-prompt --answer-yes --accept-license
  RETVAL=$?
}
case "$1" in
  start)
    splunk_start
    ;;
  stop)
    splunk_stop
    ;;
  restart)
    splunk_restart
    ;;
  status)
    splunk_status
    ;;
  *)
    echo "Usage: $0 {start|stop|restart|status}"
    exit 1
    ;;
esac

exit $RETVAL

Use a consistent hash for the SSL password

This is insanely annoying. The SSL password is only used by Splunk to access its SSL key. The splunk daemon will convert any clear-text password in its configuration files to a hashed password, based on a secret that is generated per-host. Obviously Splunk is ticking off a government compliance checklist with this stupidity; there is absolutely no security benefit to what Splunk is doing here. Not only does Splunk change this line, it also reorders the file, ensuring there's no sane way to handle this via config management.

Thankfully, it's possible to pre-set the secret used in /opt/splunkforwarder/etc/auth/splunk.secret. On a test node, generate some long secret, place it into that file, then set your cleartext SSL password (which is 'password' by default) and restart Splunk. It'll generate a hash and rewrite your configuration file. Now take that hash and your generated secret and use them in your configuration management. Since you have a stable secret across your nodes and a hash that was generated from it, Splunk won't change it.
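
Once you have a stable secret and the matching hash, managing them looks something like this (a sketch; the outputs.conf location, pillar key, and source path are assumptions from my setup):

Ensure splunk secret is stable:
  file.managed:
    - name: /opt/splunkforwarder/etc/auth/splunk.secret
    - contents_pillar: splunk:secret
    - user: splunk
    - group: splunk
    - mode: 400

Ensure splunk outputs.conf is configured:
  file.managed:
    - name: /opt/splunkforwarder/etc/system/local/outputs.conf
    - source: salt://splunk/files/outputs.conf
    - user: splunk
    - group: splunk
    - mode: 600

where the outputs.conf template contains the pre-hashed sslPassword you generated on the test node.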

Concurrent and Queued SaltStack State Runs

The function "state.highstate" is running as PID 17587 and was started at 2014, Aug 29 23:21:46.540749 with jid 20140829232146540749

Ever get an error like that? Salt doesn't allow more than a single state run to occur at a time, to ensure that multiple state runs can't interfere with each other. This is really important if, for instance, you run highstate on a schedule, since a second highstate might be called before the previous one has finished.

What if you use states as simple function calls, for frequent actions, though? Or what if you’re using Salt for orchestration and want to run multiple salt-calls for a number of different salt state files concurrently?

In the Salt 2014.7 (Helium) release it’s possible to specify that you want a state run to occur concurrently:

# salt-call state.sls mystate concurrent=True

Let's assume you do need to ensure two Salt runs don't occur concurrently, but rather than having the command fail, you want it to wait until the first run finishes. As of the 2014.7 release this is also possible, via the queue argument:

# salt-call state.sls mystate queue=True

As I was writing this post, a thread was started on the salt-users group where this subject was brought up, and it seems that, at the time of this writing, concurrent isn't working properly for Windows minions. Hopefully that will get fixed before the 2014.7 release, but if not, just keep an eye out for Windows support.


Update (September 2, 2014):

As Colton mentions in the comments, using concurrent is dangerous, since Salt's state system isn't designed to be run concurrently. That doesn't mean you can't use concurrent, as long as you use it in a way you know is safe.

I use concurrent for things I know won't conflict, such as a state file that calls a set of execution modules to provide a simple entry point to a complex set of standard actions, like registering or deregistering an instance with an ELB. Concurrent can be really important for this: if a Salt run hangs on a node, you still need to be able to deregister that node.

Another example of where I use concurrent is for the boto_* state modules, for orchestration. I know two services won’t have state definitions that conflict, since they reference completely different infrastructure.
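
To make the first case concrete, here's the sort of small, self-contained state file I mean (a sketch; the sls name is hypothetical, and the grains follow the same convention I use elsewhere):

deregister_elb.sls:

Ensure instance is deregistered from the ELB:
  module.run:
    - name: boto_elb.deregister_instances
    - m_name: {{ grains['cluster_name'] }}
    - instances:
      - {{ grains['ec2_instance-id'] }}
    - profile: myprofile

Which can then be run even while another state run is in progress:

# salt-call state.sls deregister_elb concurrent=True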

That said, the code is marked with a large warning statement, so if you open GitHub issues while using concurrent in a dangerous way, you may be told not to use it.

SaltStack Patterns: Grain/State

It’s occasionally necessary to do actions in configuration management that aren’t easy to define in an idempotent way. For instance, sometimes you need to do an action only the first time your configuration management runs, or you need to fetch some static information from an external source, or you want to put instances in a specific state for a temporary period of time.

In SaltStack (Salt), a common pattern for handling this is what I call the Grain/State pattern. Salt's grains are relatively static, but it's possible to add, update, or delete custom grains during a state run, or outside of a state run, either via salt-call locally or through remote execution. Grains can then be used in conditionals inside state runs to control the state of the system dynamically.

For instance, it's possible to have a command run only on the initial run:

{% if salt['grains.get']('initial_run', True) %}
Ensure instance has been registered with the ELB:
  module.run:
    - name: boto_elb.register_instances
    - m_name: {{ grains['cluster_name'] }}
    - instances:
      - {{ grains['ec2_instance-id'] }}

Ensure initial_run grain is set:
  grains.present:
    - name: initial_run
    - value: False
{% endif %}

If you wanted to cache an external call that always returns static data, it’s possible to use Jinja and the grains state module:

{% if not salt['grains.get']('static_auth_endpoint', False) %}
{% set static_auth_endpoint = salt['my_fake_module.get_auth_endpoint'](grains['cluster_name']) %}
Ensure static_auth_endpoint grain is set:
  grains.present:
    - name: static_auth_endpoint
    - value: {{ static_auth_endpoint }}
{% endif %}

This is obviously a contrived example and it’s likely better to store this information in a pillar or an external pillar, but it shows how it’s possible to use this pattern when necessary.

If you wanted to set grains using salt-call or remote execution to modify your state run behavior, see the highstate killswitch example I recently posted.
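
For example, flipping a grain from the command line, either locally or via remote execution, is just:

# salt-call grains.setval highstate_disabled True
# salt '*' grains.setval highstate_disabled True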

This pattern offers a lot of flexibility, especially for caching expensive calls that would otherwise run on every execution if implemented as Python-based grains, or the equivalent in other systems.

This pattern can also be combined with another Salt pattern I call the Returner/External Pillar model. I’ll cover that in a followup post in the future.

A SaltStack Highstate Killswitch

On rare occasions it's necessary to debug a condition by making temporary changes to a running system. If you're using config management, especially as part of your deployment process, it's necessary to disable it so that your temporary changes won't be reset. salt-call doesn't natively have a mechanism for this like Puppet does (puppet agent --disable; puppet agent --enable). It's possible to do this yourself, though.

This requires that you're using the failhard option in your configuration and that you're on the 2014.7 (Helium) release or above. It also assumes you have some base state that is always included, and always included first.

First add two states:

highstate/enable.sls:

Ensure highstate is enabled:
  grains.present:
    - name: highstate_disabled
    - value: False

highstate/disable.sls:

Ensure highstate is disabled:
  grains.present:
    - name: highstate_disabled
    - value: True

In your base state:

{% if salt['grains.get']('highstate_disabled', False) %}
Exit if highstate is disabled:
  test.fail_without_changes:
    - name: "Salt highstate is disabled. To re-enable, call 'salt-call state.sls highstate.enable'."
{% endif %}

Now, let’s test it!

# salt-call state.sls highstate.disable
local:
  Name: highstate_disabled - Function: grains.present - Result: Changed

Summary
------------
Succeeded: 1 (changed=1)
Failed:    0
------------
Total states run:     1

# salt-call state.highstate
local:
----------
          ID: Exit if highstate is disabled
    Function: test.fail_without_changes
        Name: Salt highstate is disabled. To re-enable, call 'salt-call state.sls highstate.enable'.
      Result: False
     Comment: Failure!
     Started: 00:42:20.384504
     Duration: 1 ms
     Changes:

Summary
------------
Succeeded: 0
Failed:    1
------------
Total states run:     1

# salt-call state.sls highstate.enable
local:
  Name: highstate_disabled - Function: grains.present - Result: Changed

Summary
------------
Succeeded: 1 (changed=1)
Failed:    0
------------
Total states run:     1

# salt-call state.highstate
local:

Summary
------------
Succeeded: 0
Failed:    0
------------
Total states run:     0

Note in the above that I'm using a couple of settings that make my output look different from the default:

# Show terse output for successful states and full output for failures.
state_output: mixed

# Only show changes
state_verbose: False

So, a run that doesn't change anything will show that it didn't run anything; 'Total states run: 0' is a successful highstate run in that situation.

Something to note about this solution: it only disables highstate, and it won't disable state.sls, state.template, or other state calls unless the base state is included in them. It obviously won't disable any other module calls either.