SaltStack: Automated CloudWatch alarm management for AWS resources

For the Salt 2014.7 release, we (Lyft) upstreamed a number of Salt execution and state modules for AWS. These modules manage various AWS resources. For most of the resources you create, you’ll probably want CloudWatch alarms to go along with them. It’s not really difficult to do:

Ensure myelb exists:
  boto_elb.present:
    - name: myelb
    - subnets:
        - subnet-12345
        - subnet-12346
        - subnet-12347
    - security_groups:
        - elb
    - listeners:
        - elb_port: 443
          instance_port: 80
          elb_protocol: HTTPS
          instance_protocol: HTTP
          certificate: 'arn:aws:iam::1234:server-certificate/mycert'
    - health_check:
        target: 'HTTP:80/'
    - attributes:
        cross_zone_load_balancing:
          enabled: true
        connection_draining:
          enabled: true
          timeout: 30
    - profile: myprofile

Ensure myelb ELB UnHealthyHostCount alarm exists:
  boto_cloudwatch_alarm.present:
    - name: 'myelb ELB UnHealthyHostCount **MANAGED BY SALT**'
    - attributes:
        metric: UnHealthyHostCount
        namespace: AWS/ELB
        statistic: Average
        comparison: '>='
        threshold: 1.0
        period: 600
        evaluation_periods: 6
        unit: null
        description: myelb ELB UnHealthyHostCount >= 1
        dimensions:
          LoadBalancerName: [myelb]
        alarm_actions: ['arn:aws:sns:us-east-1:12345:alarm-myservice']
        ok_actions: ['arn:aws:sns:us-east-1:12345:alarm-myservice']
        insufficient_data_actions: []
    - profile: myprofile

It’s not difficult, but there are a few problems here:

  1. It’s really verbose.
  2. You have to remember to do it for every resource.
  3. For most (or all) of your ELBs, you’ll probably want the exact same alarm.

We believe that it should be easy to do the right thing and difficult to do the wrong thing. In SaltStack 2015.2 (in feature freeze at the time of this writing) we’ve made it possible for some resources to manage their own alarms. Even better, we’ve also made it possible to define these alarms through pillars. For each resource type we use a default pillar key, so you can define an alarm once and have it apply automatically to all resources of that type.

For example, here’s how to define an ELB resource to manage its alarm:

Ensure myelb exists:
  boto_elb.present:
    - name: myelb
    - subnets:
        - subnet-12345
        - subnet-12346
        - subnet-12347
    - security_groups:
        - elb
    - listeners:
        - elb_port: 443
          instance_port: 80
          elb_protocol: HTTPS
          instance_protocol: HTTP
          certificate: 'arn:aws:iam::1234:server-certificate/mycert'
    - health_check:
        target: 'HTTP:80/'
    - attributes:
        cross_zone_load_balancing:
          enabled: true
        connection_draining:
          enabled: true
          timeout: 30
    - alarms:
        UnHealthyHostCount:
          name: 'ELB UnHealthyHostCount **MANAGED BY SALT**'
          attributes:
            metric: UnHealthyHostCount
            namespace: AWS/ELB
            statistic: Average
            comparison: '>='
            threshold: 1.0
            period: 600
            evaluation_periods: 6
            unit: null
            description: ELB UnHealthyHostCount >= 1
            alarm_actions: ['arn:aws:sns:us-east-1:12345:alarm-myservice']
            ok_actions: ['arn:aws:sns:us-east-1:12345:alarm-myservice']
            insufficient_data_actions: []
    - profile: myprofile

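The ELB name is prepended to the alarm name to ensure uniqueness, so the alarm above is created as ‘myelb ELB UnHealthyHostCount **MANAGED BY SALT**’. The dimensions are also left out of the definition, since the alarm is scoped to the ELB that defines it.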

If we want the same alarm to apply to multiple ELBs, we can move it to a pillar:

my_elb_alarms:
  UnHealthyHostCount:
    name: 'ELB UnHealthyHostCount **MANAGED BY SALT**'
    attributes:
      metric: UnHealthyHostCount
      namespace: AWS/ELB
      statistic: Average
      comparison: '>='
      threshold: 1.0
      period: 600
      evaluation_periods: 6
      unit: null
      description: ELB UnHealthyHostCount >= 1
      alarm_actions: ['arn:aws:sns:us-east-1:12345:alarm-myservice']
      ok_actions: ['arn:aws:sns:us-east-1:12345:alarm-myservice']
      insufficient_data_actions: []

Then we can define the ELB resource like so:

Ensure myelb exists:
  boto_elb.present:
    - name: myelb
    - subnets:
        - subnet-12345
        - subnet-12346
        - subnet-12347
    - security_groups:
        - elb
    - listeners:
        - elb_port: 443
          instance_port: 80
          elb_protocol: HTTPS
          instance_protocol: HTTP
          certificate: 'arn:aws:iam::1234:server-certificate/mycert'
    - health_check:
        target: 'HTTP:80/'
    - attributes:
        cross_zone_load_balancing:
          enabled: true
        connection_draining:
          enabled: true
          timeout: 30
    - alarms_from_pillar: my_elb_alarms
    - profile: myprofile

boto_elb_alarms is a special pillar key. If it’s defined, all boto_elb resources will use it. So, if we change the pillar key to that, we can remove the ‘alarms_from_pillar’ line from the ELB resource and the alarm will still apply.
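
As a sketch of that change, the pillar from above is renamed to the default key and the ELB state drops its pillar reference. The values are the same illustrative ones used throughout this post; only the key name differs.

```yaml
# Pillar: same alarm as before, now stored under the default key
# that boto_elb resources look up on their own.
boto_elb_alarms:
  UnHealthyHostCount:
    name: 'ELB UnHealthyHostCount **MANAGED BY SALT**'
    attributes:
      metric: UnHealthyHostCount
      namespace: AWS/ELB
      statistic: Average
      comparison: '>='
      threshold: 1.0
      period: 600
      evaluation_periods: 6
      unit: null
      description: ELB UnHealthyHostCount >= 1
      alarm_actions: ['arn:aws:sns:us-east-1:12345:alarm-myservice']
      ok_actions: ['arn:aws:sns:us-east-1:12345:alarm-myservice']
      insufficient_data_actions: []
```

```yaml
# State: no alarm configuration on the resource itself;
# the alarm is picked up from the default pillar key.
Ensure myelb exists:
  boto_elb.present:
    - name: myelb
    - subnets:
        - subnet-12345
        - subnet-12346
        - subnet-12347
    - security_groups:
        - elb
    - listeners:
        - elb_port: 443
          instance_port: 80
          elb_protocol: HTTPS
          instance_protocol: HTTP
          certificate: 'arn:aws:iam::1234:server-certificate/mycert'
    - health_check:
        target: 'HTTP:80/'
    - attributes:
        cross_zone_load_balancing:
          enabled: true
        connection_draining:
          enabled: true
          timeout: 30
    - profile: myprofile
```

Every other ELB defined this way gets the same alarm without any extra lines in its state.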

Notice that alarm_actions and ok_actions in the above alarm specify ARNs for a specific service. If you’re defining alarms for multiple services, those ARNs need to be specific to each service.

Our orchestration is a bit custom: we call a set of states for each service, passing in environment variables that set custom grains. We use those grains in the pillar-defined alarms, so that we can set the service name generically in the pillars. However, you don’t need our custom implementation to make this work. It’s possible to override values in the alarms:

Ensure myelb exists:
  boto_elb.present:
    - name: myelb
    - subnets:
        - subnet-12345
        - subnet-12346
        - subnet-12347
    - security_groups:
        - elb
    - listeners:
        - elb_port: 443
          instance_port: 80
          elb_protocol: HTTPS
          instance_protocol: HTTP
          certificate: 'arn:aws:iam::1234:server-certificate/mycert'
    - health_check:
        target: 'HTTP:80/'
    - attributes:
        cross_zone_load_balancing:
          enabled: true
        connection_draining:
          enabled: true
          timeout: 30
    - alarms:
        UnHealthyHostCount:
          attributes:
            alarm_actions: ['arn:aws:sns:us-east-1:12345:alarm-mycustomservice']
            ok_actions: ['arn:aws:sns:us-east-1:12345:alarm-mycustomservice']

In the above, only alarm_actions and ok_actions are overridden; every other value still comes from the pillar-defined alarm. This lets you set non-default values on just the ELBs that need them.
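
If you would rather keep the service name out of the states entirely, the grain-based approach described earlier can be approximated with Jinja in the pillar file. This is only a sketch, not our actual implementation: `service_name` is a hypothetical custom grain, and the account ID, region, and `alarm-<service>` topic naming simply follow the examples in this post.

```yaml
# Pillar file, rendered with Jinja before it is parsed as YAML.
# 'service_name' is a hypothetical custom grain; 'myservice' is the fallback
# used when the grain is not set.
{% set service = grains.get('service_name', 'myservice') %}
{% set topic = 'arn:aws:sns:us-east-1:12345:alarm-' ~ service %}

boto_elb_alarms:
  UnHealthyHostCount:
    name: 'ELB UnHealthyHostCount **MANAGED BY SALT**'
    attributes:
      metric: UnHealthyHostCount
      namespace: AWS/ELB
      statistic: Average
      comparison: '>='
      threshold: 1.0
      period: 600
      evaluation_periods: 6
      unit: null
      description: ELB UnHealthyHostCount >= 1
      # Both actions notify the per-service SNS topic built above.
      alarm_actions: ['{{ topic }}']
      ok_actions: ['{{ topic }}']
      insufficient_data_actions: []
```

With something like this in place, per-ELB overrides are only needed for the exceptions, such as an ELB that should notify a different topic than the rest of its service.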

For the 2015.2 release we’ve only added support for boto_elb and boto_asg, but we plan on adding support for a number of the other boto_* modules in future releases.
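
For autoscaling groups, a default pillar might look something like the sketch below. It assumes the default key follows the same naming pattern as the ELB one (`boto_asg_alarms`), so check the boto_asg state documentation for your release. The metric, threshold, and periods are illustrative values, not recommendations.

```yaml
# Pillar: assumed default key for boto_asg resources (by analogy with
# boto_elb_alarms). All values below are illustrative.
boto_asg_alarms:
  CPU:
    name: 'ASG CPU **MANAGED BY SALT**'
    attributes:
      metric: CPUUtilization
      namespace: AWS/EC2
      statistic: Average
      comparison: '>='
      threshold: 65.0
      period: 60
      evaluation_periods: 30
      unit: null
      description: ASG CPU >= 65
      alarm_actions: ['arn:aws:sns:us-east-1:12345:alarm-myservice']
      ok_actions: ['arn:aws:sns:us-east-1:12345:alarm-myservice']
      insufficient_data_actions: []
```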