SaltStack AWS Orchestration and Masterless Bootstrapping

In my last post, I mentioned that we’re using SaltStack (Salt) without a master. Without a master, how are we bootstrapping our instances? How are we updating the code that’s managing the instances? For this, we’re using Python virtualenvs, S3, autoscaling groups with IAM roles, cloud-init, and an artifact-based deployer that stores artifacts in S3 and pulls them onto the instances. Let’s start with how we’re creating the AWS resources.

Orchestration

We’re using Salt for orchestration. A while ago I wrote some custom code for environment provisioning that started with creating MongoDB databases and Heroku applications and later added management of AWS resources. I spent a few weeks turning our custom code into state and execution modules for Salt. We’re now using Salt states for orchestration of AWS resources, including boto_secgroup, boto_iam_role, boto_elb, boto_sqs, boto_asg and boto_cloudwatch_alarm.

Through these states we create all of the resources for a service and environment. Here’s an example of a simple web application:

Ensure myapp security group exists:
  boto_secgroup.present:
    - name: myapp
    - description: myapp security group
    - rules:
      - ip_protocol: tcp
        from_port: 80
        to_port: 80
        source_group_name: amazon-elb-sg
        source_group_owner_id: amazon-elb
    - profile: aws_profile

{% set service_instance = 'testing' %}

Ensure myapp-{{ service_instance }}-useast1 iam role exists:
  boto_iam_role.present:
    - name: myapp-{{ service_instance }}-useast1
    - policies:
        'bootstrap':
          Version: '2012-10-17'
          Statement:
            - Action:
                - 'elasticloadbalancing:DeregisterInstancesFromLoadBalancer'
                - 'elasticloadbalancing:RegisterInstancesWithLoadBalancer'
              Effect: 'Allow'
              Resource: 'arn:aws:elasticloadbalancing:*:*:loadbalancer/myapp-{{ service_instance }}-useast1'
            - Action:
                - 's3:Head*'
                - 's3:Get*'
              Effect: 'Allow'
              Resource:
                - 'arn:aws:s3:::bootstrap/deploy/myapp/*'
            - Action:
                - 's3:List*'
                - 's3:Get*'
              Effect: 'Allow'
              Resource:
                - 'arn:aws:s3:::bootstrap'
              Condition:
                StringLike:
                  's3:prefix':
                    - 'deploy/myapp/*'
            - Action:
                - 'ec2:DescribeTags'
              Effect: 'Allow'
              Resource:
                - '*'
        'myapp-{{ service_instance }}-sqs':
          Version: '2012-10-17'
          Statement:
            - Action:
                - 'sqs:ChangeMessageVisibility'
                - 'sqs:DeleteMessage'
                - 'sqs:GetQueueAttributes'
                - 'sqs:GetQueueUrl'
                - 'sqs:ListQueues'
                - 'sqs:ReceiveMessage'
                - 'sqs:SendMessage'
              Effect: 'Allow'
              Resource:
                - 'arn:aws:sqs:*:*:myapp-{{ service_instance }}-*'
              Sid: 'myapp{{ service_instance }}sqs1'
    - profile: aws_profile

Ensure myapp-{{ service_instance }} security group exists:
  boto_secgroup.present:
    - name: myapp-{{ service_instance }}
    - description: myapp-{{ service_instance }} security group
    - profile: aws_profile

Ensure myapp-{{ service_instance }}-useast1 elb exists:
  boto_elb.present:
    - name: myapp-{{ service_instance }}-useast1
    - availability_zones:
      - us-east-1a
      - us-east-1d
      - us-east-1e
    - listeners:
        - elb_port: 80
          instance_port: 80
          elb_protocol: HTTP
        - elb_port: 443
          instance_port: 80
          elb_protocol: HTTPS
          instance_protocol: HTTP
          certificate: 'arn:aws:iam::12snip34:server-certificate/a-certificate'
    - health_check:
        target: 'HTTP:80/'
    - attributes:
        access_log:
          enabled: true
          s3_bucket_name: 'logs'
          s3_bucket_prefix: 'myapp-{{ service_instance }}-useast1'
          emit_interval: '5'
    - cnames:
      - name: myapp-{{ service_instance }}.example.com.
        zone: example.com.
    - profile: aws_profile

{% for queue in ['example-queue-1', 'example-queue-2'] %}
Ensure myapp-{{ service_instance }}-{{ queue }} sqs queue is present:
  boto_sqs.present:
    - name: myapp-{{ service_instance }}-{{ queue }}
    - profile: aws_profile
{% endfor %}

Ensure myapp-{{ service_instance }}-useast1 asg exists:
  boto_asg.present:
    - name: myapp-{{ service_instance }}-useast1
    - launch_config_name: myapp-{{ service_instance }}-useast1
    - launch_config:
      - image_id: ami-fakeami
      - key_name: example-key
      - security_groups:
        - base
        - myapp
        - myapp-{{ service_instance }}
      - instance_type: c3.large
      - instance_monitoring: true
      - cloud_init:
          scripts:
            salt: |
              #!/bin/bash
              apt-get -y update
              apt-get install -y python-m2crypto python-crypto python-zmq python-pip python-virtualenv python-apt git-core

              wget https://s3.amazonaws.com/bootstrap/salt/bootstrap.tar.gz
              tar -xzvPf bootstrap.tar.gz

              time /srv/pulldeploy/venv/bin/python /srv/pulldeploy/pulldeploy.py myapp {{ service_instance }} -v && salt-call state.sls elb.register
    - availability_zones:
      - us-east-1a
      - us-east-1d
      - us-east-1e
    - suspended_processes:
      - AddToLoadBalancer
    - min_size: 30
    - max_size: 30
    - load_balancers:
      - myapp-{{ service_instance }}-useast1
    - instance_profile_name: myapp-{{ service_instance }}-useast1
    - scaling_policies:
      - name: ScaleDown
        adjustment_type: ChangeInCapacity
        scaling_adjustment: -1
        cooldown: 1800
      - name: ScaleUp
        adjustment_type: ChangeInCapacity
        scaling_adjustment: 5
        cooldown: 1800
    - tags:
      - key: 'Name'
        value: 'myapp-{{ service_instance }}-useast1'
        propagate_at_launch: true
    - profile: aws_profile

autoscale up alarm:
  boto_cloudwatch_alarm.present:
    - name: 'myapp-{{ service_instance }}-useast1-asg-up-CPU-Utilization'
    - attributes:
        metric: CPUUtilization
        namespace: AWS/EC2
        statistic: Average
        comparison: '>='
        threshold: 50.0
        period: 300
        evaluation_periods: 1
        unit: null
        description: ''
        dimensions:
          AutoScalingGroupName:
            - myapp-{{ service_instance }}-useast1
        alarm_actions:
          - 'scaling_policy:myapp-{{ service_instance }}-useast1:ScaleUp'
          - 'arn:aws:sns:us-east-1:12snip34:hipchat-notify'
        insufficient_data_actions: []
        ok_actions: []
    - profile: aws_profile

autoscale down alarm:
  boto_cloudwatch_alarm.present:
    - name: 'myapp-{{ service_instance }}-useast1-asg-down-CPU-Utilization'
    - attributes:
        metric: CPUUtilization
        namespace: AWS/EC2
        statistic: Average
        comparison: '<='
        threshold: 10.0
        period: 300
        evaluation_periods: 1
        unit: null
        description: ''
        dimensions:
          AutoScalingGroupName:
            - myapp-{{ service_instance }}-useast1
        alarm_actions:
          - 'scaling_policy:myapp-{{ service_instance }}-useast1:ScaleDown'
          - 'arn:aws:sns:us-east-1:12snip34:hipchat-notify'
        insufficient_data_actions: []
        ok_actions: []
    - profile: aws_profile

I know this doesn’t look very simple at first, but this configuration is meant for scale. The numbers and instance sizes here are fake and don’t reflect any of our production services; they’re meant as an example, so adjust your configuration to meet your needs. This configuration carries out all of the following actions, in order:

  1. Manages a myapp security group and its rules, meant for blanket rules for this service.
  2. Manages an IAM role, with a number of policies.
  3. Manages a myapp-{{ service_instance }} security group, meant for testing security group rules or per-service_instance rules.
  4. Manages an ELB and the Route53 DNS entries that point at it.
  5. Manages two SQS queues.
  6. Manages an autoscaling group, its associated launch configuration, and its scaling policies.
  7. Manages two CloudWatch alarms that trigger the autoscaling group’s scaling policies.

I say manages for all of those resources because making a change to them is simply a matter of modifying the state then re-running Salt.

From the Salt bootstrapping perspective, #2 and #6 are the key things we’ll be looking at. The IAM role allows the instance to access other AWS resources — in this case, the deploy directory of the bootstrap bucket. The launch configuration portion of the autoscaling group adds a Salt cloud-init script that installs Salt’s dependencies, wgets a tarred relocatable virtualenv for Salt and our deployer, untars it, then runs the deployer.

In the IAM role, autoscaling configuration, and cloud-init we have a special process for managing our ELBs. Our autoscaling groups suspend the AddToLoadBalancer process, so new autoscaled instances won’t immediately be added to the ELB. Instead, in our launch configuration, the instance registers itself with its own ELB after a successful initial Salt run. The IAM policy limits the register and deregister actions to the ELB that belongs to the instance’s own service and service_instance.
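As a rough illustration of that "deploy first, register second" flow, here is a minimal Python sketch. It is not the elb.register state used in the launch configuration (that state isn't shown in this post). The function names are hypothetical, and the ELB client is assumed to be a boto3-style classic ELB client passed in by the caller.

```python
def register_with_elb(elb_client, cluster_name, instance_id):
    """Register one instance with the ELB that shares the cluster's name.

    elb_client is assumed to behave like boto3.client("elb").
    """
    return elb_client.register_instances_with_load_balancer(
        LoadBalancerName=cluster_name,
        Instances=[{"InstanceId": instance_id}],
    )


def bootstrap(run_deployer, elb_client, cluster_name, instance_id):
    """Mirror the cloud-init `deployer && register` chain.

    run_deployer is a hypothetical callable returning an exit code; the
    instance only joins the ELB when that code is 0.
    """
    if run_deployer() != 0:
        return False
    register_with_elb(elb_client, cluster_name, instance_id)
    return True
```

Because the ELB name follows the same convention as the autoscaling group name, the instance needs no extra configuration to find its load balancer.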

Also in the IAM role we grant access only to a service’s particular deployment resources, to limit access across services. We similarly restrict access by the service_instance, where necessary, to restrict access across environments of a service.

Unfortunately, AWS doesn’t let us restrict the ec2:DescribeTags action to specific resources, which is why the policy grants it on '*'. We use autoscaling group tags in the bootstrapping process, which we’ll get to later, when discussing naming conventions.

Instance configuration

When the orchestration is run, the resources are created and the bootstrapping process for the instances starts. This process starts from the launch configuration as described above, which in short is:

  1. Salt’s dependencies are installed.
  2. Salt and the deployer are fetched from S3 via wget. This artifact is public since it’s just Salt and deployer code, neither of which is sensitive. We’ve munged the link to avoid third parties using a Salt version they don’t control.
  3. The deployer is run, and if successful the instance is registered with its ELB.

To properly bootstrap the system it’s necessary for the system to pull down its required artifacts and to build itself based on its service and environment. The deployer starts this process. Its logic works as follows:

  1. Fetch the base and service artifacts.
  2. Create a /srv/base/current link that points at base’s current deployment directory.
  3. Create a /srv/service link that points at the service’s deployment directory.
  4. Create a /srv/service/next link to point at the artifact about to be deployed.
  5. Run pre-release hooks from the service repo.
  6. Run ‘salt-call state.highstate’.
  7. Create a /srv/service/current link to point at the artifact currently deployed.
  8. Run post-release hooks from the service repo.
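The steps above can be sketched roughly as follows. This is a minimal illustration and not our actual deployer: the function names, the hooks/pre-release and hooks/post-release paths, and the directory arguments are all assumptions, and step 1 (fetching artifacts from S3) is assumed to have happened already.

```python
import os
import subprocess


def replace_link(target, link_path):
    """Point link_path at target, swapping any existing symlink via rename."""
    tmp = link_path + ".tmp"
    if os.path.lexists(tmp):
        os.remove(tmp)
    os.symlink(target, tmp)
    os.replace(tmp, link_path)


def run_hook(release_dir, name, run):
    """Run hooks/<name> from the release if present (hypothetical layout)."""
    hook = os.path.join(release_dir, "hooks", name)
    if os.path.exists(hook):
        run([hook])


def deploy(srv_root, base_release, service_dir, release,
           run=subprocess.check_call):
    """Steps 2-8: link base and service, then highstate between next and current."""
    base_dir = os.path.join(srv_root, "base")
    os.makedirs(base_dir, exist_ok=True)
    replace_link(base_release, os.path.join(base_dir, "current"))   # step 2
    service_link = os.path.join(srv_root, "service")
    replace_link(service_dir, service_link)                         # step 3
    replace_link(release, os.path.join(service_link, "next"))       # step 4
    run_hook(release, "pre-release", run)                           # step 5
    run(["salt-call", "state.highstate"])                           # step 6
    replace_link(release, os.path.join(service_link, "current"))    # step 7
    run_hook(release, "post-release", run)                          # step 8
```

With subprocess.check_call as the runner, a failing command raises before the current link is moved, so current keeps pointing at the last release that deployed cleanly.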

We have a standard Salt configuration for all services, which is why we create a /srv/service link. Salt can always point to that location. Specifically, we point at /srv/service/next. In the above logic we run highstate between the creation of the next and current links. Because highstate runs before the current link moves, a release can rely on a system dependency that Salt installs in the same deployment. Here’s our Salt minion configuration:

# For development purposes, always fail if any state fails. This makes it much
# easier to ensure first-runs will succeed.
failhard: True

# Show terse output for successful states and full output for failures.
state_output: mixed

# Only show changes
state_verbose: False

# Show basic information about what Salt is doing during its highstate. Set
# this to critical to disable logging output.
log_level: info

# Never try to connect to a master.
file_client: local
local: True

# Path to the states, files and templates.
file_roots:
  base:
    - /srv/service/next/salt/config/states
    - /srv/base/current/states

# Path to pillar variables.
pillar_roots:
  base:
    - /srv/service/next/salt/config/pillar
    - /srv/base/current/pillar

# Path to custom grain modules
grains_dirs:
  - /srv/base/current/grains

The deployer only handles getting the code artifacts onto the system and running hooks and Salt. Salt itself determines how the system will be configured based on the service and its environment.

We use Salt’s state and pillar top systems with grains to determine how a system will configure itself. Before we go into the top files, though, let’s explain the grains being used.

Standardized resource naming and grains

We name our resources by convention. This allows us to greatly simplify our code, since we can use this convention to refer to resources in orchestration, bootstrapping, IAM policy, and configuration management. The naming convention for our instances is:

service-service_instance-region-service_node.example.com

An example would be:

myapp-testing-useast1-898900.example.com

This hostname is based on the autoscaling group name, which would be:

service-service_instance-region

Or, in this example:

myapp-testing-useast1

During the bootstrapping process, when Salt is run, a custom grain is called that fetches the instance’s autoscaling group name and instance-id, then parses them and returns a number of grains:

  • service_name (myapp)
  • service_instance (testing)
  • service_node (898900)
  • region (useast1)
  • cluster_name (myapp-testing-useast1)
  • service_group (myapp-testing)

At the beginning of the Salt run, Salt ensures a hostname is set, based on the grains. The custom grain always checks to see if there’s a hostname set based on our naming convention. If so, it always parses the hostname and returns grains based on that. We do this to avoid unnecessary boto calls for future Salt runs. Another reason we set a hostname is so that we can use it in monitoring and reporting, to get a human-friendly name for instances.
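A minimal sketch of that parsing logic, under stated assumptions: the function names are hypothetical, each name component is assumed to contain no hyphens, and how service_node is derived from the instance-id isn't covered here, so it is simply passed in. The boto lookup is represented by a callable so the hostname path can skip it.

```python
import re

# Hypothetical pattern for service-service_instance-region-service_node.domain
HOSTNAME_RE = re.compile(
    r"^(?P<service_name>[a-z0-9]+)-(?P<service_instance>[a-z0-9]+)"
    r"-(?P<region>[a-z0-9]+)-(?P<service_node>[a-z0-9]+)\."
)


def build_grains(service_name, service_instance, region, service_node):
    """Assemble the six grains from the four name components."""
    return {
        "service_name": service_name,
        "service_instance": service_instance,
        "region": region,
        "service_node": service_node,
        "cluster_name": "-".join([service_name, service_instance, region]),
        "service_group": "-".join([service_name, service_instance]),
    }


def grains_from_hostname(hostname):
    """Return grains if the hostname follows the convention, else None."""
    match = HOSTNAME_RE.match(hostname)
    if match is None:
        return None
    return build_grains(**match.groupdict())


def grains_from_asg(asg_name, service_node):
    """Parse an autoscaling group name such as myapp-testing-useast1."""
    service_name, service_instance, region = asg_name.split("-")
    return build_grains(service_name, service_instance, region, service_node)


def service_grains(hostname, lookup_asg):
    """Prefer the hostname; call lookup_asg() (the AWS lookup) only if needed."""
    grains = grains_from_hostname(hostname)
    if grains is None:
        asg_name, service_node = lookup_asg()
        grains = grains_from_asg(asg_name, service_node)
    return grains
```

On a first run the default EC2 hostname doesn't match the pattern, so the lookup runs; once Salt has set the conventional hostname, later runs are served from the hostname alone.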

Now, let’s go back into the top files, based on this info.

Top files using grain matching

Here’s an example pillar top file:

base:
  '*':
    - base
    - myapp
    - order: 1
  {% for root in opts['pillar_roots']['base'] -%}
  {% set service_group_sls = '{0}/{1}.sls'.format(root, grains['service_group']) -%}
  {% if salt['file.file_exists'](service_group_sls) %}
  'service_group:{{ grains["service_group"] }}':
    - match: grain
    - {{ grains['service_group'] }}
    - order: 10
  {% endif %}
  {% endfor -%}

The Jinja used here is to include a file if it exists and to ignore it otherwise. By doing this we can avoid editing the top file for most common cases. If a new environment is added, then a developer only needs to add the myapp-new_environment.sls file, and it’ll be automatically included.

We have these included in a specific order, since conflicting keys are overridden in inclusion order. We include them in order of most generic to most specific. In this case, for an instance with myapp-testing as its service group, it’ll include base, then myapp, then myapp-testing pillar files.
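The inclusion logic can be restated in plain Python for readers less familiar with Jinja. This is only a sketch of the idea, with a hypothetical function name. Unlike the template, it stops at the first pillar root that contains the file.

```python
import os


def pillar_includes(pillar_roots, service_name, service_group):
    """Return pillar SLS names from most generic to most specific.

    The service_group entry is added only when <service_group>.sls
    exists in one of the pillar roots.
    """
    includes = ["base", service_name]
    for root in pillar_roots:
        if os.path.isfile(os.path.join(root, service_group + ".sls")):
            includes.append(service_group)
            break
    return includes
```

For an instance in the myapp-testing service group this yields base, myapp, and, if the file exists, myapp-testing.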

For instance, if we were only enabling monitoring in the testing environment, we could set a boolean pillar like so:

myapp.sls:

enable_monitoring: False

myapp-testing.sls:

enable_monitoring: True

This allows us to set generic defaults and override them where needed, so that we can use as few pillars as possible.
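The override behavior can be illustrated with a small merge function. This is a sketch of the idea, not Salt’s own merge code; merge_pillars and the limits key are hypothetical and exist only for the example.

```python
def merge_pillars(*layers):
    """Merge pillar dicts in order; later layers override earlier ones.

    Nested dicts are merged key by key, anything else is replaced.
    """
    merged = {}
    for layer in layers:
        for key, value in layer.items():
            if isinstance(value, dict) and isinstance(merged.get(key), dict):
                merged[key] = merge_pillars(merged[key], value)
            else:
                merged[key] = value
    return merged


# Generic defaults (myapp.sls) and environment overrides (myapp-testing.sls).
myapp = {"enable_monitoring": False, "limits": {"workers": 2, "timeout": 30}}
myapp_testing = {"enable_monitoring": True, "limits": {"workers": 4}}

pillar = merge_pillars(myapp, myapp_testing)
```

Here the testing environment turns monitoring on and changes one nested value, while every other default from myapp.sls is kept.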

Here’s an example of our states top file:

base:
  '*':
    - base
    - order: 1
  'service_name:myapp':
    - match: grain
    - order: 10
    - myapp
  'service_name:myappbatch':
    - match: grain
    - order: 10
    - myapp

In this file we’re including base, then we’re including a service-specific state file. In this specific case both myapp and myappbatch are so similar that we’re including the same state file. For this case our differences are handled by pillars, rather than splitting the code apart.

Deployment

Notice that the bootstrapping is written in such a way that it’s simply doing an initial deployment. All further deployments use the same pattern. Salt is an essential part of our deployment process. It’s run on every single deployment. If a deployment is simply a code change with no Salt changes, the run is incredibly fast, since salt-call returns no-change runs in around 12 seconds. Since we’re always deploying base changes with any deploy, we also have a mechanism to update the base repository and make Salt changes on every system immediately.

13 Comments

  1. Hi Ryan,

    Thanks for sharing that! I wonder how do you distribute sensitive configuration changes (credentials, certificates) and application/services updates?

    I guess you upload (from CI) updated pillar data tars and the same goes for application/services. And the rest is handled by AWS ELB and Salt magic. Am I right?

    Also, I’m keen to read more on that topic – configuration over time management using Salt in AWS. I see a great deal of articles describing how to roll out stuff in the Cloud, however there is not so much information in the wild how to control and sustain it over time. My take on it – most of the tooling (e.g. Salt, Ansible, ..) are mostly initial roll out kind of beasts, not so much operation wise.

    If by any chance you have some experience to share regarding this and extra time to spare to write a post on it – I encourage you to do so ;) It’s something I and, I believe most of Salt community, is struggling with to some degree or another.

    Kudos!

    1. The pillars can be put into S3 subdirectories, which are restricted by the IAM policy to resources in that service_instance (by relying on naming conventions).

  2. Great post. I’m trying to clarify a couple of things though.
    You say that you’re using Salt for orchestration. Do you mean to say that you are using Salt’s remote execution tools/cli (e.g. salt ‘*’ test.ping) for node orchestration? If so, doesn’t this imply that you do have a master with which to control your minions? But you’re just not using it to deliver your Salt config management code (i.e. your states) hence describing your setup as masterless?
    Or, when you say orchestration, are you just referring to what you describe above, wherein you just run Salt states locally to orchestrate AWS? In that case, how are you orchestrating the nodes? Just via this (I’m assuming custom) deployer tool you’ve mentioned?
    In any case, good stuff.

    1. We’re doing orchestration via states run from an orchestration node. We’re not using a master. Config management on the nodes is done via the (custom) deployer. It’s generally a very simple deployer. It generates artifacts via jenkins, stores them in S3 and the nodes themselves pull the artifacts from S3 and deploy whichever is marked as current.

  3. Interesting article. I’m newish to Salt and Salt-cloud, but I’m wondering what salt command you run that large state to setup everything ie. all the boto stuff for IAM, ELB, LC, ASG setup etc..?

    I mean you’re not globbing it against instances or anything like ‘salt ‘*’ state.sls mystate’. Is it run with salt-cloud? like a profile? How does one run a state like that?

    1. I use masterless, so I just call: ‘salt-call state.sls mystate’

      All the boto stuff I have in this article is state modules. It isn’t salt-cloud.

  4. Ah okay, I was heading down a rabbit hole with reactor based stuff trying to automate my autoscale deploys/terminations.

    Is the boto stuff called locally as well? So you just have to ensure the instance running it locally has sufficient IAM permissions to make the calls? or specify IAM details within the module commands?

    I’ve got some tests up and running with the instance states based on this kind of flow, before I attempt to fully salt all the AWS modules, but a few gaps with human-readable conventions are making it tricky.

    What approach do you take for setting the grain details per autoscale group? Would you mind sharing your grains state for obtaining aws details and setting the hostname etc?

    1. The boto stuff is just salt states, so it can be called however you’d normally call your states. The instance running it either needs to have the right permissions via iam roles, or you need to ensure boto is configured with the right permissions, or you need to configure salt to pass the credentials in.

      We name our AWS resources with naming conventions as mentioned in that section of this post. We propagate the asg name tag to the instances, and we parse the name tag into grains. That’s also how we define the hostnames.

      We have pretty extensive AWS support in the boto_* modules and have more coming. If you get everything going and want to improve the boto modules, please do! :)

    1. You can, but alas, for now you need to use the subnet ids. We have a proposal in place to reference subnets by vpc_name:subnet_name, but work still needs to be done there.

  5. Ryan,

    Thank you for the post.
    When I follow your example and try to set only a Security group I get the following error message:
    [ERROR ] State ‘boto_secgroup.present’ was not found in SLS ‘myapp’
    Reason: ‘boto_secgroup’ virtual returned False

    local:

    ID: Ensure mysecgroup exists
    Function: boto_secgroup.present
    Name: mysecgroup
    Result: False
    Comment: State 'boto_secgroup.present' was not found in SLS 'myapp'
    Reason: 'boto_secgroup' __virtual__ returned False
    Started:
    Duration:
    Changes:

    Any idea?

    Thanks

    Summary for local

    Succeeded: 0
