Moving away from Puppet: SaltStack or Ansible?

Over the past month at Lyft we’ve been working on porting our infrastructure code away from Puppet. We had some difficulty coming to agreement on whether we wanted to use SaltStack (Salt) or Ansible. We were already using Salt for AWS orchestration, but we were divided on whether Salt or Ansible would be better for configuration management. We decided to settle it the thorough way by implementing the port in both Salt and Ansible, comparing them over multiple criteria.

First, let me start by explaining why we decided to port away from Puppet: we had a complex Puppet code base with around 10,000 lines of actual Puppet code. This code was originally spaghetti-code oriented and over the past year or so had been converted to a new pattern that used Hiera and Puppet modules split up into services and components (roughly the role pattern, for those familiar with Puppet). The code base was a mixture of these two patterns, and our DevOps team was composed almost entirely of recently hired members who were not very familiar with Puppet and were unfamiliar with the code base. It was large, unwieldy and complex, especially for our core application. Our DevOps team was getting accustomed to the Puppet infrastructure; however, Lyft is strongly rooted in the concept of ‘If you build it you run it’. The DevOps team felt that the Puppet infrastructure was too difficult to pick up quickly and would be impossible to introduce to our developers as the tool they’d use to manage their own services.

Before I delve into the comparison, we had some requirements of the new infrastructure:

  1. No masters. For Ansible this meant using ansible-playbook locally, and for Salt this meant using salt-call locally. Using a master for configuration management adds an unnecessary point of failure and sacrifices performance.
  2. Code should be as simple as possible. Configuration management abstractions generally lead to complicated, convoluted and difficult to understand code.
  3. No optimizations that would make the code read in an illogical order.
  4. Code must be split into two parts: base and service-specific, where each would reside in separate repositories. We want the base section of the code to cover configuration and services that would be deployed for every service (monitoring, alerting, logging, users, etc.) and we want the service-specific code to reside in the application repositories.
  5. The code must work for multiple environments (development, staging, production).
  6. The code should read and run in sequential order.

Here’s how we compared:

  1. Simplicity/Ease of Use
  2. Maturity
  3. Performance
  4. Community

Simplicity/Ease of Use

Ansible:

A couple of team members had a strong preference for Ansible, as they felt it was easier to use than Salt, so I started by implementing the port in Ansible and then implemented it again in Salt.

As I started, Ansible was indeed simple. The documentation was clearly structured, which made learning the syntax and general workflow relatively easy. The documentation is oriented toward running Ansible from a controller rather than locally, which made the initial work slightly harder to pick up, but it wasn’t a major stumbling block. The biggest issue was needing to have an inventory file with ‘localhost’ defined and needing to use -c local on the command line. Additionally, Ansible’s playbook structure is very simple. There are tasks, handlers, variables and facts. Tasks do the work in order and can notify handlers to perform actions at the end of the run. Variables can be used via Jinja in the playbooks or in templates. Facts are gathered from the system and can be used like variables.
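
For reference, here’s a minimal sketch of what that masterless setup looked like (file names here are illustrative, not necessarily our exact layout):

# inventory: a single entry is enough for local runs
localhost

# each run then looked roughly like:
#   ansible-playbook -i inventory -c local playbook.yml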

Developing the playbook was straightforward. Ansible always runs in order and exits immediately when an error occurs. This made development relatively easy and consistent. For the most part this also meant that when I destroyed my vagrant instance and recreated it that my playbook was consistently run.

That said, as I was developing I noticed that my ordering was occasionally problematic and I needed to move things around. As I finished porting sections of the code I’d occasionally destroy and recreate my vagrant instance and re-run the playbook, and would then notice errors in my execution. Overall, though, ordered execution was far more reliable than Puppet’s unordered execution.

My initial playbook was a single file. As I went to split base and service apart, I noticed some complexity creeping in. Ansible includes tasks and handlers separately, and when included the format changes, which was confusing at first. My playbook was now: playbook.yml, base.yml, base-handlers.yml, service.yml and service-handlers.yml. For variables I had user.yml and common.yml. As I was developing I generally needed to keep the handler files open so that I could easily reference them from the tasks. A rough sketch of the resulting layout follows.
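
Here’s a hedged sketch of roughly what the top-level playbook looked like after the split (simplified, using Ansible’s include syntax of the time):

- hosts: localhost
  vars_files:
    - user.yml
    - common.yml
  tasks:
    - include: base.yml
    - include: service.yml
  handlers:
    - include: base-handlers.yml
    - include: service-handlers.yml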

The use of Jinja in Ansible is well executed. Here’s an example of adding users from a dictionary of users:

- name: Ensure groups exist
  group: name={{ item.key }} gid={{ item.value.id }}
  with_dict: users

- name: Ensure users exist
  user: name={{ item.key }} uid={{ item.value.id }} group={{ item.key }} groups=vboxsf,syslog comment="{{ item.value.full_name }}" shell=/bin/bash
  with_dict: users

For playbooks Ansible uses Jinja for variables, but not for logic. Looping and conditionals are built into the DSL: with/when/etc. control how individual tasks are handled. This is important to note because it means you can only loop over individual tasks. A downside of Ansible doing logic via the DSL is that I found myself constantly needing to look at the documentation for looping and conditionals. Because Ansible controls its logic itself, though, it has a pretty powerful feature: variable registration. Tasks can register data into variables for use in later tasks. Here’s an example:

- name: Check test pecl module
  shell: "pecl list | grep test | awk '{ print $2 }'"
  register: pecl_test_result
  ignore_errors: True
  changed_when: False

- name: Ensure test pecl module is installed
  command: pecl install -f test-1.1.1
  when: pecl_test_result.stdout != '1.1.1'

This is one of Ansible’s most powerful tools, but unfortunately Ansible also relies on this for pretty basic functionality. Notice in the above what’s happening. The first task checks the status of a shell command then registers it to a variable so that it can be used in the next task. I was displeased to see it took this much effort to do very basic functionality. This should be a feature of the DSL. Puppet, for instance, has a much more elegant syntax for this:

exec { 'Ensure redis pecl module is installed':
  command => 'pecl install -f redis-2.2.4',
  unless  => 'pecl list | grep redis | awk \'{ print $2 }\'';
}

I was initially very excited about this feature, thinking I’d use it often in interesting ways, but as it turned out I only used the feature for cases where I needed to shell out in the above pattern because a module didn’t exist for what I needed to do.

Some module functionality is broken up into a number of different modules, which made it difficult to figure out how to do some basic tasks. For instance, basic file operations are split between the file, copy, fetch, get_url, lineinfile, replace, stat and template modules. This was annoying when referencing documentation, where I needed to jump between modules until I found the right one. The shell/command module split is much more annoying, as command will only run simple commands and won’t warn you when shell features like pipes and redirects aren’t being interpreted. A few times I wrote a task using the command module, then later changed the command being run. The new command actually required the use of the shell module, but I didn’t realize it and spent quite a while trying to figure out what was wrong with the execution.
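
As an illustration (a hedged sketch reusing the pecl example from earlier), the same sort of command behaves very differently under the two modules:

# Fine with the command module: a plain command with no shell features
- name: List installed pecl modules
  command: pecl list
  changed_when: False

# Needs the shell module: the pipe and awk script are shell features
# that the command module would not interpret
- name: Check test pecl module version
  shell: "pecl list | grep test | awk '{ print $2 }'"
  register: pecl_test_result
  changed_when: False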

I found the input, output, DSL and configuration formats of Ansible perplexingly inconsistent. Here are some examples:

  • Ansible and inventory configuration: INI format
  • Custom facts in facts.d: INI format
  • Variables: YAML format
  • Playbooks: YAML format, with key=value format inline
  • Booleans: yes/no format in some places and True/False format in other places
  • Output for introspection of facts: JSON format
  • Output for playbook runs: no idea what format

Output for playbook runs was terse, which was generally nice. Each playbook task output a single line, except for looping, which printed the task line, then each sub-action. Loop actions over dictionaries printed the dict item with the task, which was a little unexpected and cluttered the output. There is little to no control over the output.

Introspection for Ansible was lacking. To see the value of variables in the format actually presented inside the language, it’s necessary to use the debug task inside a playbook, which means you need to edit a file and do a playbook run just to see the values. Getting at the available facts was more straightforward: ‘ansible -m setup hostname’. Note that a hostname must be provided here, which is a little awkward when you’re only ever going to run locally. Debug mode was helpful, but getting in-depth information about what Ansible was actually doing inside tasks was impossible without diving into the code, since every task copies a Python script to /tmp and executes it, hiding any real information.
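
For example, inspecting the registered variable from the earlier pecl task meant adding a throwaway task like this (a sketch) and re-running the playbook:

- name: Show the registered pecl result
  debug: var=pecl_test_result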

When I finished writing the playbooks, I had the following line/word/character counts:

 15     48     472   service-handlers.yml
 463    1635   17185 service.yml
 27     70     555   base-handlers.yml
 353    1161   11986 base.yml
 15     55     432   playbook.yml
 873    2969   30630 total

There were 194 tasks in total.

Salt:

Salt is initially difficult. The organization of the documentation is poor and the text of the documentation is dense, making it difficult for newbies. Salt assumes you’re running in master/minion mode and uses absolute paths for its states, modules, etc. Unless you’re using the default locations, which are poorly documented for masterless mode, it’s necessary to create a configuration file. The documentation for configuring the minion is dense and there are no guides for the normal configuration modes. States and pillars both require a ‘top.sls’ file which defines what will be included per host (or whatever host matching scheme you’re using); this is somewhat confusing at first.
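
To make that concrete, here’s a hedged sketch of the kind of minion configuration and top file a masterless setup needs (the paths and layout are illustrative, not necessarily ours); runs are then done with ‘salt-call --local state.highstate’:

# /etc/salt/minion
file_client: local
file_roots:
  base:
    - /srv/salt/states
pillar_roots:
  base:
    - /srv/salt/pillars

# states top.sls
base:
  '*':
    - base
    - service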

Past the initial setup, Salt was straightforward. Salt’s state system has states, pillars and grains. States are the YAML DSL used for configuration management, pillars are user defined variables and grains are variables gathered from the system. All parts of the system except for the configuration file are templated through Jinja.

Developing Salt’s states was straightforward. Salt’s default mode of operation is to execute states in order, but it also has a requisite system, like Puppet’s, which can change the order of execution. Triggering events (like restarting a service) is documented using the watch or watch_in requisite, which means that following the default documentation will generally result in out-of-order execution. Salt also provides the listen/listen_in global state arguments, which execute at the end of a state run and do not modify ordering. By default Salt does not immediately halt execution when a state fails, but runs all states and returns the results with a list of failures and successes. It’s possible to modify this behavior via the configuration. Though Salt didn’t exit on errors, I found that after destroying and rebuilding my vagrant instance I hit errors at a similar rate as with Ansible. That said, I did eventually set the configuration to hard fail (see the snippet below) since our team felt it would lead to more consistent runs.
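
The hard-fail behavior is a one-line minion config option (a sketch of the option we ended up setting):

# minion config: stop the state run at the first failure
failhard: True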

My initial state definition was in a single file. Splitting this apart into base and service states was very straightforward: I split the files apart and included base from service. Salt makes no distinction between states and commands being notified (handlers in Ansible); there are just states, so base and service each had their associated notification states in their respective files. At this point I had top.sls, base.sls and service.sls for states. For pillars I had top.sls, users.sls and common.sls.
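
The include itself is just a couple of lines at the top of the service state file (sketch):

# service.sls
include:
  - base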

The use of Jinja in Salt is well executed. Here’s an example of adding users from a dictionary of users:

{% for name, user in pillar['users'].items() %}
Ensure user {{ name }} exist:
  user.present:
    - name: {{ name }}
    - uid: {{ user.id }}
    - gid_from_name: True
    - shell: /bin/bash
    - groups:
      - vboxsf
      - syslog
    - fullname: {{ user.full_name }}
{% endfor %}

Salt uses Jinja for both state logic and templates. It’s important to note that the Jinja is rendered before the states are executed. A downside of this is that you can’t do something like this:

Ensure myelb exists:
  boto_elb.present:
    - name: myelb
    - availability_zones:
      - us-east-1a
    - listeners:
      - elb_port: 80
        instance_port: 80
        elb_protocol: HTTP
      - elb_port: 443
        instance_port: 80
        elb_protocol: HTTPS
        instance_protocol: HTTP
        certificate: 'arn:aws:iam::879879:server-certificate/mycert'
      - health_check:
          target: 'TCP:8210'
    - profile: myprofile

{% set elb = salt['boto_elb.get_elb_config']('myelb', profile='myprofile') %}

{% if elb %}
Ensure myrecord.example.com cname points at ELB:
  boto_route53.present:
    - name: myrecord.example.com.
    - zone: example.com.
    - type: CNAME
    - value: {{ elb.dns_name }}
{% endif %}

That’s not possible because the Jinja running ’set elb’ is going to run before ‘Ensure myelb exists’, since the Jinja is always rendered before the states are executed.

On the other hand, since Jinja is executed first, it means you can wrap multiple states in a single loop:

{% for module, version in {
       'test': ('1.1.1', 'stable'),
       'hello': ('1.2.1', 'stable'),
       'world': ('2.2.2', 'beta')
   }.items() %}
Ensure {{ module }} pecl module is installed:
  pecl.installed:
    - name: {{ module }}
    - version: {{ version[0] }}
    - preferred_state: {{ version[1] }}

Ensure {{ module }} pecl module is configured:
  file.managed:
    - name: /etc/php5/mods-available/{{ module }}.ini
    - contents: "extension={{ module }}.so"
    - listen_in:
      - cmd: Restart apache

Ensure {{ module }} pecl module is enabled for cli:
  file.symlink:
    - name: /etc/php5/cli/conf.d/{{ module }}.ini
    - target: /etc/php5/mods-available/{{ module }}.ini

Ensure {{ module }} pecl module is enabled for apache:
  file.symlink:
    - name: /etc/php5/apache2/conf.d/{{ module }}.ini
    - target: /etc/php5/mods-available/{{ module }}.ini
    - listen_in:
      - cmd: Restart apache
{% endfor %}

Of course something similar to Ansible’s register functionality isn’t available either. This turned out to be fine, though, since Salt has a very feature-rich DSL. Here’s an example of a case where it was necessary to shell out:

# We need to ensure the current link points to src.git initially
# but we only want to do so if there’s not a link there already,
# since it will point to the current deployed version later.
Ensure link from current to src.git exists if needed:
  cmd.run:
    - name: ln -s /srv/service/src.git /srv/service/current
    - unless: test -L /srv/service/current

Additionally, as a developer who wanted to switch to either Salt or Ansible because they’re written in Python, I found it very refreshing to use Jinja for logic in the states rather than something built into the DSL, since I didn’t need to look at DSL-specific documentation for looping or conditionals.

Salt is very consistent when it comes to input, output and configuration. Everything is YAML by default. Salt will happily give you output in a number of different formats, including ones you create yourself via outputter modules. The default output of state runs shows the status of all states, but can be configured in multiple ways. I ended up using the following configuration:

# Show terse output for successful states and full output for failures.
state_output: mixed
# Only show changes
state_verbose: False

State runs that don’t change anything show nothing. State runs that change things will show the changes as single lines, but failures show full output so that it’s possible to see stacktraces.

Introspection for Salt was excellent. Both grains and pillars were accessible from the CLI in a consistent manner (salt-call grains.items; salt-call pillar.items). Salt’s info log level shows in-depth information about what is occurring per module. Using the debug log level even shows how the code is being loaded, the order it’s being loaded in, the OrderedDict that’s generated for the state run, the OrderedDict that’s used for the pillars, the OrderedDict that’s used for the grains, etc. I found it was very easy to trace bugs in Salt down far enough to report issues, and even to quickly fix some of the bugs myself.

When I finished writing the states, I had the following line/word/character counts:

527    1629   14553 api.sls
6      18     109   top.sls
576    1604   13986 base/init.sls
1109   3251   28648 total

There were 151 salt states in total.

Notice that though there are 236 more lines of Salt, there are fewer characters in total. This is because Ansible has a short-form syntax that makes its lines longer but uses fewer lines overall, which makes it difficult to compare the two directly by lines of code. The number of states/tasks is a better metric to go by anyway.

Maturity

Both Salt and Ansible are currently more than mature enough to replace Puppet. At no point was I unable to continue because a necessary feature was missing from either.

That said, Salt’s execution and state module support is more mature than Ansible’s, overall. An example is how to add users. It’s common to add a user with a group of the same name. Doing this in Ansible requires two tasks:

- name: Ensure groups exist
  group: name={{ item.key }} gid={{ item.value.id }}
  with_dict: users

- name: Ensure users exist
  user: name={{ item.key }} uid={{ item.value.id }} group={{ item.key }} groups=vboxsf,syslog comment="{{ item.value.full_name }}" shell=/bin/bash
  with_dict: users

Doing the same in Salt requires one:

{% for name, user in pillar['users'].items() %}
Ensure user {{ name }} exist:
  user.present:
    - name: {{ name }}
    - uid: {{ user.id }}
    - gid_from_name: True
    - shell: /bin/bash
    - groups:
      - vboxsf
      - syslog
    - fullname: {{ user.full_name }}
{% endfor %}

Additionally, Salt’s user module supports shadow attributes, where Ansible’s does not.

Another example is installing a debian package from a url. Doing this in Ansible is two tasks:

- name: Download mypackage debian package
  get_url: url=https://s3.amazonaws.com/mybucket/mypackage/mypackage_0.1.0-1_amd64.deb dest=/tmp/mypackage_0.1.0-1_amd64.deb

- name: Ensure mypackage is installed
  apt: deb=/tmp/mypackage_0.1.0-1_amd64.deb

Doing the same in Salt requires one:

Ensure mypackage is installed:
  pkg.installed:
    - sources:
      - mypackage: https://s3.amazonaws.com/mybucket/mypackage/mypackage_0.1.0-1_amd64.deb

Another example is fetching files from S3. Salt has native support for this anywhere files are referenced in its modules, while in Ansible you must use the s3 module to download a file to a temporary location on the filesystem, then use one of the file modules to manage it.
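
For instance, a Salt state can point directly at an S3 object as a source (a hedged sketch with made-up bucket and key names; the s3.key and s3.keyid settings need to be present in the minion config for this to work):

Ensure app config is present:
  file.managed:
    - name: /etc/myapp/config.yml
    - source: s3://mybucket/myapp/config.yml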

Salt has state modules for the following things that Ansible did not have:

  • pecl
  • mail aliases
  • ssh known hosts

Ansible had a few broken modules:

  • copy: when content is used, it writes POSIX non-compliant files by default. I opened an issue for this and it was marked as won’t fix. More on this in the Community section.
  • apache2_module: always reports changes for some modules. I opened an issue and it was marked as a duplicate. A fix is in a pull request, still open as of this writing with no response since June 24, 2014.
  • supervisorctl: doesn’t handle a race condition properly where a service starts after it checks its status. A fix is in a pull request, still open as of this writing with no response since June 29, 2014. An earlier pull request from Aug 30, 2013 attempted a fix unsuccessfully; the issue is still marked as closed, though there are reports of it still being broken.

Salt had broken modules as well, both of which were broken in the same way as the Ansible equivalents, which was amusing:

  • apache_module: always reports changes for some modules. Fixed in upcoming release.
  • supervisorctl: doesn’t handle a race condition properly where a service starts after it checks its status. Fixed in upcoming release.

Past basic module support, Salt is far more feature-rich:

  • Salt can output in a number of different formats, including custom ones (via outputters)
  • Salt can output to other locations like mysql, redis, mongo, or custom locations (via returners)
  • Salt can load its pillars from a number of locations, including custom ones (via external pillars)
  • If running an agent, Salt can fire local events that can be reacted upon (via reactors); if using a master it’s also possible to react to events from minions.

Performance

Salt was faster than Ansible for state/playbook runs. For no-change runs Salt was considerably faster. Here’s some performance data for each, for full runs and no-change runs. Note that these runs were relatively consistent across large numbers of system builds in both vagrant and AWS and the full run times were mostly related to package/pip/npm/etc installations:

Salt:

  • Full run: 12m 30s
  • No change run: 15s

Ansible:

  • Full run: 16m
  • No change run: 2m

I was very surprised at how slow Ansible was when making no changes. Nearly all of this time was related to user accounts, groups, and ssh key management. In fact, I opened an issue for it. Ansible takes on average 0.5 seconds per user, and this extends to other modules that use loops over large dictionaries. As the number of managed users grows, our no-change (and full-change) runs will grow with it. If we double our managed users we’ll be looking at 3-4 minute no-change runs.

I mentioned in the Simplicity/Ease of Use section that I had started this project by developing with Ansible and then re-implementing in Salt, but as time progressed I started implementing the Salt version while the Ansible runs were still going. By the time I got halfway through implementing in Ansible I had already finished implementing everything in Salt.

Community

There’s a number of ways to rate a community. For Open Source projects I generally consider a few things:

  1. Participation

In terms of development participation, Salt has 4 times the number of merged pull requests (471 for Salt and 112 for Ansible) in the one-month period preceding this writing. It also has three times the number of total commits. Salt is also much more diverse from the perspective of community contribution: Ansible is almost solely written by mpdehaan, and nearly all of the top 10 Salt contributors have more commits than the #2 committer for Ansible. That said, Ansible has more stars and forks on GitHub, which may imply a larger user community.

Both Salt and Ansible have a very high level of participation. They are generally always in the running with each other for the most active GitHub project, so in either case you should feel assured the community is strong.

  2. Friendliness

Ansible has a somewhat bad reputation here. I’ve heard anecdotal stories of people being kicked out of the Ansible community. While originally researching Ansible I found some examples of rude behavior toward well-meaning contributors. I did get a “pull request welcome” response on a legitimate bug, which is an anti-pattern in the open source world. That said, the IRC channel was incredibly friendly, and all of the mailing list posts I read during this project were friendly as well.

Salt has an excellent reputation here. They thank users for bug reports and code. They are very receptive and open to feature requests. They respond quickly on the lists, email, twitter and IRC in a very friendly manner. The only complaint that I have here is that they are sometimes less rigorous than they should be when it comes to accepting code (I’d like to see more code review).

  3. Responsiveness

I opened 4 issues while working on the Ansible port. 3 were closed as won’t fix and 1 was marked as a duplicate. Ansible’s issue reporting process is somewhat laborious. All issues must use a template, which requires a few clicks to get to and copy/paste. If you don’t use the template they won’t help you (and will auto-close the issue after a few days).

Of the issues marked won’t fix:

  1. user/group module slow: Not considered a bug that Ansible can do much about. The issue was closed with basically no discussion, and I was told I was welcome to start a discussion about it on the mailing list. (For comparison: Salt checks all users, groups and ssh keys in roughly 1 second.)
  2. Global ignore_errors: Feature request. Ansible was uninterested in the feature and the issue was closed without discussion.
  3. Content argument of copy module doesn’t add end-of-file character: The issue was closed as won’t fix without discussion. When I linked to the POSIX spec showing why it was a bug, the issue wasn’t reopened and I was told I could submit a patch. At this point I stopped submitting further bug reports.

Salt was incredibly responsive when it came to issues. I opened 19 issues while working on the port. 3 of these issues weren’t actually bugs and I closed them of my own accord after discussion in the issues. 4 were documentation issues. Let’s take a look at the rest of the issues:

  1. pecl state missing argument: I submitted an issue with a pull request. It was merged and closed the same day.
  2. Stacktrace when fetching directories using the S3 module: I submitted an issue with a pull request. It was merged the same day and the issue was closed the next.
  3. grains_dir is not a valid configuration option: I submitted an issue with no pull request. I was thanked for the report and the issue was marked as Approved the same day. The bug was fixed and merged 4 days later.
  4. Apache state should have enmod and dismod capability: I submitted an issue with a pull request. It was merged and closed the same day.
  5. The hold argument is broken for pkg.installed: I submitted an issue without a pull request. I got a response the same day. The bug was fixed and merged the next day.
  6. Sequential operation relatively impossible currently: I submitted an issue without a pull request. I then went into IRC and had a long discussion with the developers about how this could be fixed. The issue was with the use of watch/watch_in requisites and how it modifies the order of state runs. I proposed a new set of requisites that would work like Ansible’s handlers. The issue was marked Approved after the IRC conversation. Later that night the founder (Thomas Hatch) wrote and merged the fix and let me know about it via Twitter. The bug was closed the following day.
  7. Stacktrace with listen/listen_in when key is not valid: This bug was a followup to the listen/listen_in feature. It was fixed/merged and closed the same day.
  8. Stacktrace using new listen/listen_in feature: This bug was an additional followup to the listen/listen_in feature and was reported at the same time as the previous one. It was fixed/merged and closed the same day.
  9. pkgrepo should only run refresh_db once: This is a feature request to save me 30 seconds on occasional state runs. It’s still open at the time of this writing, but was marked as Approved and the discussion has a recommended solution.
  10. refresh=True shouldn’t run when package specifies version and it matches: This is a feature request to save me 30 seconds on occasional state runs. It was fixed and merged 24 days later, but the bug still shows open (it’s likely waiting for me to verify).
  11. Add an enforce option to the ssh_auth state: This is a feature request. It’s still open at the time of this writing, but it was approved the same day.
  12. Allow minion config options to be modified from salt-call: This is a feature request. It’s still open at the time of this writing, but it was approved the same day and a possible solution was listed in the discussion.

All of these bugs, except for the listen/listen_in feature, could have easily been worked around, but I felt confident that if I submitted an issue the bug would get fixed, or I’d be given a reasonable workaround. When I submitted issues I was usually thanked for the submission and I got confirmation on whether or not my issue was approved to be fixed. When I submitted code I was always thanked and my code was almost always merged the same day. Most of the issues I submitted were fixed within 24 hours, even a relatively major change like the listen/listen_in feature.

  4. Documentation

For new users Ansible’s documentation is much better. The organization of the docs and the brevity of the documentation make it very easy to get started. Salt’s documentation is poorly organized and is very dense, making it difficult to get started.

While implementing the port, I found the density of Salt’s docs to be immensely helpful and the brevity of Ansible’s docs to be infuriating. I spent much longer periods of time trying to figure out the subtleties of Ansible’s modules since they were relatively undocumented. Not a single Ansible module documents the dictionary returned for variable registration, which required me to write a debug task and run the playbook every time I needed to register a variable, which was annoyingly often.

Salt’s docs are unnecessarily broken up, though. There are multiple sections on states, multiple sections on global state arguments, and multiple sections on pillars; the list goes on. Many of these docs overlap, which makes searching for the right doc difficult. The split between execution modules and state modules (which I rather enjoy when doing Salt development) makes searching for modules more difficult when writing states.

I’m a harsh critic of documentation though, so for both Salt and Ansible, you should take this with a grain of salt (ha ha) and take a look at the docs yourself.

Conclusion

At this point both Salt and Ansible are viable and excellent options for replacing Puppet. As you may have guessed by now, I’m more in favor of Salt. I feel the language is more mature, it’s much faster and the community is friendlier and more responsive. If I couldn’t use Salt for a project, Ansible would be my second choice. Both Salt and Ansible are easier, faster, and more reliable than Puppet or Chef.

As you may have noticed earlier in this post, we had 10,000 lines of Puppet code and reduced that to roughly 1,000 in both Salt and Ansible. That alone should speak highly of both.

After implementing the port in both Salt and Ansible, the Lyft DevOps team all agreed to go with Salt.

152 Comments

  1. Hi,

    thanks for the write up. One thing I’m curious about; in your conclusion you state ‘Both Salt and Ansible are easier, faster, and more reliable than Puppet or Chef.’ This is the only reference to Chef in the entire post. Can you elaborate on how you reached that conclusion related to Chef specifically?

    I’ll spare you the ‘we use Chef to manage a jillion nodes for giant outfits blah blah’ bit, but I am curious as in my experience, whatever Chef’s shortcomings might be, reliability isn’t one of them, so I’d really be interested to hear more about this.

    cheers

  2. […] You may have to pay off some debt with a re-write (hope you didn’t have too much code, like 10,000 lines of Puppet code or anything), or carry that debt by being tied to an older version of Puppet – and Ruby. Who […]

  3. I don’t understand why Puppet and Chef are getting so easily dismissed because they have a central server. In large, disparate or collaborative environments I like the idea of a centralized server that actually stores and runs the changes. At this point you reduce regressions introduced by person A making a change that person B accidentally reverts. At some point you need to orchestrate your system so that changes come from a single source, whether it’s a chef server or a CI server (or both); something needs to be a gateway to your infrastructure. You can definitely achieve this with git branch/merge and a CI server, but if you have a weird merge conflict that takes some time to debug you run the risk of not being able to bootstrap with confidence, and merge conflicts will happen often. Just because out of the box ansible doesn’t have a master doesn’t make it better. chef-solo and chef-zero can achieve the same things without a master server. In fact, I find that many ansible users soon find they need centralization at some point when they start doing deployments en masse and have to either write their own CI wrapper scripts or use Tower.

    1. Note that we’re using Salt, not Ansible, but I think your points can likely apply to both.

      I’m not dismissing Chef and Puppet because they use a master. We made a decision early in the process to eliminate masters because they make config management and orchestration more complicated in a lot of ways, especially in an AWS multi-AZ architecture where you assume you can always lose an entire AZ. One of the more annoying things to deal with is auto-scaling in a mastered environment.

      We have everything built around the concept of a deployment, and the majority of the code is located within each service’s repo. We have shared non-service specific code in a central repo and we can deploy that separately from our service repos. Merge conflicts are simply not a problem :).

      There’s a possibility we’ll use a master in the future (that just handles remote execution, events, etc), but as-is right now we don’t see a strong need for it.

      This article was really meant as a comparison of Salt and Ansible. It just happens that both were able to complete a real world example in far fewer LOC than Puppet (in a way that was clearer and simpler for everyone to understand). Salt and Ansible were also much faster running than Puppet (with Salt being far faster than the other two). Based on that I thought I’d share that data along with this article.

  4. How many nodes are you running on? Both Puppet and Chef are targeted at running really big environments, but anyway, here’s a short summary of the article: “We had a bunch of legacy shitcode, developed by an undefined number of ninjas from time to time, and instead of actually learning how to use Puppet, we switched to some new hipster toys.” Let’s talk after a couple of years, when you’ve got 10,000 lines of code for Salt.
    BTW, I don’t believe that you were able to reduce the code to 1,000 from 10,000. Sounds like marketing to me.

    1. I have about 5 years of experience with Puppet and have done some pretty cool things with it in the past (just look at older posts from this blog). I’ve been a speaker at puppetconf as well ;). As mentioned in earlier comments, we could likely have reduced our Puppet LOC by finishing up our Puppet refactor, but not by a lot. Puppet was only managing about 10 services at the time. We’re running 40-50 services now and have about 9k LOC of Salt. Most of our services are roughly 20-100 Salt LOC.

      1. Hi Ryan,

        since you used Puppet for so long (and I watched your talk at puppetconf ’12), it would be really great if you could write an in-depth comparison from your perspective (Puppet vs Salt), because this article doesn’t really cover the reasons for migration objectively.

        1. Our major reason for moving is that we needed to do a major refactor that was close to the amount of work it would take for a rewrite and most of the people on our team didn’t like puppet very much. Once I did a comparison of Salt and Ansible everyone was in pretty strong favor of moving since I handled most of the work necessary to move everything just by doing the proof of concept.

          Maybe I can do a more in-depth comparison of Salt and Puppet, but I don’t currently have a really strong motivation to do so like I did with Salt and Ansible.

  5. Thanks for sharing your perspective on the two. I’m currently investigating which configuration management tool would be best for my environment. Leaning towards Ansible, but after reading this I think I’ll give Salt a bit more consideration.

  6. Super article! I just have a question for you: what tool do you usually use for provisioning? It seems to be something configuration managers usually don’t take care of.

  7. This writeup was really helpful. I’ve bookmarked it as I have a feeling I’ll come back to it for reference on how you did things with Salt. Thank you for putting all of this together for us :)

  8. Why couldn’t you just:

    1. package the application
    2. design a generic variable expansion library (using env(1), for instance)
    3. design a generic self-assembly facility packages can call, and which calls 2. to expand the variables in configuration fragments
    4. package the above framework
    5. make configuration OS packages, which drop the fragments into /etc/opt/app.d/, and call 2. to expand the variables in the template, and then 3. to assemble the expanded files into the config file in the application’s target location?

    The OS packages can call the expansion and self assembly framework in their postinstall phase. The definitions for the expansion could come from a RDBMS (like Oracle, PostgreSQL, or SQLite), so that standard query and update mechanism could be leveraged, and the code doing all of this could be sequentially written, simple shell code. I did this and it works phenomenally, with the additional benefit that 2nd level support can query the OS to tell them exactly what has been done to the system at any given point in time; because everything is packaged, nobody ever hacks on the systems interactively.

    Also, because everything is packaged, the 2nd level support does not need to know the intimate details of each and every component, since that knowledge is encapsulated into the package.

    For instance: Oracle database instances are created by mass-deploying an OS database configuration package. This package calls the variable expansion framework (also packaged), and then calls the appropriate database creation program (also packaged). In 45 minutes, we are guaranteed to get a running Oracle instance, waiting to have data imported and start serving it.

    Best of all: no special software needed. Maximum speed. Fully capability maturity model level 2 compliant. OS software subsystem compliant. Directly pluggable into JumpStart, Kickstart, NIM, AutoYaST, Ignite-UX, …

    1. Configuration packages are a nightmare. An absolute horrible nightmare that I’ll never return to. It’s worse than config management via bash scripts (and yes, that’s an intentional jab at dockerfiles). This was a common practice before puppet existed. You can quickly and easily get into dependency hell or run into situations where one package uninstalls all of your other packages. Additionally, building packages on basically every platform sucks and few people know how to do it. Almost no one knows how to do it correctly. For instance, by default on Ubuntu/Debian if your package puts a file in /etc it’s a config file and won’t be overwritten by newer packages, unless you force it to. That bit of black magic is great for packaging software, but is nightmarish when doing config management.

      If you’re looking at doing something like this, I really recommend using docker (or LXD or rocket) where you get all the benefits you’re mentioning while also being completely immutable. That said, I don’t recommend dockerfiles. I recommend generating docker images by running Salt (or Ansible, or Puppet or Chef), then doing a commit of the container to make your image. All the benefits of immutable infrastructure with the friendliness of proper config management.

        1. There’s lots of reasons not to, which I mentioned in my last comment :). Packages aren’t a good method of configuration management, they’re a good method of bundling and distributing software that has a limited amount of dependencies on other pieces of software (which are usually libraries). Once you start adding configuration in there, it gets painful since it’s partially working against the system.

          1. The previous comment was directed specifically at organization’s ability to create packages. I just don’t buy the argument in 2015 that packaging is hard and we should not do it because of that. Other reasons listed are certainly valid (config can come from ERB templates and such).

      1. “Configuration packages are a nightmare. An absolute horrible nightmare that I’ll never return to. It’s worse than config management via bash scripts (and yes, that’s an intentional jab at dockerfiles).”

        Then I would say that something has gone wrong, terribly wrong. I guess you missed the “I did this and it works phenomenally” part, especially the phenomenally part. I should have also added “and at scale, too”, but that is purely my fault.

        “You can quickly and easily get into dependency hell or run into situations where one package uninstalls all of your other packages. ”

        HUH???
        Why would you make packages which uninstall (or install) other packages? That is the job of a software deployment server software, not the software management subsystem in the OS, and by that I mean that the aforementioned software must be able to understand the concept of components and bundles. For the record, it has been done before, and it also works very well. The problem is that it is proprietary software at a certain company which will never see the light of day because the aforementioned company is not in the software business, but it proved beyond doubt that it can scale and of course that it works. That, however, does not stop you or anyone else designing and implementing such software. In fact, it is the final, missing piece in all of these configuration management attempts.

        Also, my configuration packages can NEVER uninstall or remove other packages, or even individual files, if those files are shared among packages. (And conveniently, SVR4 packaging prevents such things with a global lock, so that all installations and removals are sequenced. I believe, but have never tried, that RPM does the same.)

        You see, one of the fundamental weaknesses, architectural weaknesses, of RPM for instance, is that packages may not share files. For instance, two packages are forbidden to claim the same file. All other software management subsystems have no problem with this. Since I use Solaris and SVR4 packaging, pkgrm(1M) will never remove a file also claimed by another package, but using the libraries I developed, it will know exactly which configuration lines in a claimed configuration file belong to it and remove them, without removing the file completely. All the configuration overlay packages leverage the same library, and know exactly which lines in a configuration file are their own. So I can do it both ways, either by injecting variable expanded configuration lines in a file, or by calling the self-assembly SMF method I developed to expand the variables in the configuration fragments, then generate a target configuration file from those fragments, then use installf(1M) to register a claim on the target file so other configuration overlay packages do not delete it when they remove their own excerpts, or remove their fragments from /etc/opt/app.d/ and re-generate the target file. It’s a joy to use, and you should see how well it scales!

        This is where SVR4 packaging and the well thought out architecture trumps RPM, deb and anything else that is trendy these days: the variables in the pkginfo(4) file are propagated into the environment during installation and removal, so the packages can be parametrized, and everything else can be computed at installation time, and installf(1M) and removef(1M) allow me to register files which must be dynamically generated during installation with the software management subsystem, cleanly. Well thought out architecture from the old UNIX masters trumps trendiness and shiny new toys.

        However, I have successfully emulated the same with %define in RPM, and installf(1M) can be emulated, to a point, with %ghost directive in RPM, but it’s a lot more work, which is why it makes sense to stick with a cutting-edge Solaris-based distro, like SmartOS, which is specifically designed for running large clouds at scale.

        The point here is that if well thought out (and if the implementor has the requisite large scale system engineering and architecture experience) the end product can be simple, robust, scale, and not require a domain specific language or extra software. No technology can compensate for well thought-out architecture and engineering. Such a thing does not exist, and is unlikely to ever exist in the future.

        All my configuration overlay packages use simple shell and AWK code; in fact, the most complex code ends up being a little shell control logic wrapped around AWK doing the heavy lifting. It does not get any simpler than that, and these languages are built into the OS, with the added benefit that they are well understood and simple.

        “If you’re looking at doing something like this, I really recommend using docker (or LXD or rocket) where you get all the benefits you’re mentioning while also being completely immutable.”

        I am not looking at doing it, I have been doing it for the past eight years, at scale!!!

        And since I have Solaris and SmartOS, I have the benefit of Solaris zones, backed by the rock solid ZFS, running full-fledged, fully functional containers at bare metal speeds. Compared to Solaris or SmartOS zones, docker is a feeble attempt at copying what zones offer. For those really unportable applications, I can run them in an lx branded zone on SmartOS. The Linux app won’t have a clue that it’s running on a Solaris kernel, and I get the bare metal speed and a fully functional container, without needing to do the extra basic plumming engineering like phusion did with phusion/baseimage-docker (also see the PID 1 reaping problem in docker containers). Phusion had to put in extra engineering on the OS level (docker) so that they would get close to what anyone using Solaris zones in Solaris or SmartOS gets built-in, for free.

        Well thought out architecture and smart development on rock solid technology beats trendyness any day of the week. Customers do not, and will never know that they are running on an Illumos (Solaris) kernel, and I get to enjoy developing on a soundly engineered platform on which data integrity is not an afterthought. It doesn’t get any better than that. Best of all, no Puppet, Chef, Salt, or Ansible required.

        1. I’m not saying that it’s impossible to do flawlessly, I’m saying that it’s difficult to do flawlessly. This is especially true if you consider cross-distro management, which config management tools do very well.

          Even in 2015, dealing with packages is difficult. You happen to know it well enough for it to work really well for your org. If we both took a new developer and trained them to use each of our solutions, I’d bet good money the developer would become productive more quickly with the config management solution.

          Anyway, this is outside the topic of this post, so I’ll stop discussing it here. :)

  9. Great article, thanks !!
    Are you able to write anything about bare metal – using cobbler or foreman?

    Thanks!

  10. Hi Ryan,
    Your article was very instrumental in deciding which configuration management tool to use for deployment activities in production. I found SaltStack to be extremely flexible. ‘Automation’ for me is “a programming language with a ‘state mechanism’ glued into it”, and I found such attributes in SaltStack. My ~400 lines of SaltStack code are able to manage 5 different datacenter locations, each one with a different software configuration (all software developed in-house). I was really amazed to find that, in the end, it turned out to be ‘infrastructure as data’.
    It has a JSON file per DC listing only machine names and the names of services, that’s it. The rest of the job is done by SaltStack. Much easier to deploy new servers.
    Chef might be another contender, but it was not very straightforward to set up and run.

  11. Thanks for sharing your experiences.

    I’m a devops consultant, so I often don’t get to choose the config management tool. However, I always develop without a master using test-kitchen, as it has support for doing push runs of chef, puppet, ansible and salt, as well as the ability to write tests and do TDD.

    To be frank, I think it matters more how you use the tools rather than which tool you use.

    I much prefer to go masterless, as it is less complex and more lightweight, whether it is chef, puppet, ansible or salt.
    I follow the library application pattern and heavily use the Puppet Forge, Chef community or Ansible Galaxy with librarian-puppet, librarian-ansible or berkshelf, as this reduces the size of my repository considerably. If I cannot find anything, then I write the generic part separately and put it in the Puppet Forge, Chef community or Ansible Galaxy.
    I much prefer using a cloud like AWS, as you can use the AWS services (CloudWatch, ELB, autoscaling etc.) to reduce the amount of stuff you have to build yourself.
    I prefer doing immutable infrastructure, as this works well without a master.

    Unfortunately, on the job market in the UK at least, Salt is currently the least used, so as a consultant I have to go with what the companies use, and Puppet is still the most popular. :-)

    P.S. Check out test-kitchen with salt.

    P.P.S. Using masterless Puppet v3 with Hiera is much better than using Puppet master v2.

  12. Ansible is a great product, but irrationally priced. They wanted $35K per year to increase to a 500 server limit. Contracts are for one year and you cannot cancel early. And if you don’t give them 90 days notice, they will auto-renew for another year. I despise such coercive business practices.

  13. Hi, I’m currently evaluating all of these tools and am coming from a Puppet background. The one thing that has struck me about Ansible is the simple workflow ability via playbooks to create dependencies between tasks to help multi-tier service build-outs, which is a bit like CloudFormation on AWS, meaning I can not only use it to manage config on my hosts but can also quickly build datacentre services of many types of infrastructure, which will help me do full-service automated testing, e.g. task #1 (build presentation layer nodes), task #2 (build database tier nodes), task #3 (if task #1 & task #2 are successful, start automated end-to-end service test). I may be missing something, but I do not see how Chef, Puppet or Salt can do this without using a separate orchestrator. As such, I am interested in hearing how the other tools would achieve this higher level of orchestration coupled with the config management function they all do well to varying degrees. Many thanks, Paul

    1. It’s definitely possible with any of the tools. I’m doing this right now with SaltStack; see:

      http://ryandlane.com/blog/2015/04/02/saltconf15-masterless-saltstack-at-scale-talk-and-slides/
      http://ryandlane.com/blog/2014/08/26/saltstack-masterless-bootstrapping/

      SaltStack, in master/minion mode, has an orchestration module that’s meant to do exactly what you describe. I happen to be doing it in a masterless way that’s eventually consistent, but you can do it in the fully consistent manner that you want as well.

      In general I’ve purposely avoided this type of workflow in my infrastructure, though. Things are a lot easier if you can say “here’s what everything looks like” and everything eventually makes it into that state and reports back when it’s done. The nice thing about doing things this way is that the majority of the steps can be done in parallel, and it also makes making changes way easier.

      1. Thanks Ryan, that’s exactly what I was after, and it has nudged Salt a good few rungs up my selection ladder, so now I have my top two. Let the battle commence :)
