Grafana dashboard orchestration using SaltStack

As mentioned in my post on CloudWatch alarms, we (Lyft) believe that it should be easy to do the right thing and difficult to do the wrong thing. We operate on the concept “If you build it, you run it.” Running your own service isn’t easy without the right tools, though.

We’re using Graphite and Grafana for time series data dashboards. With a consistent configuration management pattern, all new services start with their data flowing into Graphite. Dashboard management is tricky, though. We encourage teams to add custom metrics to their services and use them in panels and rows for their services, but we also want to provide a number of consistent panels/rows for all services. We also want to avoid making teams switch between multiple dashboards to monitor their own services.

To make it easy for services to manage their own dashboards, we’re using Grafana backed with Elasticsearch. Teams can add new metrics to their services, then add rows and panels to their dashboards. Our services are very consistent, though, and there are a number of dashboards that basically all services need, and a subset of dashboards that services of a specific type need. So, what we want is a set of managed dashboards that can easily be defined in code.

To handle this, we’ve added a grafana state module and an elasticsearch execution module to the 2015.2 SaltStack release (in release candidate at the time of this writing). The Grafana state lets you manage rows in dashboards. If no dashboard exists, the state creates it, but from that point on it manages only the rows. Dashboards and rows can be defined directly through the state, but since dashboard definitions can be verbose (and laborious to define), it’s also possible to define them through specified pillar keys, or through default pillar keys.
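
The "create once, then only manage rows" behavior can be sketched in a few lines of Python. This is a rough illustration of the idea, not the state module's actual code: the function name `ensure_dashboard` is hypothetical, and matching managed rows to existing rows by their `title` is an assumption made for the sketch.

```python
import copy


def ensure_dashboard(existing, default_dashboard, managed_rows):
    """Return the dashboard document that should be stored.

    existing: the dashboard currently stored (dict), or None if absent.
    default_dashboard: dashboard settings, used only when creating.
    managed_rows: list of row dicts owned by orchestration.
    """
    if existing is None:
        # No dashboard yet: create it from the defaults plus the managed rows.
        dashboard = copy.deepcopy(default_dashboard)
        dashboard["rows"] = copy.deepcopy(managed_rows)
        return dashboard

    # Dashboard exists: keep its settings and any rows the team added.
    dashboard = copy.deepcopy(existing)
    rows = dashboard.setdefault("rows", [])
    # Assumption: a managed row is identified by its title.
    index_by_title = {row.get("title"): i for i, row in enumerate(rows)}
    for managed in managed_rows:
        position = index_by_title.get(managed.get("title"))
        if position is None:
            rows.append(copy.deepcopy(managed))      # managed row is missing
        else:
            rows[position] = copy.deepcopy(managed)  # managed row has drifted
    return dashboard


# Hypothetical data: one team-owned row and one managed row that drifted.
existing = {
    "editable": True,
    "rows": [
        {"title": "Team metrics", "height": "300px"},
        {"title": "System Health (M)", "height": "10px"},
    ],
}
managed = [{"title": "System Health (M)", "height": "150px"}]
result = ensure_dashboard(existing, {"editable": True}, managed)
created = ensure_dashboard(None, {"editable": True}, managed)
```

In this sketch the team's own row is left untouched, while the managed row is reset to its defined state on every run.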

Here’s an example of defining a dashboard through a state:

    Ensure myservice dashboard is managed:
      grafana.dashboard_present:
        - name: myservice
        - dashboard:
            annotations:
              enable: true
              list: []
            editable: true
            hideAllLegends: false
            hideControls: false
            nav:
              - collapse: false
                enable: true
                notice: false
                now: true
                refresh_intervals:
                  - 10s
                  - 30s
                  - 1m
                  - 5m
                  - 15m
                  - 30m
                  - 1h
                  - 2h
                  - 1d
                status: Stable
    ...
        - rows:
            - collapse: false
              editable: true
              height: 150px
              title: System Health
              panels:
                - aliasColors: {}
                  id: 200000
                  annotate:
                    enable: false
                  bars: false
                  datasource: null
                  editable: true
                  error: false
                  fill: 7
                  grid:
                    leftMax: 100
                    leftMin: null
                    rightMax: null
                    rightMin: null
                    threshold1: 60
                    threshold1Color: rgb(216, 27, 27)
    ...

This is just a small excerpt from what would be a very, very long dashboard definition. Adding this to every service would be really painful and difficult to maintain. So, let’s move this into the pillars:

grafana.sls:

    grafana_dashboards:
      default:
        annotations:
          enable: true
          list: []
        editable: true
        hideAllLegends: false
        hideControls: false
        nav:
          - collapse: false
            enable: true
            notice: false
            now: true
            refresh_intervals:
              - 1m
              - 5m
              - 15m
              - 30m
    ...

    grafana_rows:
      service:
        - collapse: false
          editable: false
          height: 25px
          title: "Panels/rows marked with (M) are managed by orchestration. Don't edit them!"
          panels: []
          showTitle: true
        - collapse: false
          editable: false
          height: 150px
          title: {{ grains.service_name }} (M)
          panels:
            - aliasColors: {}
              aliasYAxis: {}
              annotate:
                enable: false
              bars: false
              datasource: null
              editable: false
    ...
      systemhealth:
        - collapse: false
          editable: false
          height: 150px
          title: System Health (M)
          showTitle: true
          panels:
            - aliasColors: {}
              annotate:
                enable: false
              bars: false
    ...

Notice that we’re making it possible to define multiple dashboards and multiple rows, by making them keys in the related dictionaries. Let’s see how this is used:

    Ensure {{ grains.service_name }} grafana dashboard is managed:
      grafana.dashboard_present:
        - name: {{ grains.service_name }}
        - dashboard_from_pillar: 'grafana_dashboards:default'
        - rows_from_pillar:
          - 'grafana_rows:service'
          - 'grafana_rows:systemhealth'
          ...

Now with a very small amount of code in a service’s orchestration, the service can have a default dashboard with a managed set of rows. The best part is that if we need to modify these rows we can now modify them in a single place and all services will have their dashboards updated to look like every other service.
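
To make the colon-separated pillar keys a little more concrete, here is a minimal Python sketch of how a key like 'grafana_rows:service' can be resolved against nested pillar data, and how several row lists can be combined in order. The helper names `pillar_lookup` and `collect_rows` are hypothetical and the pillar data is trimmed down for illustration; the real state module may resolve and merge these differently.

```python
def pillar_lookup(pillar, key, delimiter=":"):
    """Walk nested dicts using a delimited key such as 'grafana_rows:service'."""
    value = pillar
    for part in key.split(delimiter):
        if not isinstance(value, dict) or part not in value:
            raise KeyError(key)
        value = value[part]
    return value


def collect_rows(pillar, row_keys):
    """Concatenate the row lists found under each pillar key, in order."""
    rows = []
    for key in row_keys:
        rows.extend(pillar_lookup(pillar, key))
    return rows


# Trimmed-down stand-in for the pillar data shown above.
pillar = {
    "grafana_dashboards": {"default": {"editable": True}},
    "grafana_rows": {
        "service": [{"title": "myservice (M)", "height": "150px"}],
        "systemhealth": [{"title": "System Health (M)", "height": "150px"}],
    },
}

dashboard = pillar_lookup(pillar, "grafana_dashboards:default")
rows = collect_rows(pillar, ["grafana_rows:service", "grafana_rows:systemhealth"])
```

Because each set of rows lives under its own key, a service of a specific type can opt into extra rows just by listing one more key.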

We’re really excited to share this back with the community and hope that people will enjoy it and contribute back with features they’d like added. Here’s one example of an addition we’d love to see:

It would be nice to be able to define the dashboards through file templates, rather than just through pillars, since you can pass context from the state into file templates, whereas it’s not possible to do so through pillars.
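
As a purely hypothetical illustration of that idea (this is not an existing feature), the sketch below renders a row definition from a template using context supplied by the caller. Python's standard `string.Template` stands in for a real template engine such as Jinja, and the names `ROW_TEMPLATE` and `render_row` are made up for the example.

```python
import json
from string import Template

# A row definition kept as a template; $service_name and $height are filled
# in from context supplied at render time.
ROW_TEMPLATE = Template(
    '{"title": "$service_name (M)", "height": "$height", "panels": []}'
)


def render_row(context):
    """Render the row template with per-call context and parse the result."""
    return json.loads(ROW_TEMPLATE.substitute(context))


row = render_row({"service_name": "myservice", "height": "150px"})
```

Here the same template could produce different rows for different states, because each state would pass its own context.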


Want to help us write and upstream software like this? Apply for a position at Lyft. If you want to work directly with me, apply for a DevOps Engineer, Senior DevOps Engineer, or Senior Platform Engineer position.

  • This is great!

    I also put together some proof-of-concept graphing with Salt beacons just the other day, with Carbon and Graphite. Here’s the PR in question: https://github.com/saltstack/salt/pull/22305

    I’d love more community feedback and recommendations on how we can extend Salt in this area.