As mentioned in my post on CloudWatch alarms, we (Lyft) believe that it should be easy to do the right thing and difficult to do the wrong thing. We operate on the concept “If you build it, you run it.” Running your own service isn’t easy without the right tools, though.
We’re using Graphite and Grafana for time series data dashboards. With a consistent configuration management pattern, all new services start with their data flowing into Graphite. Dashboard management is tricky, though. We encourage teams to add custom metrics to their services and use them in panels and rows for their services, but we also want to provide a number of consistent panels/rows for all services. We also want to avoid making teams switch between multiple dashboards to monitor their own services.
To make it easy for services to manage their own dashboards, we’re using Grafana backed by Elasticsearch. Teams can add new metrics to their services, then add rows and panels to their dashboards. Our services are very consistent, though: there are a number of dashboards that basically all services need, and a subset of dashboards that services of a specific type need. So, what we want is a set of managed dashboards that can easily be defined in code.
To handle this, we’ve added a grafana state module and an elasticsearch execution module to the 2015.2 SaltStack release (in release candidate at the time of this writing). The grafana state lets you manage rows in dashboards. If no dashboard exists, the module creates it, but from that point on it manages only the rows. Dashboards and rows can be defined directly through the state, but since dashboard definitions can be verbose (and laborious to define), it’s also possible to define them through specified pillar keys, or through default pillar keys.
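The "create once, then manage only the rows" behavior can be sketched roughly as below. This is not the state module's actual code: the function name `apply_managed_rows` is hypothetical, and identifying a managed row by its title is an assumption made for the illustration.

```python
import copy


def apply_managed_rows(existing, default_dashboard, managed_rows):
    """Return the dashboard that should be stored after a state run.

    existing: the stored dashboard dict, or None if none exists yet.
    default_dashboard: definition used only when the dashboard is created.
    managed_rows: list of row dicts owned by configuration management.
    """
    if existing is None:
        # First run: create the dashboard from the supplied definition.
        dashboard = copy.deepcopy(default_dashboard)
    else:
        # Later runs: keep the dashboard-level settings the team has changed.
        dashboard = copy.deepcopy(existing)

    rows = dashboard.setdefault("rows", [])
    # Assumption: a managed row is identified by its title.
    index_by_title = {row.get("title"): i for i, row in enumerate(rows)}
    for row in managed_rows:
        title = row.get("title")
        if title in index_by_title:
            # Managed row already present: reset it to the defined state.
            rows[index_by_title[title]] = copy.deepcopy(row)
        else:
            # Managed row missing: append it after the existing rows.
            rows.append(copy.deepcopy(row))
            index_by_title[title] = len(rows) - 1
    return dashboard
```

In this sketch, rows a team added by hand are left alone, while rows defined in code are reset to their defined state on every run.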
Here’s an example of defining a dashboard through a state:
Ensure myservice dashboard is managed:
  grafana.dashboard_present:
    - name: myservice
    - dashboard:
        annotations:
          enable: true
          list: []
        editable: true
        hideAllLegends: false
        hideControls: false
        nav:
          - collapse: false
            enable: true
            notice: false
            now: true
            refresh_intervals:
              - 10s
              - 30s
              - 1m
              - 5m
              - 15m
              - 30m
              - 1h
              - 2h
              - 1d
            status: Stable
            ...
    - rows:
        - collapse: false
          editable: true
          height: 150px
          title: System Health
          panels:
            - aliasColors: {}
              id: 200000
              annotate:
                enable: false
              bars: false
              datasource: null
              editable: true
              error: false
              fill: 7
              grid:
                leftMax: 100
                leftMin: null
                rightMax: null
                rightMin: null
                threshold1: 60
                threshold1Color: rgb(216, 27, 27)
              ...
This is just a small excerpt from what would be a very, very long dashboard definition. Adding this to every service would be really painful and difficult to maintain. So, let’s move this into the pillars:
grafana.sls:
grafana_dashboards:
  default:
    annotations:
      enable: true
      list: []
    editable: true
    hideAllLegends: false
    hideControls: false
    nav:
      - collapse: false
        enable: true
        notice: false
        now: true
        refresh_intervals:
          - 1m
          - 5m
          - 15m
          - 30m
        ...
grafana_rows:
  service:
    - collapse: false
      editable: false
      height: 25px
      title: "Panels/rows marked with (M) are managed by orchestration. Don't edit them!"
      panels: []
      showTitle: true
    - collapse: false
      editable: false
      height: 150px
      title: {{ grains.service_name }} (M)
      panels:
        - aliasColors: {}
          aliasYAxis: {}
          annotate:
            enable: false
          bars: false
          datasource: null
          editable: false
          ...
  systemhealth:
    - collapse: false
      editable: false
      height: 150px
      title: System Health (M)
      showTitle: true
      panels:
        - aliasColors: {}
          annotate:
            enable: false
          bars: false
          ...
Notice that we’re making it possible to define multiple dashboards and multiple rows, by making them keys in the related dictionaries. Let’s see how this is used:
Ensure {{ grains.service_name }} grafana dashboard is managed:
  grafana.dashboard_present:
    - name: {{ grains.service_name }}
    - dashboard_from_pillar: 'grafana_dashboards:default'
    - rows_from_pillar:
        - 'grafana_rows:service'
        - 'grafana_rows:systemhealth'
    ...
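To make the key syntax concrete, here is a minimal sketch of how colon-delimited keys such as 'grafana_rows:service' could be resolved against nested pillar data and combined into one row list. Salt's own pillar lookups use a colon as the default delimiter, but the helper names `pillar_get` and `collect_rows` are hypothetical, and combining rows in the order the keys are listed is an assumption rather than something taken from the state module's source.

```python
def pillar_get(pillar, key, delimiter=":"):
    """Walk a nested dict using a delimited key such as 'grafana_rows:service'."""
    value = pillar
    for part in key.split(delimiter):
        value = value[part]  # raises KeyError if the path does not exist
    return value


def collect_rows(pillar, row_keys):
    """Concatenate the row lists found under each pillar key, in key order."""
    rows = []
    for key in row_keys:
        rows.extend(pillar_get(pillar, key))
    return rows


# Example pillar data, shaped like the grafana_rows dictionary above.
pillar = {
    "grafana_rows": {
        "service": [{"title": "Notice"}, {"title": "myservice (M)"}],
        "systemhealth": [{"title": "System Health (M)"}],
    }
}
rows = collect_rows(pillar, ["grafana_rows:service", "grafana_rows:systemhealth"])
```

Because each row set lives under its own key, a service can pick only the sets that apply to its type.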
Now, with a very small amount of code in a service’s orchestration, the service can have a default dashboard with a managed set of rows. The best part is that if we need to modify these rows, we can do it in a single place, and every service’s dashboard will be updated to match.
We’re really excited to share this back with the community and hope that people will enjoy it and contribute back with features they’d like added. Here’s one example of an addition we’d love to see:
It would be nice to be able to define the dashboards through file templates, rather than just through pillars, since you can pass context from the state into file templates, whereas it’s not possible to do so through pillars.
Want to help us write and upstream software like this? Apply for a position at Lyft. If you want to work directly with me, apply for a DevOps Engineer, Senior DevOps Engineer, or Senior Platform Engineer position.