Demonware Automates Alerting and Work Prioritization
These are notes from a presentation given by Lisa Reilly of Demonware at the Atlassian User Group for Galway, held at MetLife’s Galway offices on 23rd Feb 2018.
Demonware is responsible for developing and running the services and infrastructure for some of the largest entertainment franchises in the world. They work side by side with studios at Activision to make radical game designs a reality at massive scale. Demonware ensure that those features keep running 24/7 for years after launch.
Lisa Reilly is part of the Platform engineering department and is the Senior Project Manager for the monitoring function. The team’s responsibilities include NOC-style coverage of server or service alerts and service or title outages. As Demonware are responsible for Activision Publishing’s online services, this means managing thousands of servers and dozens of services across multiple data centres.
The Monitoring team were historically dealing with alerts from multiple sources: Nagios, email, Zendesk, Dashing, Graphite and several others. Tickets to track this work were created manually by engineers in Zendesk. The huge volume of inbound data to track and observe meant that on some shifts the team had little room to perform any other tasks, and constant context switching was a huge drain. Because the Zendesk tickets were also created on the fly, they were inconsistent and not useful for reporting or for tracking error patterns.
One of the first steps the team took was to move from using Zendesk to JIRA for tracking their own work. Tickets were still manually created but could now follow templates, and reporting benefited from the addition of custom fields.
The team reached out to its internal customers and, with them, began rewriting service checks in Sensu and using StackStorm rules to decide what to do when a check failed. Instead of polling multiple different check types, everything is now routed through StackStorm. The team provided guidance on how to write more meaningful checks, so that each failure could trigger an action that the engineer on shift could take.
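As a rough illustration of the "one router, many checks" idea, the sketch below maps a failed check to an action through an ordered list of rules. It is plain Python, not StackStorm's own rule syntax, and every name in it (CheckResult, Rule, the check names, the action names) is hypothetical. The 0/1/2 status values assume the common OK/warning/critical exit-status convention.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class CheckResult:
    check: str   # hypothetical check name, e.g. "disk_usage"
    host: str
    status: int  # assumed convention: 0 = OK, 1 = warning, 2 = critical


@dataclass
class Rule:
    name: str
    matches: Callable[[CheckResult], bool]
    action: str  # hypothetical action name


# Rules are evaluated in order; the first match wins.
RULES: List[Rule] = [
    Rule("critical-disk",
         lambda r: r.check == "disk_usage" and r.status == 2,
         "create_ticket"),
    Rule("any-warning",
         lambda r: r.status == 1,
         "notify_chat"),
]


def route(result: CheckResult, rules: List[Rule] = RULES) -> Optional[str]:
    """Return the action for a failed check, or None if the check passed."""
    if result.status == 0:
        return None
    for rule in rules:
        if rule.matches(result):
            return rule.action
    # Assumed fallback: unmatched failures are still surfaced to a human.
    return "notify_chat"
```

Keeping the decision in one place means a new check only needs a new rule, rather than another screen for the engineer on shift to watch.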
When a check fails, a ticket is created in an Alert queue for the team in JIRA. Early on in the project, the tooling would instead post a message to Slack on a failure, and the engineer would decide whether it needed a ticket and click a button to create one. This was valuable throughout the project while failing checks were still written in the “old” style, as their volume was very high due to historical checks not being updated or having incorrect thresholds set. The team wanted to avoid having hundreds of unactionable tickets to deal with.
Throughout this process, useful checks were added and unactionable ones were deprecated. Once the team had a stream of actionable failures, they automated the creation of tickets in JIRA. Around this time, they also moved to using JIRA Service Desk, which gave the team powerful reporting that they had not had access to before.
JIRA Service Desk made it possible to set SLAs based on a ticket’s priority, and the team could automatically set that priority based on the importance of the failing check. As all of the previous data inputs now had a check to show failure, the team had a single queue to work from. This meant having one JIRA Service Desk browser tab open instead of multiple tabs and screens. With the built-in reporting, the team can quickly create reports, without any scripting, to show troublesome hosts, checks, services and so on. It also gives insight into which shifts are busiest, and into when and how many interrupts occur during a typical shift. This makes sprint planning for the team more effective, as it was previously impossible to judge sprint capacity.
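The link between check importance, ticket priority and SLA can be sketched as two lookup tables. This is only an illustration: the importance labels, priority names and SLA durations below are invented for the example and were not given in the talk. In practice the SLA itself would be configured in JIRA Service Desk rather than computed in code.

```python
from datetime import datetime, timedelta

# Hypothetical mapping from a check's importance to a ticket priority.
PRIORITY_BY_IMPORTANCE = {"critical": "P1", "high": "P2", "normal": "P3"}

# Hypothetical SLA targets per priority.
SLA_BY_PRIORITY = {
    "P1": timedelta(minutes=15),
    "P2": timedelta(hours=1),
    "P3": timedelta(hours=8),
}


def build_ticket(check: str, importance: str, created_at: datetime) -> dict:
    """Derive the priority and SLA due time for a failing check."""
    # Unknown importance levels fall back to the lowest priority (assumption).
    priority = PRIORITY_BY_IMPORTANCE.get(importance, "P3")
    return {
        "summary": f"Check failed: {check}",
        "priority": priority,
        "due": created_at + SLA_BY_PRIORITY[priority],
    }
```

For example, build_ticket("disk_usage", "critical", datetime(2018, 2, 23, 9, 0)) yields a P1 ticket due at 09:15 under these made-up targets.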
The team use a JIRA to Slack integration to post details of new tickets to a chat channel. They have written an automation to enable chatops, so they can update JIRA tickets from within Slack using the ticket ID and simple commands. This makes assigning, commenting on and transitioning tickets easy if an engineer is away from their desk or notices a new ticket notification!
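The talk did not show the command syntax, so the following is only a guess at what a "ticket ID plus simple command" parser might look like. The message format (ticket key, verb, free text) and the three verbs are assumptions. The step that would actually call JIRA is deliberately left out.

```python
import re
from typing import NamedTuple, Optional


class Command(NamedTuple):
    ticket: str    # e.g. "ALERT-42" (hypothetical project key)
    action: str    # one of: assign, comment, transition
    argument: str  # assignee, comment text, or target status


# Assumed message shape: "<TICKET-ID> <action> <argument...>"
PATTERN = re.compile(
    r"^(?P<ticket>[A-Z][A-Z0-9]*-\d+)\s+"
    r"(?P<action>assign|comment|transition)\s+"
    r"(?P<argument>.+)$"
)


def parse_command(message: str) -> Optional[Command]:
    """Parse a chat message into a Command, or return None if it is not one."""
    match = PATTERN.match(message.strip())
    if match is None:
        return None
    return Command(match["ticket"], match["action"], match["argument"])
```

Messages that do not match are ignored rather than rejected, so ordinary conversation in the channel is left alone.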
For paging escalations, the team have integrated JIRA with PagerDuty using a plugin called Notification Assistant. This provides similar functionality to the popular JIRA Automation plugin. As the alert volume has decreased for the team, they also use PagerDuty themselves to receive pages before any SLA breaches in JIRA Service Desk. This has meant that throughout the course of the project they have moved from having multiple tabs and screens open to just needing their phone at hand. Tweaking the team’s process is made very simple with JIRA Service Desk.
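The "page before the SLA breaches" behaviour boils down to a time comparison. The sketch below assumes a five-minute warning window and an "acknowledged" flag on the ticket. Both are illustrative choices, not details from the talk, and the real logic lives in the JIRA and PagerDuty integration rather than in standalone code.

```python
from datetime import datetime, timedelta


def should_page(now: datetime,
                sla_due: datetime,
                acknowledged: bool,
                warn_before: timedelta = timedelta(minutes=5)) -> bool:
    """Return True once an unacknowledged ticket enters the warning window."""
    if acknowledged:
        return False  # someone is already working on it
    return now >= sla_due - warn_before
```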
All in all, the work took place over 12-18 months, and maintenance and tweaks are ongoing.
They are now spreading this gospel to other parts of the Activision Blizzard King group who are interested in using the solution.