Whether you call it an incident, outage, surprise, or unplanned work, your application isn't working as expected, and you need to deal with the problem. Your aim during incident resolution is to minimize the impact and get services restored as quickly as possible.
I am defining an incident as an unplanned service interruption or reduction in quality. An incident can be:
- a delay in search queries being returned.
- communication issues between a third-party component and your service.
- an outage due to a cyber attack.
As consumers, we expect the SaaS applications and technologies we use to be always on. However, this isn't reality. As technologists, we often deal with surprises when our applications behave in unexpected and unwanted ways. The consequences of this unplanned work go beyond the impact on end users. According to PagerDuty's Unplanned Work: The Human Impact of an Always-on World report, working on incidents:
- directly affects a company's ability to innovate.
- results in unhappy customers.
- impacts the well-being of employees.
Demand for digital services is increasing, and that demand can result in more incidents. Why? As more customers arrive, they bring increased traffic and unexpected use cases. Your systems may not be able to handle the traffic load or the shifting customer needs, and addressing those needs may require rapid feature releases. Whatever the reason for an incident, you need to resolve it quickly.
Why is fixing incidents difficult?
Organizations need the right tools and processes in place to quickly identify and resolve incidents. Some organizations have automated many incident processes, such as communication, ticket creation, notifications, and chat channel creation. However, there is room for growth in both the number of organizations automating incident resolution tasks and the types of tasks they automate. Surprisingly, the PagerDuty study found that 90% of companies use little to no automation for technology issues.
An extra challenge organizations face today is the number of employees working remotely due to the pandemic. Earlier this year, Atlassian released a report on the State of Incident Management that includes a section on the pandemic's impact. The survey showed that while demand for services is increasing, 51% of respondents reported an increase in incident response times now that people are working remotely.
It's not necessarily that working remotely makes people less productive; it's that these aren't “normal” times. Employees are juggling other roles as caregivers, parents, and partners while trying to focus on work. In March, a friend shared with me the concept of a “Pandemic Tax,” which they use in sprint planning to estimate how long a task will take. Consider that reduction in people's capacity in everything from sprint planning to estimates for incident resolution.
Improving incident management with feature flags
Kill switches or circuit breakers
If you're already using feature flags for release management, you have likely used a flag to disable a newly released feature that wasn't behaving as expected. Kill switches and circuit breakers allow you to minimize the blast radius when something goes awry with a feature, without redeploying or restarting the service. While most commonly used during releases, they can also serve long-term operational needs, such as disabling certain capabilities when a backend system is down for an extended period.
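As a rough sketch of the pattern, the code below wraps a new feature behind a kill-switch flag. The flag store, flag key, and function names are illustrative stand-ins; in practice the flag value would come from your feature management SDK rather than an in-memory dict.

```python
# Minimal kill-switch sketch. A real implementation would evaluate the flag
# through your feature management SDK; an in-memory dict stands in here so
# the example runs on its own.
FLAGS = {"enable-recommendations": True}  # flip to False to "kill" the feature

def fetch_personalized_recommendations(user_id: str) -> list:
    return [f"personalized-item-for-{user_id}"]

def fetch_default_recommendations() -> list:
    return ["popular-item-1", "popular-item-2"]

def get_recommendations(user_id: str) -> list:
    # If the flag is off, skip the new code path entirely and fall back to
    # the stable behavior -- no redeploy or service restart required.
    if FLAGS.get("enable-recommendations", False):
        return fetch_personalized_recommendations(user_id)
    return fetch_default_recommendations()

print(get_recommendations("user-123"))
```

Flipping the flag to false immediately routes all traffic back to the stable path, which is the whole point of a kill switch during an incident.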
Throttling
Sometimes you don't want to, or can't, completely disable a feature. Instead, you want to throttle or degrade service for a set of users. Say you're having issues with too many requests to your API. You can customize how many requests each route can handle. You can allow only 100 requests per minute for all customers, but if some customers need more for contractual reasons, you can make exceptions for them. Or, if a single customer is sending too many requests, you can limit that customer's requests while you troubleshoot.
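A sketch of flag-driven throttling under those assumptions: the per-customer limit comes from a flag-style lookup, so you can raise it for a contractual exception or lower it for a noisy customer without a deploy. The limit function, customer IDs, and numbers below are made up for illustration.

```python
import time
from collections import defaultdict

# Stand-in for a per-customer flag evaluation; a real feature management SDK
# would return different limits based on targeting rules for each customer.
def requests_per_minute_limit(customer_id: str) -> int:
    overrides = {"big-enterprise-co": 1000, "noisy-customer": 10}
    return overrides.get(customer_id, 100)  # default: 100 requests/minute

window_start = defaultdict(float)
request_count = defaultdict(int)

def allow_request(customer_id: str) -> bool:
    # Simple fixed-window rate limiter keyed by customer.
    now = time.time()
    if now - window_start[customer_id] >= 60:
        window_start[customer_id] = now
        request_count[customer_id] = 0
    if request_count[customer_id] >= requests_per_minute_limit(customer_id):
        return False  # throttle: this customer is over its limit
    request_count[customer_id] += 1
    return True

print(allow_request("acme-co"))  # True until acme-co exceeds its limit
```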
Reducing MTTR
The metric many organizations track when it comes to incident resolution is mean time to recovery (MTTR). But before you can repair what went wrong, you need to identify the problem. The majority of time spent resolving an incident often goes to figuring out what changed or what triggered it.
Feature flag event data provides a valuable data point when troubleshooting an incident. When an alert comes in, correlating telemetry data with flag event data can show you whether a feature was recently enabled or changed, providing insight into the incident. If you see that a feature was enabled shortly before the incident occurred, you can use its flag to throttle traffic or disable the feature. Disabling a feature via a flag takes less time than rolling back a deployment, helping you mitigate the impact of an incident.
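To illustrate the correlation step, the sketch below checks which flags changed shortly before an alert fired. The event records and timestamps are invented; in practice they would come from your flag platform's audit log and your alerting tool.

```python
from datetime import datetime, timedelta

# Example flag change events (in practice, pulled from your flag audit log).
flag_events = [
    {"flag": "enable-recommendations", "action": "turned on",
     "at": datetime(2020, 11, 3, 14, 2)},
    {"flag": "new-checkout-flow", "action": "targeting changed",
     "at": datetime(2020, 11, 1, 9, 30)},
]

def recent_flag_changes(alert_time: datetime, lookback_minutes: int = 60) -> list:
    # Flags changed shortly before the alert are prime suspects to throttle
    # or disable first.
    cutoff = alert_time - timedelta(minutes=lookback_minutes)
    return [e for e in flag_events if cutoff <= e["at"] <= alert_time]

alert_time = datetime(2020, 11, 3, 14, 10)
for event in recent_flag_changes(alert_time):
    print(f'{event["flag"]} was {event["action"]} at {event["at"]}')
```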
Automation
We automate tasks to reduce toil. Consider the tasks you regularly perform when resolving an incident, such as adjusting logging levels, rate-limiting a service, or disabling a service, and look for ways to automate them as well. Save the cognitive load for resolving the problem, not addressing toil. Using feature flags during incident management can help you automate these tasks and eliminate toil.
At LaunchDarkly, we have a flag that regulates how heavily we sample our telemetry data. When there's an incident, we tweak the flag to collect higher fidelity data for select services that may be relevant to the incident. Being able to adjust the sampling programmatically based on an incident helps us quickly gather relevant data.
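To show the shape of the idea (this is a simplified sketch, not our production implementation; the flag values and service names are made up), a flag-driven sampler might look like this:

```python
import random

# Stand-in for a sampling-rate flag evaluated per service. During an incident,
# raising the rate for the affected services yields higher fidelity telemetry
# immediately, without a deploy.
SAMPLING_RATE_FLAG = {"checkout-service": 1.0, "default": 0.05}

def should_record_trace(service: str) -> bool:
    rate = SAMPLING_RATE_FLAG.get(service, SAMPLING_RATE_FLAG["default"])
    return random.random() < rate

if should_record_trace("checkout-service"):
    print("recording full trace")  # capture high-fidelity telemetry here
```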
Another way to automate data collection during incidents is to adjust logging levels when a notification is received. When an alert is triggered, you can programmatically set the logging level to ‘debug’. When the incident is marked as resolved, that triggers the flag to set the logging level back to ‘info’.
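A minimal sketch of that pattern, assuming an alert webhook that flips a log-level flag which the application then reads. The payload shape and flag store below are assumptions, not any specific alerting product's schema.

```python
import logging

# In-memory stand-in for a "log-level" operational flag. In practice the
# alerting tool's webhook would update the flag in your feature management
# platform, and the application would read it through the SDK.
FLAGS = {"log-level": "info"}

def handle_alert_webhook(payload: dict) -> None:
    # Hypothetical payload shape: {"status": "triggered" | "resolved"}
    if payload.get("status") == "triggered":
        FLAGS["log-level"] = "debug"   # collect more detail during the incident
    elif payload.get("status") == "resolved":
        FLAGS["log-level"] = "info"    # drop back down once it's resolved

def apply_log_level() -> None:
    level = getattr(logging, FLAGS["log-level"].upper(), logging.INFO)
    logging.getLogger().setLevel(level)

handle_alert_webhook({"status": "triggered"})
apply_log_level()
```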
Update your processes and runbooks
You may not be aware of all the operational flags your application needs. If, after an incident, you find yourself saying, “I wish we had a flag for that,” create it. Not every incident should result in the creation of a feature flag, but if you identify a repeated issue during the post-incident review, create an operational flag.
As you create flags and automate incident management tasks, make sure to update your runbooks. Indicate where flags exist, when they are triggered automatically, and when they require manual intervention.
Instead of waiting for an incident to determine where an operational flag is needed, conduct chaos experiments or game days. If you understand how your application performs when problems arise, you can proactively create any necessary flags and be prepared.
Conclusion
When an alert is triggered, the clock is already ticking. With people and architecture distributed, incidents can occur more frequently and be harder to coordinate. You can make things a little easier by using feature flags, a technology you may already have in place, to improve how you respond to and manage incidents.