Recently, we were surprised at the confusion of some organizations about the process of incident management, so we thought – why not to put a quick primer down on paper?
To be successful with incident management, first you need a process – a repeatable sequence of steps and procedures. Such a process may include four broad categories of steps: detection, diagnosis, repair, and recovery.
Identification Incident management begins with problem identification. This can be handled using different tools. For instance, infrastructure monitoring tools help identify specific resource utilization issues, such as disk space, memory, CPU, etc. End user experience tools can mimic user behavior and identify users’ POV problems such as response time and service availability. Last but not least, domain-specific tools enable detecting problems within specific environments or applications, such as a database or an ERP system.
On the other hand, users can help you detect unknown problems that are not reported by infrastructure or user behavior monitoring tools. The drawback with problem detection by users is that it usually happens late (the problem is already there), moreover the symptoms reported may lead you to point to the wrong direction.
So which method should you use? Depending on your environment, the usage of the combination of multiple methods and tools would be the best solution. Unfortunately, no single tool will enable detecting all problems.
Logging events will allow you to trace them at any point to improve your process. Properly logged incidents will help you investigate past trends and identify problems (repeating incidents from the same kind), as well as to investigate ownership taking and responsibility.
Classification of events lets you categorize data for reporting and analysis purposes, so you know whether an event relates to hardware, software, service, etc. It is recommended to have no more than 5 levels of classification; otherwise it can get very confusing. You can start the top level with something like Hardware / Software / Service, or Problem / Service request.
Prioritization lets you determine the order in which the events should be handled and how to assign your resources. Prioritization of events requires a longer discussion, but be aware that you need to consider impact, urgency, and risk. Consider the impact as critical when a large group of users are unable to use a specific service. Consider the urgency as high when the impacted service is of critical nature and any downtime is affecting the business itself. The third factor, the risk, should be considered when the incident has not yet occurred, but has a high potential to happen, for example, a scenario in which the data center’s temperature is quickly rising due to an air conditioning malfunction. The result of a crashing data center is countless services going down, so in this case the risk is enormous, and the incident should be handled at the highest priority.
Diagnosis is where you figure out the source of the problem and how it can be fixed. This stage includes investigation and escalation.
Investigation is probably one of the most difficult parts of the incident management process. In fact, some argue that when resolving IT problems, 80% of the time is spent on root cause analysis vs. 20% that is spent on problem fixing. With more straightforward problems, Runbook procedures may be very helpful to accelerate an investigation, as they outline troubleshooting steps in a methodical way.
Runbook tip: The most crucial part of the runbook is the troubleshooting steps. They should be written by an expert, and be detailed enough so every team member can follow them quickly. Write all your runbooks using the same format, and insist on using the same terms in all of them. New team members who are not familiar yet with every system will be able to navigate through the troubleshooting steps much more easily.
Following the runbook can be very time consuming and lengthen the recovery time immensely. Instead, consider automating the diagnostic steps by using run book automation software. If you build the flow cleverly and weigh in all the steps that lead to a conclusion, automating the diagnostics process will give you quick answers, and help you decide what your next step is.
Escalation procedures are needed in cases when the incident needs to be resolved by a higher support level.
The repair step, well… it fixes the problem. This may sometimes involve a gradual process, where a temporary fix or workaround is implemented primarily to bring back a service quickly. An incident repair may involve anything from a service restart, a hardware replacement, or even a complex software code change. Note that fixing the current incident does not mean that the issue won’t recur, but more on that issue in the next step.
In this case too, straightforward repairs such as a service restart ,a disk cleanup and others can be automated.
The recovery phase involves two parts: closure and prevention.
Closure means handling any notifications previously sent to users about the problem or escalation alerts, where you are now notified about the problem resolution. Moreover closure also entails the final closure of the problems in your logging system.
Prevention relates to the activities you take, if possible, to prevent a single incident from occurring again in the future and therefore becoming a problem. Implement two important tools to help you in this task:
RCA process (Root Cause Analysis) The purpose of the RCA process is to investigate what was the root cause that led to the service downtime. It is important to mention that the RCA process should be performed by the service owners, who are not necessarily the ones who solved the specific incident. This is an additional reason why incident logging is so important – the information in the ticket is crucial for this investigation process.
And finally, Incident reports – while this report will not prevent the problem from occurring again, it will allow you to continually learn and improve your incident management process.
Automating network health checks & diagnostics accelerates service restoration during severe weather