Solving challenges in network and IT operations is something of a constant mantra, and one thread that has run through it over the years is to “do more with less.” The key to scaling IT operations is automation, which makes repetitive manual tasks virtually obsolete. The problem is that, on the surface, automation does not look all that difficult, given that people have been scripting since the invention of the computer. But when you start tackling something of more substance, you quickly find that it is taking much longer than expected, and perhaps it is at this point that you begin to wish you hadn’t started the project in the first place.
There is a huge, largely untapped opportunity to automate incident diagnosis and resolution in network and IT operations. Today this work depends heavily on IT personnel and agents to address events and alerts from the IT infrastructure, as well as incidents reported by customers.
For many operation centers, the number of incidents can be so overwhelming that the operational support team often cannot keep up with demand. Automating the diagnostic and repair processes helps, and tools such as HP OO, BMC AO, CA PAM, and the RESOLVE Software System simplify matters somewhat. What you soon find out, however, is that once you move beyond the simple processes, it gets much tougher very quickly. Instead of taking two weeks to build an automation, it now takes two months. Instead of requiring just a simple “script”, automating a process has turned into building a full-blown “custom application”.
The nature of automation is that, to automate well, you really need to understand every step in the process in detail, including all of the conditional possibilities, the exceptions that can happen, and how to deal with them. As the process gets bigger and more complicated, the automation gets exponentially harder to build. For incident resolution, however, there is an alternative to treating automation as something that runs in a black box behind the scenes, which is what I refer to as “closed-loop” automation. Taking a human- or engineer-driven approach, with interactive automations that specifically target the manual or repetitive sub-steps of a complex process, really changes the dynamics. Instead of trying to “boil the ocean” by automating the process completely from end to end, selectively automating just a subset of repetitive tasks makes building automations considerably easier. In essence, you leave all of the complex reasoning to the support engineers and use automations to give them the contextual details they need to make decisions.
RESOLVE accomplishes this by combining process guidance and workflow, such as remediation procedures and decision trees, directly with automations. For example, when users navigate from an alert in an event management system, or from a ticket in an incident management system, to RESOLVE in order to repair the problem, they immediately see the set of instructions along with the contextual details that the automation retrieved when the problem was reported. The support engineer can manually trigger additional automations embedded directly within the procedure to retrieve further diagnostic details or initiate a fix as needed. The main point here is that it is much easier to build these simpler “partial” automations and bind them together with people and process guidance than to try to “boil the ocean” from end to end. This allows many problems that are typically escalated to Level 2 and Level 3 support to be shifted to Level 1 engineers, improving MTTR and first-call resolution. There are tremendous benefits to be had when you combine knowledge and automation, but that’s a topic for another post.
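To make the idea of “partial” automations concrete, here is a minimal Python sketch of a procedure that mixes written guidance with automations the engineer can trigger. It is an illustration only and not how RESOLVE is implemented: the names `Step`, `Procedure`, `check_interface`, `restart_interface` and `open_procedure`, as well as the alert fields, are all hypothetical, and the two automations return canned values where a real one would query or change a device.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Optional

Context = Dict[str, str]


@dataclass
class Step:
    """One line of guidance, optionally backed by a small automation."""
    instruction: str
    automation: Optional[Callable[[Context], Context]] = None


@dataclass
class Procedure:
    """A remediation procedure plus the context gathered so far."""
    name: str
    steps: List[Step]
    context: Context = field(default_factory=dict)

    def run_automation(self, index: int) -> Context:
        """Run the automation attached to one step and merge its findings."""
        step = self.steps[index]
        if step.automation is None:
            raise ValueError(f"step {index} is a manual step")
        self.context.update(step.automation(self.context))
        return self.context


def check_interface(ctx: Context) -> Context:
    # Hypothetical diagnostic: a real version would query the device.
    return {"interface_status": "down"}


def restart_interface(ctx: Context) -> Context:
    # Hypothetical fix: a real version would issue the restart.
    return {"interface_status": "up", "last_action": "restart"}


def open_procedure(alert: Context) -> Procedure:
    """Build the procedure for an alert and pre-fetch diagnostic context."""
    proc = Procedure(
        name="Link down",
        steps=[
            Step("Review the interface status.", check_interface),
            Step("Decide whether a restart is safe for this customer."),
            Step("Restart the interface.", restart_interface),
        ],
        context=dict(alert),
    )
    proc.run_automation(0)  # context is gathered as the problem is reported
    return proc
```

In this sketch the engineer opens the procedure from an alert, sees the diagnostic details already filled in, makes the judgment call in the second step unaided, and only then triggers the fix with `proc.run_automation(2)`. The decision logic never has to be encoded, which is what keeps each automation small.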