Large network and IT operations centers handle hundreds to thousands of incidents daily. Many of these incidents degrade the quality of service offered to customers, affecting revenue and customer loyalty. It is imperative for these operations centers to have a robust software tools strategy to resolve incidents quickly and cost-effectively. Despite trying numerous approaches such as knowledge management and automation, these operations centers struggle to find a satisfactory approach to incident resolution.
L1 agents on the frontline resolving these incidents don’t have access to the precise, context-specific guidance needed to solve the problem effectively. They are forced to search for “hits” in knowledge repositories that are designed to provide some directional guidance at best, not the precise, tested, prescriptive procedures for the specific context of the incident that L1 agents need. As a result, L1 agents escalate to Level 2 engineers or field dispatch, significantly increasing costs and time to repair. In addition, these knowledge management systems have no mechanism to reduce tedious, time-consuming steps or to reduce the scope for manual errors.
Automation software such as IT Process Automation tools from vendors like BMC, CA, HP and IPSoft can at best be applied to scenarios where closed-loop automations can be defined and developed to diagnose and repair a problem. However, it is estimated that only 15% of incidents can be handled using this closed-loop automation approach. The remaining 85% need some manual intervention that the automation tools do not provide.
These limitations of knowledge management and closed-loop automation tools mean that operations centers are forced to operate under a highly inefficient and costly incident resolution strategy, with an extremely large army of L1 agents to manage the scale. What is needed is a solution with multi-faceted capabilities that work in complete synergy, making both automated and manual resolutions effective. What should these capabilities look like? Let’s take a look.
1. Automation
Automation is the most powerful tool to accelerate the resolution process and reduce cost and error rates. Traditional approaches have treated automation as all or nothing: either automation is applied to the entire resolution process, from validation to diagnostics and repair, or it plays no role in incident resolution. This does not have to be true. Automation can be used to automate parts of a manual process. The results of these automated sub-steps can then be contextually fed into manual-resolution guidance to drive a harmonious interplay between automated and manual sub-processes. This approach not only cuts time and error rates from manual processes, it also allows businesses to invest in automation for any resolution process. As a plus, they can automate at the pace they deem most fit and not be blocked by development resources and budgets. For example, the validation of the incident can be automated in phase 1, and the diagnostics and repair can be automated in later phases.
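To make the idea of partial automation concrete, here is a minimal Python sketch of a procedure that mixes automated and manual steps. It is an illustration only, not how any particular product works: the `Step` class, the `validate_incident` check and the `run_procedure` function are hypothetical names invented for this example, and the "validation" is a stand-in for a real reachability test.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional

Context = Dict[str, str]


@dataclass
class Step:
    """One step of a resolution procedure (hypothetical structure)."""
    name: str
    # Automated steps carry a function; manual steps carry guidance text.
    automation: Optional[Callable[[Context], Context]] = None
    guidance: str = ""


def validate_incident(ctx: Context) -> Context:
    # Stand-in for a real check such as pinging the device.
    # Assumption for this sketch: only "router-7" is unreachable.
    reachable = "no" if ctx["device"] == "router-7" else "yes"
    return {"device_reachable": reachable}


def run_procedure(steps: List[Step], ctx: Context) -> List[str]:
    """Run automated steps and return the guidance for the manual ones.

    Results of automated steps are merged into the context so that the
    manual guidance shown to the agent can refer to them.
    """
    instructions = []
    for step in steps:
        if step.automation is not None:
            ctx.update(step.automation(ctx))
        else:
            instructions.append(step.guidance.format(**ctx))
    return instructions


# Phase 1: only validation is automated; diagnosis stays manual.
procedure = [
    Step("validate", automation=validate_incident),
    Step(
        "diagnose",
        guidance="Device {device} reachable: {device_reachable}. "
                 "Check the uplink port manually.",
    ),
]

for line in run_procedure(procedure, {"device": "router-7"}):
    print(line)
```

In a later phase, the manual "diagnose" step could be replaced by another automated function without changing the rest of the procedure, which is the incremental path described above.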
Most automation vendors have failed to make automation truly easy to develop and have not considered the fact that operations centers have limited dedicated developer resources. Non-professional developers such as L2 agents or Subject Matter Experts (SMEs) should be able to codify procedures into automated steps without advanced coding skills. This is all the more important because any bottleneck in automation roll-out makes the solution ineffective.
2. Process Guidance
The first line of incident resolution when a human is required is the L1 agent. For resolution strategies to be effective, these L1 agents need to be able to quickly validate, diagnose and repair the problem. Often, L1 agents are less experienced and do not have the skill or knowledge to close the incident without precise, context-based guidance and support. The resolution strategy needs to accommodate this reality. Tools need to give L1 agents prescriptive, step-by-step guidance that hides the underlying complexities and simplifies interaction with the tool. More importantly, the guidance tool needs to leverage the power of automation at every possible opportunity (e.g. creating and updating tickets, validating the incident, gathering diagnostic information) to accelerate the resolution process and remove the scope for errors that L1 agents can inadvertently introduce.
3. Collaboration
Incident resolution is a team activity rather than an individual one. L1 agents need access to the right subject matter experts and contextual guidance to resolve each incident. When L1 agents find gaps in procedures or automations, they need to be able to seamlessly escalate to the procedure / automation development teams to fill the gaps. When new automations / procedures are developed, key stakeholders and users need to be notified. L1 agents and SMEs need to be able to rate or comment on procedures and automations, or start a discussion thread on a specific ticket. All these collaborative activities need to happen seamlessly within the lifecycle of an incident resolution.
4. Insight
The large scale of incidents has created the need for deeper insight into the nature of incidents in order to devise a pragmatic strategy for resolution. Some key questions are: Which incidents occur most frequently or have the highest business impact? What resolution steps are agents following to solve an incident, and which steps are causing the most delay? Who are the most productive agents, and are they collaborating effectively? These questions need to be easily answered to devise a strategy spanning automation and manual processes. The incident resolution tool needs to make such insight part and parcel of the core resolution capabilities. Automations can then be developed for incidents with the strongest business impact, and guided procedures can be tightened in cases where L1 agents are consistently getting stuck.
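As a rough illustration of how two of these questions might be answered from resolution records, here is a small Python sketch. The record layout, the sample data and the function names are all assumptions made up for this example; a real tool would pull these figures from its own ticketing and procedure logs.

```python
from collections import Counter, defaultdict
from statistics import mean

# Hypothetical records: incident type plus minutes spent on each step.
INCIDENTS = [
    {"type": "link_down", "steps": {"validate": 2, "diagnose": 15, "repair": 5}},
    {"type": "disk_full", "steps": {"validate": 1, "diagnose": 3, "repair": 4}},
    {"type": "link_down", "steps": {"validate": 3, "diagnose": 25, "repair": 6}},
]


def incident_frequency(incidents):
    """Return (incident type, count) pairs, most frequent first."""
    return Counter(i["type"] for i in incidents).most_common()


def slowest_step(incidents, incident_type):
    """Return the step with the highest average duration for one type.

    Returns None if no incident of that type has been recorded.
    """
    durations = defaultdict(list)
    for incident in incidents:
        if incident["type"] == incident_type:
            for step, minutes in incident["steps"].items():
                durations[step].append(minutes)
    if not durations:
        return None
    averages = {step: mean(values) for step, values in durations.items()}
    return max(averages, key=averages.get)


print(incident_frequency(INCIDENTS))
print(slowest_step(INCIDENTS, "link_down"))
```

With this sample data, "link_down" is the most frequent incident and "diagnose" is its slowest step, which would make link-down diagnostics a natural first candidate for automation or for tighter guidance.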
5. Process Improvement
One of the most important reasons for the failure of the incumbent approaches to incident resolution is that they are not maintainable over the long haul. Procedures and automations are not maintained and go stale. Gaps begin to emerge in the automation and procedure library, and there is no clear way to quickly identify these gaps and add new procedures and automations to the library in an agile fashion. A major contributor to this situation is that consumers of the procedures and automations are not empowered to create and maintain content. Dedicated content and automation development teams, separate from the users, are created, and over time a big chasm develops between the processes of the two groups. Without a clear content maintenance strategy, these systems lose credibility and agents simply stop using them, leading to their ultimate demise. Any successful incident resolution system needs a maintenance model that is seamless and integrated with the use of the tool.
Resolve is the only software tool in the market that fully integrates all the core capabilities – automation, process guidance, collaboration, insight and process improvement – to provide the market-leading resolution solution. It is not just the individual capabilities but how these key pillars support and complement each other that provides the magnified value to customers. Some highlights of Resolve: