Operations centers address events and incidents every day, often on the order of thousands. Events are typically monitored by event management systems, such as IBM Netcool, and incidents are typically tracked through ticketing systems, such as BMC Remedy. The resolution process for these events and incidents is largely manual today: skilled engineers validate the event or incident, determine the root cause, run a variety of diagnostic procedures, and then repair the systems involved. Once the incident is resolved, the ticketing system is updated with the resolution details.
Some content management systems, such as SharePoint, can record general resolution procedures and hold informal libraries of scripts and automations used by lower-level agents. Even so, the process remains largely manual and depends heavily on “tribal knowledge” and the capabilities of skilled engineers. This approach does not work in large operations centers, especially those handling a high volume of events and incidents. The cost of resolution is prohibitively high, and long resolution times hurt customer satisfaction, which leads to lost revenue.
Whether companies are leveraging event, incident, and/or content management systems, the below challenges remain:
At Resolve, we’ve worked with many large organizations to develop the best practices and fundamental capabilities required for event management and incident management. These capabilities address the pain points that organizations face when they try to optimize their event and incident management processes:
1. Process Guidance
Front-line support staff should have access to resolution procedures that are tailored to the current incident context. L1 agents are the first line of resolution, and in most organizations the role is filled by less-trained, less-technical staff. Without a tool that gives them clear resolution procedures in the context of the issue, L1 agents end up escalating to L2 and field engineers, which drives up the cost of resolution and delays the resolution process. Also, as new services are added, operations centers and their L1 agents must manage more events and incidents with no budget to add staff. The process guidance capabilities of an incident resolution system therefore become front and center.
2. Automation
The ability to automate some or all of a procedure is essential to dramatically improving resolution time. It also reduces the errors that can come from manual procedures. To use automation effectively, the resolution tool should be able to automate not just end-to-end procedures but also sub-tasks within manual resolution procedures. For example, diagnostics can be run automatically, with the results either identifying next steps for an L1 agent or triggering another automation that updates the ticket.
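The diagnostic example above can be sketched in a few lines of Python. This is a minimal illustration, not how any particular product works: the `Ticket` class, the `triage` function, and the check names are all hypothetical, and the checks are stubbed with lambdas rather than real probes.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class Ticket:
    """Hypothetical stand-in for a record in a ticketing system."""
    ticket_id: str
    summary: str
    notes: List[str] = field(default_factory=list)
    status: str = "open"


def run_diagnostics(checks: Dict[str, Callable[[], bool]]) -> Dict[str, bool]:
    """Run each named check and record whether it passed."""
    return {name: check() for name, check in checks.items()}


def triage(ticket: Ticket, checks: Dict[str, Callable[[], bool]]) -> str:
    """Automate the diagnostic sub-task, then decide what happens next."""
    results = run_diagnostics(checks)
    failed = sorted(name for name, passed in results.items() if not passed)
    if not failed:
        # Every check passed: a follow-on automation updates the ticket.
        ticket.notes.append("All diagnostics passed; closed automatically.")
        ticket.status = "resolved"
        return "auto-resolved"
    # At least one check failed: hand the L1 agent a concrete next step.
    ticket.notes.append("Failed checks: " + ", ".join(failed))
    return "next step for L1 agent: investigate " + failed[0]


# Stubbed checks (assumptions for illustration only).
ticket = Ticket("INC-1", "Customer reports slow response")
checks = {"ping": lambda: True, "disk_space": lambda: False}
print(triage(ticket, checks))  # next step for L1 agent: investigate disk_space
```

The point of the sketch is that only the diagnostic sub-task is automated; the agent still owns the resolution, but starts from a specific finding instead of a blank ticket.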
3. Integration
Large enterprises and service providers run diverse, heterogeneous environments of applications, ITSM tools, network elements, and more. Any approach to incident resolution needs an intrinsic ability to connect to a variety of systems through whatever interfaces they expose, and to process queries and responses from those systems with ease.
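One common way to get this kind of connectivity is to put every external system behind the same small interface. The sketch below assumes a hypothetical `SystemAdapter` base class with two fake adapters that return canned data; real adapters would wrap each system's actual interface.

```python
from abc import ABC, abstractmethod
from typing import Dict, List


class SystemAdapter(ABC):
    """Hypothetical common interface that hides how each system is reached."""

    @abstractmethod
    def query(self, request: str) -> Dict[str, str]:
        """Send a request and return the response as a plain dictionary."""


class FakeTicketingAdapter(SystemAdapter):
    """Stand-in for a ticketing system; returns canned data."""

    def query(self, request: str) -> Dict[str, str]:
        return {"source": "ticketing", "request": request, "result": "ticket found"}


class FakeNetworkAdapter(SystemAdapter):
    """Stand-in for a network element; returns canned data."""

    def query(self, request: str) -> Dict[str, str]:
        return {"source": "network", "request": request, "result": "link up"}


def collect(adapters: List[SystemAdapter], request: str) -> List[Dict[str, str]]:
    """Ask every connected system the same question through one interface."""
    return [adapter.query(request) for adapter in adapters]


responses = collect([FakeTicketingAdapter(), FakeNetworkAdapter()], "status of site A")
for response in responses:
    print(response["source"], "->", response["result"])
```

With this shape, a resolution procedure only ever calls `query`, so supporting a new system means writing one more adapter rather than changing the procedures.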
4. Continuous Improvement
As services evolve, resolution procedures need to be created for new events and incidents. Procedures for existing events and incidents also need to be continuously updated and improved. To be sustainable, this continuous improvement process needs to be built into the main resolution process. Improvement efforts that are ad hoc and disconnected from the overall incident resolution process have proven to be unsustainable.
Leaving any one of these key capabilities out of the mix leads to less-than-ideal results and a wasted investment. Schedule a demo of RESOLVE today to see this game-changing system LIVE.