Continuing from the previous blog post: as Splunk IT Service Intelligence (Splunk ITSI) and Resolve look to displace traditional event management solutions such as IBM Netcool, CA Spectrum, and HP Operations Manager, there is an opportunity to transform how events are managed, specifically the traditional event/alert acceptance process.
With traditional Event Management, introducing a new system or device into the IT infrastructure usually means adding new alerts that will be sent to the operations team. Ideally, this would involve explicitly defining and documenting how each alert is to be “handled”: how to diagnose it and then how to remediate it. In practice, many organizations default to mapping only how the alert should be “normalized” into the field structure supported by the event management system (severity and so on). The response process and automation are not considered because there has been no platform available to address this gap, until now, with Splunk ITSI + Resolve.
What Makes Splunk ITSI So Cool?
Firstly, the ability of Splunk ITSI to keep all the fields of the events it receives is pretty damn cool. Operations teams no longer have to spend a bunch of time managing “rules” files that map event fields. Splunk ITSI also introduces the notion of Notable Events, which forces the operations team to think about which events they want to pay attention to and act on, and possibly to correlate across multiple events. Additionally, with Resolve, operations now have a platform on which to define how the Notable Events should be “handled.” That handling is an explicit set of instructions and decision points required to assess, diagnose, and repair the incident, all enriched with automation that is either triggered when events are received or initiated manually by the operator directly from the process guidance.
The notion of Notable Events, coupled with remediation process guidance and automation, provides a huge opportunity to transform the traditional Event/Alert Acceptance Process into one that is focused on real, actionable events. It also provides the intelligence needed to respond and automate, both of which are critical to increasing the productivity of the operations team.
Why is this important?
Let’s look at a mid-tier Communications Service Provider (CSP). It can easily receive two million events per day, which can be correlated down to 100,000-200,000 notable events, roughly 5 to 10 percent of the raw volume. Further enrichment through automation can then determine whether:
- The event has already been ticketed, in which case the existing ticket is updated instead of a new one being created
- The problem still exists (e.g., route flapping)
- It is a valid incident, in which case a ticket is created and contextual information and diagnostic test results are gathered
In some cases, the incidents can be resolved automatically through automation. If not, the operators can see the Resolve process guidance directly within the Splunk ITSI console, which provides the instructions and embedded automations needed to further troubleshoot and repair the problem. This ability to streamline remediation through process guidance and automation is key to scaling the operations team and continuing to support new, complex technologies without simply adding expensive headcount.
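The enrichment checks above can be sketched as a small decision function. This is a minimal illustration, not the Resolve or Splunk ITSI API: the `NotableEvent` fields, the `Action` names, and the `triage` function are all hypothetical, and closing an event whose problem has cleared is an assumption about what would follow that check.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional, Set


class Action(Enum):
    UPDATE_TICKET = "update_ticket"        # already ticketed: update, do not duplicate
    CLOSE_AS_CLEARED = "close_as_cleared"  # assumption: problem no longer exists
    AUTO_REMEDIATE = "auto_remediate"      # valid incident with an automation available
    ESCALATE = "escalate"                  # valid incident, operator follows guidance


@dataclass
class NotableEvent:
    # Hypothetical fields, chosen only to mirror the three checks in the text.
    event_id: str
    alert_type: str
    existing_ticket: Optional[str] = None  # id of an open ticket, if any
    still_failing: bool = True             # result of a re-check (e.g. route still flapping)


def triage(event: NotableEvent, automated_alert_types: Set[str]) -> Action:
    """Apply the checks in the order the text lists them."""
    if event.existing_ticket is not None:
        return Action.UPDATE_TICKET
    if not event.still_failing:
        return Action.CLOSE_AS_CLEARED
    if event.alert_type in automated_alert_types:
        return Action.AUTO_REMEDIATE
    return Action.ESCALATE


if __name__ == "__main__":
    automated = {"route_flap"}
    event = NotableEvent("e-1", "route_flap")
    print(triage(event, automated).value)
```

The ordering matters in this sketch: checking for an existing ticket first avoids running diagnostics or automations for an incident that is already being worked.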
Notable Events Review
Resolve Systems has experience accelerating incident response for IT Operations and Security Operations globally. We have developed several key technologies essential to bringing automation to operational teams. One that is specifically relevant to the Event/Alert Acceptance Process is “Resolution Routing.” It allows you to manage the mapping between a large number of events and their automation/process guidance centrally, rather than on an individual, ad hoc basis. More importantly, it provides data points which, together with the ticket data and event data, can be fed into Splunk analytics, visualization, and reporting to let you ask interesting questions such as:
- What % of events are managed versus unmanaged, i.e., have a response process and/or automation defined?
- What % of events are automated or partially automated?
- What % of events are escalated to human operators?
- What % of events are ticketed?
- What are the top managed events that currently do not have automation?
- What are the top unmanaged events that should have a remediation process and/or automation defined?
- Which process guidance is used most frequently?
- Which automations are used most frequently?
This business intelligence is the key to determining which areas should be prioritized and invested in further to improve the effectiveness of the operations team.
At Resolve Systems we have been exploring the application of machine learning, leveraging the analytics data to recommend and/or automatically configure and deploy automations for unmanaged events where they have a high likelihood of applying. By learning from the Resolution Routing mapping between events and automations, along with analytics data on alert types, device classes, vendors, and so on across multiple customers, we are able to determine whether automations could be applied to specific unmanaged events, further driving the adoption of automation.
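To make the arithmetic behind these questions concrete, here is a minimal sketch that computes several of them from a list of event records. In practice this reporting would be done with Splunk searches over the event, ticket, and routing data; the `EventRecord` fields and the `summarize` function below are hypothetical and exist only for illustration.

```python
from collections import Counter
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class EventRecord:
    # Hypothetical record: one notable event joined with its routing and ticket data.
    alert_type: str
    guidance: Optional[str] = None    # process guidance mapped to the event, if any
    automation: Optional[str] = None  # automation mapped to the event, if any
    escalated: bool = False
    ticketed: bool = False

    @property
    def managed(self) -> bool:
        # "Managed" means a response process and/or automation is defined.
        return self.guidance is not None or self.automation is not None


def pct(part: int, whole: int) -> float:
    """Percentage rounded to one decimal place; 0.0 when there are no events."""
    return round(100.0 * part / whole, 1) if whole else 0.0


def summarize(records: List[EventRecord], top_n: int = 3) -> Dict[str, object]:
    total = len(records)
    managed = [r for r in records if r.managed]
    unmanaged = [r for r in records if not r.managed]
    managed_no_auto = [r for r in managed if r.automation is None]
    return {
        "pct_managed": pct(len(managed), total),
        "pct_automated": pct(sum(r.automation is not None for r in records), total),
        "pct_escalated": pct(sum(r.escalated for r in records), total),
        "pct_ticketed": pct(sum(r.ticketed for r in records), total),
        "top_managed_without_automation":
            Counter(r.alert_type for r in managed_no_auto).most_common(top_n),
        "top_unmanaged":
            Counter(r.alert_type for r in unmanaged).most_common(top_n),
        "top_guidance":
            Counter(r.guidance for r in records if r.guidance).most_common(top_n),
        "top_automations":
            Counter(r.automation for r in records if r.automation).most_common(top_n),
    }
```

The two "top" lists for events without automation are the ones that point at where to invest next: managed events that still depend on manual steps, and unmanaged events that have no defined response at all.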
- Download the Resolve Add-on for Splunk IT Service Intelligence from Splunkbase, the Splunk app store.