Continuing from the previous blog post, as Splunk IT Service Intelligence (Splunk ITSI) and Resolve look to displace traditional event management solutions such as IBM Netcool, CA Spectrum, and HP Operations Manager, there is an opportunity to transform how events are managed, specifically the traditional event/alert acceptance process.
With traditional event management, when a new system or device is introduced into the IT infrastructure, a common process involves adding new alerts that will be sent to the operations team. Ideally, this would involve explicitly defining and documenting how each alert is to be “handled”: how to diagnose it and then how to remediate it. However, many organizations simply default to mapping how the alert should be “normalized” into the field structure supported by the event management system (severity, and so on). The response process and automation are not considered because there has been no platform available to address this gap, until now, with Splunk ITSI + Resolve.
Firstly, the ability of Splunk ITSI to keep all the fields of the events it receives is pretty damn cool. Operations teams no longer have to spend a bunch of time managing “rules” files that map event fields. Splunk ITSI also introduces the notion of Notable Events, which forces the operations team to think about which events they want to pay attention to and act on, and possibly correlate against multiple events. Additionally, with Resolve, operations now have a platform on which to define how the Notable Events should be “handled”: the explicit set of instructions and decision points required to assess, diagnose and repair the incident. All of these are enriched with automations, which are triggered when events are received and/or initiated manually by the operator directly from the process guidance.
The notion of Notable Events, coupled with remediation process guidance and automation, provides a huge opportunity to transform the traditional Event/Alert Acceptance Process into one that is focused on real, actionable events. It also provides the intelligence needed to respond and automate, both of which are critical to accelerating the operations team and increasing its productivity.
Let’s look at a mid-tier Communications Service Provider (CSP). It can easily receive two million events per day, which can be correlated down to 100,000-200,000 notable events. Further enrichment by automation can be performed to determine whether:
In some cases, the incidents can be resolved automatically through automation. If not, operators can see the Resolve process guidance directly within the Splunk ITSI console, which provides the necessary instructions and embedded automations to further troubleshoot and repair the problem. This ability to streamline the remediation process through process guidance and automation is key to scaling the operations team and continuing to support new, complex technologies without simply adding expensive headcount.
Notable Events Review
Resolve Systems has experience accelerating incident response for IT Operations and Security Operations globally. We have innovated several key technologies essential to bringing automation to operational teams. One that is specifically relevant to the Event/Alert Acceptance Process is “Resolution Routing.” This allows you to manage the mapping between a large number of events and automation/process guidance, rather than handling each one on an individual, ad hoc basis. More importantly, it provides data points that, together with the ticket data and event data, can be fed into Splunk analytics, visualization and reporting to allow you to ask interesting questions such as:
This business intelligence is the key to determining which areas should be prioritized and invested in further to accelerate the effectiveness of the operations team.
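To make the Resolution Routing idea a little more concrete, here is a minimal Python sketch of a routing table that maps event fields to a response, plus a coverage count of the kind that could feed reporting. This is purely illustrative and is not Resolve's or Splunk's actual API: the `Route` class, the `route_event` and `coverage` functions, and every field name and response name (`alert_type`, `device_class`, `restart_interface`, and so on) are hypothetical.

```python
from collections import Counter
from dataclasses import dataclass
from typing import Optional


@dataclass
class Route:
    """One hypothetical routing rule: matching fields -> response to launch."""
    match: dict      # field name -> value the event must carry
    response: str    # name of an automation or process-guidance document
    automated: bool  # True if the response can run without an operator


def route_event(event: dict, routes: list) -> Optional[Route]:
    """Return the first route whose match fields all equal the event's fields."""
    for route in routes:
        if all(event.get(key) == value for key, value in route.match.items()):
            return route
    return None  # no rule matched: the event is "unmanaged"


def coverage(events: list, routes: list) -> Counter:
    """Count how many events are automated, guided, or unmanaged."""
    counts = Counter()
    for event in events:
        route = route_event(event, routes)
        if route is None:
            counts["unmanaged"] += 1
        elif route.automated:
            counts["automated"] += 1
        else:
            counts["guided"] += 1
    return counts


# Illustrative data only; none of these values come from a real deployment.
ROUTES = [
    Route({"alert_type": "link_down", "device_class": "router"},
          response="restart_interface", automated=True),
    Route({"alert_type": "high_cpu"},
          response="cpu_triage_guide", automated=False),
]

EVENTS = [
    {"alert_type": "link_down", "device_class": "router", "vendor": "acme"},
    {"alert_type": "high_cpu", "device_class": "server", "vendor": "acme"},
    {"alert_type": "fan_failure", "device_class": "switch", "vendor": "acme"},
]

summary = coverage(EVENTS, ROUTES)
```

Keeping the rules in one table, instead of wiring each alert to a response individually, is what makes the "unmanaged" bucket visible. That bucket is the natural starting point for deciding where to invest in new automation.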
At Resolve Systems we have been exploring the application of machine learning, leveraging the analytics data to recommend and/or automatically configure and deploy automations for unmanaged events that have a high likelihood of applicability. By learning from the Resolution Routing mapping between events and automations, and from analytics data on alert types, device classes, vendors, etc. across multiple customers, we are able to determine whether automations could be applied to specific unmanaged events, further driving the adoption of automation.