At the risk of sounding like a broken record, we all know that Network Operations Centers (NOCs) and IT Ops are under constant pressure to deliver greater efficiency and productivity while managing an infrastructure that keeps growing in scale and complexity. On top of this, the introduction of new DevOps processes and new technologies such as containers, network and storage virtualization, and hybrid cloud can paint an intimidating picture.
There is no secret sauce – you need to understand the strategic importance of various aspects of your environment and processes and make the necessary investments to make those areas as efficient and streamlined as possible.
In a typical IT and NOC environment, the central system that drives human activity and workflow is an event management system such as IBM Netcool, CA Spectrum or HP Operations Manager. These systems consolidate and normalize an extremely large volume of events into a much smaller number, in the hope that the operations team has sufficient capacity, knowledge and experience to handle them. Over time, tools like Splunk Enterprise, performance management and other monitoring tools are added. These feed additional alerts into the event management system, providing deeper insight and analytics into the applications, systems and networks that they monitor.
The fundamental challenge for IT and Network Ops is how to directly scale the capacity of the operations team to be able to respond to the increasing volume and complexity of alerts.
- How do you scale the operations team without adding additional headcount?
- How do you reduce the human workload and reduce the number of alerts that are simply not looked at due to limited team capacity?
- How do you minimize escalation to limited resources/subject matter experts?
- How do you enable less experienced technicians to be more productive and reduce the typical 6-to-9-month ramp-up time?
- And ultimately, how do you reduce mean time to repair (MTTR)?
To answer these challenges, you need to do the following things well:
- Streamline – Consolidate your workflows and reduce the “swivel chair” between tools
- Automate – Automate intelligently and strategically where you can
- Left-Shift – Define clear instructions on how to handle alerts, and engage a broader team to help with the workload and reduce escalation
Streamline Operations with Splunk ITSI + Resolve
Splunk IT Service Intelligence (Splunk ITSI) gives your 20-year-old event management tools a fresh makeover. What if, instead of “normalizing” the rich event data into a common data model, you were able to keep all that information and drive business analytics and visual dashboards in real time, while streamlining your workflow on a single platform?
Many organizations already use Splunk Enterprise alongside their event management system:
- Splunk aggregates logs from a variety of sources
- Splunk correlates events and sends alerts into the event management system
- Technicians log into Splunk to triage the event logs in more depth
- Event management systems feed alerts and ticket data into Splunk for business intelligence and analytics
Splunk ITSI makes a strong case that this should all be a single integrated platform, avoiding the unnecessary friction of going back and forth between tools. Below we’ll explore the proposition in more detail, especially when we add Resolve into the mix: Splunk ITSI monitors and manages the alerts, while Resolve automates and accelerates their resolution.
Automate! Automate! Automate!
The obvious key to scaling operations is automation. How you approach and adopt automation, however, often determines whether you spend a lot of cycles with limited results or quickly see significant ROI for your efforts. A good place to start is by examining your event and ticket data. Many IT operations teams are already feeding their event and ticket data into Splunk Enterprise, which makes it easy to visualize the data and answer questions such as:
- What types of alerts do you get the most? How many are ticketed?
- Which alerts take the most time to triage and remediate?
- Which alerts have higher priority? How much impact/value?
The analysis provides a starting point for identifying potential candidates for automation and for discussion with subject matter experts. In our experience, only 10-15% of alerts can be fully automated end to end. The remaining alerts are better served by the human-guided automation approach, which integrates automation with process guidance and knowledge instructions, as described in the Left-Shift section below.
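As a rough illustration of this kind of analysis, the sketch below groups a handful of alert records by type and ranks them by total handling time. The sample data and every field name (`type`, `ticketed`, `minutes_to_resolve`) are made up for the example; in practice the records would come from your own event and ticket exports, and you would probably weigh priority and business impact as well.

```python
from collections import defaultdict

# Hypothetical sample records; field names and values are illustrative only.
ALERTS = [
    {"type": "link_down", "ticketed": True, "minutes_to_resolve": 45},
    {"type": "link_down", "ticketed": True, "minutes_to_resolve": 35},
    {"type": "disk_full", "ticketed": True, "minutes_to_resolve": 20},
    {"type": "cpu_high", "ticketed": False, "minutes_to_resolve": 5},
    {"type": "cpu_high", "ticketed": False, "minutes_to_resolve": 5},
    {"type": "cpu_high", "ticketed": True, "minutes_to_resolve": 14},
]


def summarize(alerts):
    """Group alerts by type and rank the types by total handling time."""
    groups = defaultdict(list)
    for alert in alerts:
        groups[alert["type"]].append(alert)

    summary = []
    for alert_type, rows in groups.items():
        total = sum(r["minutes_to_resolve"] for r in rows)
        summary.append({
            "type": alert_type,
            "count": len(rows),                                  # how often it fires
            "ticketed": sum(1 for r in rows if r["ticketed"]),   # how many became tickets
            "total_minutes": total,                              # overall human effort
            "avg_minutes": total / len(rows),                    # effort per alert
        })
    # Types that consume the most human time come first.
    return sorted(summary, key=lambda s: s["total_minutes"], reverse=True)


summary = summarize(ALERTS)
for row in summary:
    print(row["type"], row["count"], row["ticketed"], row["total_minutes"])
```

Ranking by total time rather than by raw count is only one possible choice; a frequent but trivial alert and a rare but expensive one can both be good automation candidates.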
The full automation candidates, however, can significantly reduce the human workload and often fall into a category of automations we refer to as “event validation.” Example use cases:
- Is there already a related ticket for this alert? Don’t create another ticket; just update the existing ticket and the event with the appropriate references.
- Is the problem still happening? Check whether it’s still a problem, and don’t create a ticket if it no longer is.
- Is it still a problem? Create the ticket, automate gathering of the contextual details that the technician is going to need, and update the ticket and the event.
A typical NOC at a Communications Service Provider (CSP) may receive several million events per day. Event validation automations are a good place to start; they can significantly reduce the number of actionable alerts and deliver ROI.
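The three event validation checks above can be sketched as a single function. This is a minimal illustration under stated assumptions, not how Resolve or Splunk ITSI implement it: `InMemoryTicketSystem`, `validate_event`, the event fields and the `still_failing` and `gather_context` callbacks are all hypothetical stand-ins for your real ticketing system, health probes and data-gathering automations.

```python
from dataclasses import dataclass, field


@dataclass
class Ticket:
    ticket_id: str
    alert_key: str
    notes: list = field(default_factory=list)


class InMemoryTicketSystem:
    """Hypothetical stand-in for a real ticketing system."""

    def __init__(self):
        self._tickets = {}

    def find_open(self, alert_key):
        return self._tickets.get(alert_key)

    def create(self, alert_key, context):
        ticket = Ticket(f"TCK-{len(self._tickets) + 1}", alert_key, [context])
        self._tickets[alert_key] = ticket
        return ticket


def validate_event(event, tickets, still_failing, gather_context):
    """Return (outcome, ticket_id) for one incoming event."""
    key = event["alert_key"]

    # 1. Related ticket already exists: update it instead of creating another.
    existing = tickets.find_open(key)
    if existing is not None:
        existing.notes.append(f"Duplicate event {event['event_id']}")
        return ("updated", existing.ticket_id)

    # 2. Problem no longer happening: do not create a ticket.
    if not still_failing(event):
        return ("cleared", None)

    # 3. Still a problem: create the ticket with gathered context attached.
    ticket = tickets.create(key, gather_context(event))
    return ("created", ticket.ticket_id)


tickets = InMemoryTicketSystem()
first = validate_event(
    {"event_id": "E1", "alert_key": "router1:link_down"},
    tickets, lambda e: True, lambda e: "interface status: down",
)
second = validate_event(
    {"event_id": "E2", "alert_key": "router1:link_down"},
    tickets, lambda e: True, lambda e: "",
)
third = validate_event(
    {"event_id": "E3", "alert_key": "router2:cpu_high"},
    tickets, lambda e: False, lambda e: "",
)
print(first, second, third)
```

Passing the probe and the context gatherer in as callbacks keeps the validation logic independent of any particular monitoring or ticketing product.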
Left-Shift with Human Guided Automation
The last element in scaling the operations team is enabling lower tiers or broader teams to take on more of the work of triaging and fixing repetitive problems, which reduces escalation to higher tiers with limited resources. This involves providing clear process guidance, instructions and decision support, along with automation that enables technicians to carry out diagnostic and repair tasks without the direct access or training otherwise needed to run the commands themselves.
Unlike traditional knowledge management or a wiki, Resolve’s approach to process guidance embeds the automation directly within the process instructions rather than requiring users to switch to a separate automation tool. This concept, human-guided automation, transforms how automation can be used to accelerate complex processes: a technician retains decision and control, which avoids the risk of developing a full end-to-end automation in the face of many possible unknowns.
Resolve Human-Guided Automation provides the following benefits:
- Managed alerts (Splunk notable events) have clearly defined processes and instructions to triage and remediate the problem
- Process instructions include automation results that provide the contextual details the technicians use to make decisions
- Process instructions embed automations to execute further diagnostics and repair
- Allows complex processes to be accelerated by automating parts of the process that are good candidates for automation while relying on humans for the critical decision points
Unlike traditional knowledge management or wikis, Resolve process guidance is “actionable” rather than optional reference material that goes unused. The embedded automation engages technicians because it saves them time and enables them to carry out tasks that previously would have had to be escalated, while ensuring that a consistent and streamlined process is followed.
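To make the idea of instructions with embedded automation more concrete, here is a small, generic sketch of a guided procedure: each step carries an instruction, an optional automation whose result is shown to the technician, and the choices the technician can make. It is not Resolve’s actual data model or API; `Step`, `run_procedure`, the step names and the sample automations are all invented for illustration.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, Optional


@dataclass
class Step:
    instruction: str
    # Optional automation embedded in the step; its result informs the decision.
    automation: Optional[Callable[[], str]] = None
    # Maps the technician's answer to the name of the next step.
    choices: Dict[str, str] = field(default_factory=dict)


def run_procedure(steps, start, decide):
    """Walk the procedure; automations run automatically, humans pick branches."""
    record = []
    name = start
    while name is not None:
        step = steps[name]
        result = step.automation() if step.automation else None
        answer = None
        if step.choices:
            # The human decision point: `decide` stands in for the technician.
            answer = decide(step.instruction, result, list(step.choices))
        record.append({"step": name, "result": result, "answer": answer})
        name = step.choices.get(answer) if step.choices else None
    return record


# Hypothetical procedure for an interface-down alert.
STEPS = {
    "check_interface": Step(
        instruction="Review the interface status and choose how to proceed.",
        automation=lambda: "eth0 is down",
        choices={"restart": "restart_interface", "escalate": "escalate"},
    ),
    "restart_interface": Step(
        instruction="Restarting the interface.",
        automation=lambda: "eth0 restarted",
    ),
    "escalate": Step(instruction="Escalate to the network engineering team."),
}

# A scripted technician who always chooses to restart.
record = run_procedure(STEPS, "check_interface", lambda text, result, options: "restart")
for entry in record:
    print(entry)
```

The returned `record` keeps every automation result and human choice in order, which is the kind of trail a subject matter expert would want to review if the problem is later escalated.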
How Does Splunk ITSI + Resolve Work?
Resolve and Splunk ITSI Workflow
- Events are sent from many sources into Splunk ITSI
- Splunk correlates the data and identifies “Notable Events” that require additional analysis. Typically these would be manually evaluated by a technician.
- Notable Events are pushed to Resolve, which triggers an automation to validate the event, create a ticket automatically, gather contextual information and run initial diagnostic tests on end systems and devices
- When the technicians examine the event, instead of viewing only the event details, they are presented with a Resolve dashboard embedded directly within Splunk ITSI. The dashboard displays results from the triggered automation, as well as the specific process guidance and step-by-step instructions on how to further analyze and repair the problem. These may include:
  - Stepping through guided decision trees
  - Executing additional diagnostic tests with assessed results
  - Executing an automated fix to repair the problem
  - Updating the Splunk Notable Event and the ticket with details of the activities performed
- For complex problems that need to be escalated, subject matter experts can navigate directly from the ticket or the Notable Event to the Resolve Resolution Record detailing all the activities that were taken, including:
  - Details of all the automations that were executed
  - Decision tree selections made by users, and how long each took
  - Messaging collaboration and work notes
- Finally, the analytics derived from the event data (from Splunk), as well as the process execution and automation data (from Resolve), help provide guidance on where the operations team needs to focus next: which events to automate, which processes to build out and which tasks to automate
Watch the video and see Resolve in action with Splunk ITSI.
Splunk ITSI Demo
- Download the Resolve Add-on for Splunk IT Service Intelligence from Splunkbase, the Splunk app store.