Network operations and IT operations teams contend with a variety of challenges today. As many organizations undergo digital transformation, network and IT operations are expected to achieve the highest possible efficiency while maintaining increasingly complex networks and infrastructures. Add to this the growing number of technologies IT and network operations need to manage—hybrid cloud, containers, intent-based networks, virtualized storage—and you have a real headache.
Most IT operations teams and Network Operations Centers (NOCs) rely on an event management platform, such as Splunk ITSI, IBM Netcool, CA Spectrum, Micro Focus (HP) Operations Manager, or BMC TrueSight, to help the team track, normalize, and analyze events throughout the infrastructure. However, it seems there are always more events to handle than the team can reasonably get to in a day, and validating those events gets harder as the infrastructure gets more complex. To survive, NOCs and IT operations teams need to consolidate, automate, and left shift event-related work.
Let’s take Splunk ITSI as a promising place to host your team’s alert workflow, as it integrates Splunk Enterprise’s log analytics with event monitoring and management capabilities. Leveraging Splunk’s correlation search abilities, you can identify patterns across multiple data sources and generate a Notable Event (aka alert) when your search results meet specific conditions—thus reducing what used to be many events that share the same cause into a single, useful alert. Once your events are consolidated, consider how you can do the same for your team’s workflow: look at tools you can integrate right into Splunk ITSI.
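To make the consolidation idea concrete, here is a minimal sketch in plain Python. It is not Splunk's search language or API. The `Event` fields, the `consolidate` function, and the `min_count` threshold are all hypothetical names chosen for illustration. It only shows the general pattern of collapsing events that share a likely cause into one alert.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class Event:
    source: str     # hypothetical: the data source that reported the event
    host: str       # hypothetical: the affected device or server
    signature: str  # hypothetical: a normalized description of the symptom


def consolidate(events, min_count=3):
    """Group events by (host, signature) and emit one alert per group
    that contains at least min_count events."""
    groups = defaultdict(list)
    for event in events:
        groups[(event.host, event.signature)].append(event)

    alerts = []
    for (host, signature), members in groups.items():
        if len(members) >= min_count:
            alerts.append({
                "host": host,
                "signature": signature,
                "event_count": len(members),
                # Record which data sources contributed to the alert.
                "sources": sorted({m.source for m in members}),
            })
    return alerts


# Made-up sample data: three sources report the same symptom on one host.
events = [
    Event("syslog", "core-sw-01", "link_down"),
    Event("snmp", "core-sw-01", "link_down"),
    Event("netflow", "core-sw-01", "link_down"),
    Event("syslog", "edge-rt-07", "high_cpu"),
]
alerts = consolidate(events)
print(alerts)
```

In this made-up sample, the three `link_down` events become a single alert, while the lone `high_cpu` event stays below the threshold and produces none.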
When IT and network operations leaders are hesitant to automate, it’s usually because they’re concerned about spending a lot of effort but not seeing an increase in efficiency. Indeed, many have watched a big-bang automation initiative fizzle out with no results. That can happen if you try to fully automate the response to every Splunk ITSI alert—it doesn’t always work. Instead, consider how your team deals with an alert; the alert’s workflow has a lifecycle:
- Validation: Check whether the alert’s event is still happening now. If so, check whether the event is being caused by an expected routine activity, like system maintenance. If there’s no expected cause for the event, you’ve got an incident on your hands.
- Diagnosis: Investigate infrastructure, systems, and services to find out what’s causing the incident.
- Resolution: Fix the source of the incident.
First, focus on automating the Validation phase. This can save your team a great deal of time by handling tasks like checking for related tickets, verifying that the issue persists, creating the ticket, and populating it with system information that will be needed later. This is all readily automatable stuff! You may be surprised by the efficiency gains your team sees from this alone. And if your automation integrates right into Splunk ITSI, the team can start diagnosis on the validated alert in the same interface.
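The validation checks described above could be sketched roughly as follows. This is an illustration under stated assumptions, not a Splunk ITSI integration. The `Alert` and `Ticket` classes, the `validate_alert` function, and the callables passed into it are hypothetical stand-ins for whatever monitoring, maintenance-calendar, and ticketing systems your team uses.

```python
from dataclasses import dataclass, field


@dataclass
class Alert:
    alert_id: str
    host: str
    description: str


@dataclass
class Ticket:
    alert_id: str
    host: str
    summary: str
    details: dict = field(default_factory=dict)


def validate_alert(alert, is_still_occurring, in_maintenance, open_tickets, system_info):
    """Run the validation steps in order and return (outcome, ticket).

    All arguments after `alert` are hypothetical hooks into your own tools.
    """
    # 1. Is the event still happening right now?
    if not is_still_occurring(alert):
        return "cleared", None
    # 2. Is there an expected cause, such as scheduled maintenance?
    if in_maintenance(alert.host):
        return "expected", None
    # 3. Is there already a related ticket for this host?
    for ticket in open_tickets:
        if ticket.host == alert.host:
            return "duplicate", ticket
    # 4. Otherwise it is an incident: create and pre-populate a ticket.
    ticket = Ticket(alert.alert_id, alert.host, alert.description,
                    details=system_info(alert.host))
    open_tickets.append(ticket)
    return "incident", ticket


# Usage with stubbed checks (all values are made up for the example).
tickets = []
alert = Alert("A-1", "db-01", "disk latency high")
outcome, ticket = validate_alert(
    alert,
    is_still_occurring=lambda a: True,
    in_maintenance=lambda host: False,
    open_tickets=tickets,
    system_info=lambda host: {"os": "linux", "uptime_days": 42},
)
print(outcome, ticket)
```

Running the same alert through a second time would return `"duplicate"` with the existing ticket, which is the kind of noise reduction that makes validation worth automating first.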
Once automated tasks are complete (as in the automated validation example above), it’s time to think about tools to help your team finish the tasks still left to do: incident diagnosis and resolution. Less experienced (i.e., Level 1) operators typically need:
- Timely information to facilitate decision-making
- Packaged commands
When you roll these things together, you have something we like to call “human-guided automation.” This approach helps more complex issues get diagnosed and resolved faster by automating some tasks and decisions while relying on humans for others. With all this help, issues that used to require a Level 2 or Level 3 operator can be handled by a Level 1 operator, reducing escalations and improving the efficiency of the team.
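One way to picture human-guided automation is a runbook in which some steps run automatically and others wait for an operator's decision. The sketch below is only an illustration. `Step`, `run_guided`, and the `approve` callback are hypothetical names, and the actions are stubs rather than real commands.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Step:
    name: str
    action: Callable[[], str]      # a packaged command, stubbed here
    needs_approval: bool = False   # True when a human must decide first


def run_guided(steps, approve):
    """Run each step in order, asking `approve` before any step that is
    flagged as needing a human decision. Returns a log of (name, result)."""
    log = []
    for step in steps:
        if step.needs_approval and not approve(step.name):
            log.append((step.name, "skipped by operator"))
            continue
        log.append((step.name, step.action()))
    return log


# Made-up runbook: gather information automatically, then let the
# operator decide whether to apply the fix.
steps = [
    Step("collect interface status", lambda: "eth0 down"),
    Step("restart interface", lambda: "eth0 restarted", needs_approval=True),
]
log = run_guided(steps, approve=lambda name: True)
print(log)
```

In practice the `approve` callback would be a prompt in the operator's interface. The point of the design is that the operator sees the collected information before deciding, and the packaged command does the rest.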
So why do we call this left shifting? In most tiered teams, escalations move to the right: from Level 1 to Level 2, and then to Level 3.
Instead of pushing work to the right (to more experienced but expensive and overloaded team members), IT and network operations leaders want to shift the work left, to the less experienced but more available team members, which helps the whole team get more done faster.