
3 Survival Strategies for Modern IT Challenges with Splunk ITSI & Resolve

September 6, 2018 • Resolve Staffer

Network operations and IT operations teams contend with a variety of challenges today. As many organizations undergo digital transformation, network and IT operations are expected to achieve the highest possible efficiency while maintaining increasingly complex networks and infrastructures. Add to this the growing number of technologies IT and network operations need to manage—hybrid cloud, containers, intent-based networks, virtualized storage—and you have a real headache.

Most IT operations teams and Network Operations Centers (NOCs) rely on an event management platform, such as Splunk ITSI, IBM Netcool, CA Spectrum, Micro Focus (HP) Operations Manager, or BMC TrueSight, to help the team track, normalize, and analyze events throughout the infrastructure. However, it seems there are always more events to handle than the team can reasonably get to in a day, and validating those events gets harder as the infrastructure gets more complex. To survive, NOCs and IT operations teams need to consolidate, automate, and left shift event-related work.

Consolidate

Let’s take Splunk ITSI as a promising place to host your team’s alert workflow, as it integrates Splunk Enterprise’s log analytics with event monitoring and management capabilities. Leveraging Splunk’s correlation search abilities, you can identify patterns across multiple data sources and generate a Notable Event (aka alert) when your search results meet specific conditions—thus reducing what used to be many events that share the same cause into a single, useful alert. Once your events are consolidated, consider how you can do the same for your team’s workflow: look at tools you can integrate right into Splunk ITSI.
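To make the consolidation idea concrete, here is a minimal Python sketch. It is not Splunk ITSI's correlation search and does not use any Splunk API; the `Event` fields, the `consolidate` function, and the `min_events` threshold are hypothetical names chosen for illustration. It groups raw events that share a cause and emits a single alert per group once the group meets a condition.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Event:
    source: str  # hypothetical: where the event came from (log, monitor, trap)
    host: str    # hypothetical: affected host
    cause: str   # hypothetical: normalized key shared by related events


def consolidate(events, min_events=2):
    """Group events by shared cause and return one alert per qualifying group."""
    groups = defaultdict(list)
    for event in events:
        groups[event.cause].append(event)

    alerts = []
    for cause, members in groups.items():
        # Assumed condition: raise an alert only when at least `min_events`
        # related events have been seen.
        if len(members) >= min_events:
            alerts.append({
                "cause": cause,
                "event_count": len(members),
                "hosts": sorted({e.host for e in members}),
                "sources": sorted({e.source for e in members}),
            })
    return alerts


events = [
    Event("syslog", "web-01", "db-unreachable"),
    Event("apm", "web-02", "db-unreachable"),
    Event("snmp", "db-01", "db-unreachable"),
    Event("syslog", "web-03", "disk-full"),
]
alerts = consolidate(events)
```

In this sketch, three events from different data sources collapse into one alert, while the lone `disk-full` event does not meet the assumed threshold and raises nothing.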

Integrated with Splunk ITSI? See how Resolve and Splunk create a dynamic duo to address notable events.

Automate

When IT and network operations leaders are hesitant to automate, it’s usually because they’re concerned about spending a lot of effort but not seeing an increase in efficiency. Indeed, many have watched a big-bang automation initiative fizzle out with no results. That can happen if you try to fully automate the response to every Splunk ITSI alert—it doesn’t always work. Instead, consider how your team deals with an alert; the alert’s workflow has a lifecycle:

  1. Validation: Check whether the alert’s event is still happening now. If so, check whether the event is being caused by an expected routine activity, like system maintenance. If there’s no expected cause for the event, you’ve got an incident on your hands.
  2. Diagnosis: Investigate infrastructure, systems, and services to find out what’s causing the incident.
  3. Resolution: Fix the source of the incident.

First, focus on automating the Validation phase. This can save your team a great deal of time by handling tasks such as checking for related tickets, verifying that the issue persists, creating the ticket, and populating it with system information that will be needed later. All of these are readily automatable. You may be surprised by the efficiency gains your team sees from this alone. And if your automation integrates right into Splunk ITSI, the team can start diagnosis on the validated alert in the same interface.
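The validation steps described above can be sketched in a few lines of Python. This is an illustration under stated assumptions, not Resolve's or Splunk's implementation: `Alert`, `ValidationResult`, `validate_alert`, and the callables passed in (`still_failing`, `in_maintenance`, `system_info`) are hypothetical stand-ins for lookups a real integration would make against monitoring, change-management, and ticketing tools.

```python
from dataclasses import dataclass, field


@dataclass
class Alert:
    host: str
    check: str


@dataclass
class ValidationResult:
    is_incident: bool
    reason: str
    ticket: dict = field(default_factory=dict)


def validate_alert(alert, still_failing, in_maintenance, open_tickets, system_info):
    """Run the validation steps and decide whether the alert is an incident."""
    # Step 1: is the event still happening now?
    if not still_failing(alert):
        return ValidationResult(False, "event cleared")

    # Step 2: is it explained by an expected routine activity?
    if in_maintenance(alert.host):
        return ValidationResult(False, "expected maintenance")

    # Step 3: check for a related open ticket to avoid creating a duplicate.
    for ticket in open_tickets:
        if ticket["host"] == alert.host and ticket["check"] == alert.check:
            return ValidationResult(True, "existing ticket", ticket)

    # Step 4: create a ticket pre-populated with information needed later.
    ticket = {
        "host": alert.host,
        "check": alert.check,
        "details": system_info(alert.host),
    }
    return ValidationResult(True, "new ticket", ticket)


result = validate_alert(
    Alert("web-01", "http"),
    still_failing=lambda alert: True,
    in_maintenance=lambda host: False,
    open_tickets=[],
    system_info=lambda host: {"os": "linux"},
)
```

Passing the lookups in as functions keeps the sketch independent of any particular tool; each one could later be replaced by a call to whatever system holds that information.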

Left Shift

Once automated tasks are complete (as in the automated validation example above), it’s time to think about tools to help your team finish the tasks still left to do: incident diagnosis and resolution. Less experienced (i.e., Level 1) operators typically need:

  1. Guidance
  2. Timely information to facilitate decision-making
  3. Packaged commands

When you roll these three things together you have something we like to call “human-guided automation.” This approach helps more complex issues get diagnosed and resolved faster by automating some tasks and decisions while relying on humans for others. With all this help, issues that used to require a Level 2 or Level 3 agent can be handled by a Level 1, reducing escalations and improving the efficiency of the team.
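One way to picture human-guided automation is as a runbook in which each step bundles the three things listed above. The Python sketch below is hypothetical: `GuidedStep`, `run_guided`, and the sample steps are invented for illustration and do not represent a Resolve or Splunk API. Automation gathers the information, a person makes the decision, and automation runs the packaged command only when approved.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class GuidedStep:
    guidance: str               # what the operator should consider
    gather: Callable[[], str]   # fetches timely information for the decision
    command: Callable[[], str]  # packaged command, run only on approval


def run_guided(steps, decide):
    """Walk the steps: gather data, let a human decide, then act or skip."""
    log = []
    for step in steps:
        info = step.gather()
        if decide(step.guidance, info):
            log.append(step.command())
        else:
            log.append("skipped: " + step.guidance)
    return log


steps = [
    GuidedStep(
        "Restart the web service if it is not responding",
        gather=lambda: "web service: not responding",
        command=lambda: "restarted web service",
    ),
    GuidedStep(
        "Fail over the database if replication lag is high",
        gather=lambda: "replication lag: normal",
        command=lambda: "failed over database",
    ),
]

# Stand-in for the operator's judgment: approve only when the data shows a problem.
log = run_guided(steps, decide=lambda guidance, info: "not responding" in info)
```

In a real tool the `decide` callback would be a prompt shown to the Level 1 operator rather than a lambda; it is a function here only so the example runs unattended.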

So why do we call this left shifting? In most tiered teams, escalations look like this:

[Figure: escalation path in a tiered team, with work moving to the right from Level 1 to Level 2 to Level 3]

Instead of pushing work to the right, as shown above, to more experienced but expensive and overloaded team members, IT and network operations leaders want to shift the work left, down to less experienced but more available team members. This helps the whole team get more done faster.

Find out more about how Resolve can help you execute these critical operations survival strategies when you have a Notable Event from Splunk ITSI.


About the Author, Resolve Staffer:

This post was written by one of the awesome contributors on the Resolve team.
