Network Operations might never be the same.
But then again, why would anyone want it to be? The power of automation and orchestration can bring incredible value to the Network Operations Center (NOC), including the business-critical call to get proactive and ahead of the incident response and management game. It’s more than a towering volume of events – it’s the complexities involved, too.
From legal and compliance issues to the number of technologies and systems companies support, to the “eyes on glass” spent on a volume of attacks that turn out to be false positives – there’s a lot to unpack in a vast landscape.
The business has to be constantly involved as well, as the NOC is a vital backbone of incident management and everything in between.
The pressure is always on.
Integrated, Automated Response: An Ideal Fit for Your Environment
Automated incident response obviously unlocks efficiencies like automated incident detection, notification, and resolution.
We know incident response is a complex process with many mission-critical touchpoints, such as those of production systems. The most common challenges include a lack of reliable talent and, to no surprise, the fact that its procedures and processes are complicated, which can create the misconception that they are difficult to automate.
But the benefits of automation in incident response far outweigh the challenges NOCs face:
- With automation comes much-needed speed, as it can process events much faster than human NOC teams, which means an accelerated mean time to resolution (MTTR).
- Automation also performs tasks and processes consistently, whereas humans on different shifts might do things differently depending on the training they received.
- There’s 24/7 monitoring and availability with automation, but a human can miss an alarm with just one glance away from their screen.
- Automation can scale according to events and customer expectations as needed, at the drop of a hat.
New and Improved (and Automated) Incident Management
NOC engineers today can lay the groundwork for greater success with automation, starting with diagnoses.
Diagnoses: You Have to Start Somewhere
When issues arise for NOCs, no matter if it’s for the first time or the 20th, they have to determine what’s going wrong within the process and what needs to be done to repair it, according to an existing process or flow chart – the first step (or few) of diagnostics. Rather than automating everything all at once, as many organizations think is the right move to make, these first steps serve as the perfect place to start.
It’s best to start with the knowledge you have and the problem you’re facing, and then build it into your initial automation. From there, assess the process and the value you’re getting to identify potential improvement. Basically, has it solved the problem enough?
Additionally, when you’re looking for the best automation starting point, it’s important to look at alarm volume, and balance it with severity. It can be as simple as getting logged into a device so an engineer can start running commands – this small start enables faster work than humans can complete. As more is learned along the way, more processes can be added to the automation cycle.
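One way to balance alarm volume against severity is to score each alarm type and start with whatever ranks highest. The sketch below is only an illustration: the severity weights and the alarm type names are assumptions, not values from any particular monitoring tool.

```python
from collections import Counter

# Hypothetical severity weights; tune these to your own alarm taxonomy.
SEVERITY_WEIGHT = {"critical": 5, "major": 3, "minor": 1}


def rank_automation_candidates(alarms):
    """Rank alarm types by volume weighted by severity.

    `alarms` is an iterable of (alarm_type, severity) tuples.
    Returns a list of (alarm_type, score), highest score first.
    """
    scores = Counter()
    for alarm_type, severity in alarms:
        # Unknown severities fall back to the lowest weight.
        scores[alarm_type] += SEVERITY_WEIGHT.get(severity, 1)
    return scores.most_common()
```

A noisy minor alarm can outrank a rare major one under this scoring, which matches the idea of weighing volume and severity together rather than looking at either alone.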
Return on investment (ROI), or the value of these automations, does come into play here, as businesses often measure ROI by the value the business, or IT, receives. But for real incident response improvements, ROI should also be measured by the value it provides to the NOC team, to indicate whether automation can be taken a step further.
False Positives and Noise that Never Ends
Burnout is among the most common, and worst-case, results of false positives for NOC teams. Never-ending noise and alert fatigue make it very difficult for controllers operating the alarm screen to know what to do, and when. False positives increase the stress levels of controllers and reduce the NOC’s responsiveness.
Let’s say 40 alarms come in from a single alarm dump that no one can seem to stop, but in between them is one missed critical alarm that causes an outage that nobody knows about until it’s too late – when customers start calling in and you’re left to figure out why as soon as possible.
Caution: Automation in incident response must be handled with care to avoid problems like over-ticketing. Issuing a ticket just because an alarm comes in doesn’t guarantee a fix. It’s about being aware of alarm severity and volume.
For example, a human might notice an alarm 3-5 minutes after it’s raised, only to find out that the alarm is actually coming in, we’ll say, 25 times per hour but clearing within one minute each time. That means 25 tickets per hour are being created for one event, and that’s certainly a far-from-ideal place to be. Instead, constraints around tickets are imperative for keeping the quantity under control and avoiding ticket fatigue.
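A ticket constraint of this kind can be as simple as allowing one ticket per alarm within a suppression window and counting the repeats. This is a minimal sketch under assumed names: `TicketThrottle`, the one-hour window, and the alarm key format are all hypothetical, not part of any specific ticketing product.

```python
class TicketThrottle:
    """Open at most one ticket per alarm key within a suppression window."""

    def __init__(self, window_seconds=3600):
        self.window_seconds = window_seconds
        # alarm key -> {"opened_at": timestamp, "occurrences": count}
        self._open = {}

    def should_open_ticket(self, alarm_key, timestamp):
        """Return True if this alarm should create a new ticket."""
        entry = self._open.get(alarm_key)
        if entry and timestamp - entry["opened_at"] < self.window_seconds:
            # Same event recurring: record it on the existing ticket instead.
            entry["occurrences"] += 1
            return False
        self._open[alarm_key] = {"opened_at": timestamp, "occurrences": 1}
        return True

    def occurrences(self, alarm_key):
        """How many times the alarm fired under the current ticket."""
        entry = self._open.get(alarm_key)
        return entry["occurrences"] if entry else 0
```

With this in place, the alarm that fires 25 times in an hour produces one ticket carrying an occurrence count of 25, which tells the engineer it is flapping rather than burying them in duplicates.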
Organizations should not expect automation to solve the problem at hand for them, but instead, they should be thoughtful about the process as a whole. This could include starting to identify the types of false positives coming in, and then building health checks and validations to determine device availability and alarm history – actions that will likely reduce the noise.
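As a rough illustration, a validation step might combine a reachability check with recent alarm history before anything reaches a controller. The alarm type name, the threshold, and the three labels below are assumptions made for the sketch; the reachability result and clear count would come from your own monitoring data.

```python
def triage_alarm(alarm_type, device_reachable, clears_last_hour, flap_threshold=5):
    """Label an alarm using a health check and its recent history.

    alarm_type:        hypothetical type name, e.g. "device_unreachable"
    device_reachable:  result of your own availability check (bool)
    clears_last_hour:  how often this alarm cleared itself in the last hour
    """
    # The alarm claims the device is down, but the health check disagrees.
    if alarm_type == "device_unreachable" and device_reachable:
        return "likely_false_positive"
    # The alarm keeps raising and clearing on its own.
    if clears_last_hour >= flap_threshold:
        return "flapping"
    return "actionable"
```

Only alarms labeled "actionable" would go straight to a human; the other two labels are candidates for suppression or a follow-up check.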
There are quite a few tools, including AIOps, that can be paired with automation as another way to lower the sound volume, so to speak, but given the devices seen in today’s environments, these tools alone cannot reduce the number of events enough to prevent fatigue. The more processes you can identify to auto-remediate, or to separate from those that need attention, the more value you add. Take events that are escalated to field agents, for example: the goal is to avoid wasting their time while still reducing your MTTR. The better-case scenario? Getting the field agent on the way to an actual event and running diagnostics while they’re en route.
Is Orchestration Undervalued? Probably.
A lot of confusion clouds today’s telcos in terms of automation and orchestration, and the differences between them.
Know this: Automation pairs well with isolated events. Orchestration takes these automations and puts them together in a flow. Both are of high importance to all telcos. More often than not, operators and technicians log into multiple devices to understand what’s happening. They look at several systems and various metrics – and this is where the orchestration magic happens. What about pinging hundreds of devices, checking for config changes, or seeing recent maintenance activity?
Because orchestration takes a complex process and automates it, it allows remote remediation and makes it easier to distill an issue down as much as possible before human intervention.
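To make the distinction concrete, here is a small sketch in which each function is one isolated automation and `run_flow` is the orchestration that strings them together. Every name is hypothetical, and the checks read from data passed in rather than touching real devices, so treat it as a shape rather than an implementation.

```python
def check_reachability(ctx):
    # Stand-in for a real ping; reads from data supplied by the caller.
    ctx["reachable"] = ctx["device"] in ctx["reachable_devices"]


def check_maintenance(ctx):
    # An alarm during planned maintenance needs no human follow-up.
    if ctx["device"] in ctx["devices_in_maintenance"]:
        ctx["resolved"] = True
        ctx["reason"] = "planned maintenance"


def check_config_change(ctx):
    ctx["recent_config_change"] = ctx["device"] in ctx["recently_changed"]


def run_flow(steps, ctx):
    """Run each automation in order, sharing one context dict.

    Stops early once a step marks the event as resolved.
    """
    ctx.setdefault("log", [])
    for step in steps:
        step(ctx)
        ctx["log"].append(step.__name__)
        if ctx.get("resolved"):
            break
    return ctx
```

If the flow finishes without resolving the event, the context already holds the reachability and config findings, so the engineer starts with the diagnostics done.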
Teams eventually find their way to orchestration as it naturally becomes the “next thing” over time. Thinking about orchestration as you evaluate your process brings a lot more value to your automation and orchestration journey, just by looking closely at the other systems you reach out to as part of that process. For example, being able to look at your configuration management database (CMDB) to learn the last time a config was changed, and then retrieving that config from the device, is certainly worthwhile and adds significant value to the existing process you and your team are working through.
A Resolve poll in Sept. 2023 found that half of today’s companies are using a combination of manual processes and in-house scripts, and that a quarter still rely mostly on manual processes. Using an automation and orchestration platform allows IT to use scripts they already have, and then add documentation and expand on them.
Sure, use your batch file that’s 10 years old and still valuable. But then get the win. Leverage it by adding steps before and after it to build the orchestration further. And then when it’s time to rewrite it, you can do so for only the reusable block to get an updated version that’s embedded into your existing process. It’s all about lowering the barrier to entry for creating a new automation when you “bring your own code.”
Self-healing: Not Just Jargon
Self-healing is made up of many, many levels. The first one, which might be easy to gloss over, covers the processes you’ve already documented and whose outcomes you therefore know.
Newer developments, like large language models and AI, drive plenty of new initiatives and provide important value when it comes to self-healing. The emerging intent-based networking (IBN) concept is an example of self-healing, as it observes network patterns, capabilities, and volumes and makes decisions based on a company’s policy.
You can get visibility into opportunities to compare automations and see where you can grow from a self-healing perspective, to save time and take advantage of auto-remediation.
Over in the NOC, automation can handle processes like card failures or card switches. In these instances, you can see the event come in and diagnostics run automatically to determine a potential hardware failure. You can identify the failed card, switch the card automatically, and then dispatch a ticket to have the issue repaired by field operations. Last, you’d follow up on that dispatch to understand the current alarm state, as well as the dispatch state, and then when the dispatch is closed and the alarm is clear, you can finally close out the event.
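The card failure flow can be sketched as a short sequence with a separate closing condition. The three hooks (`run_diagnostics`, `switch_card`, `dispatch_ticket`) are hypothetical stand-ins for your own tooling, and the status strings are illustrative.

```python
def handle_card_alarm(event, run_diagnostics, switch_card, dispatch_ticket):
    """Walk a card alarm through diagnose, switch, and dispatch.

    run_diagnostics(event) -> failed card id, or None if no hardware fault
    switch_card(card_id)   -> moves traffic to the standby card
    dispatch_ticket(card_id) -> id of the field repair ticket
    """
    record = {"event_id": event["id"], "actions": ["diagnosed"]}
    failed_card = run_diagnostics(event)
    if failed_card is None:
        record["status"] = "no_hardware_fault"
        return record
    switch_card(failed_card)
    record["actions"].append("switched")
    record["dispatch_id"] = dispatch_ticket(failed_card)
    record["actions"].append("dispatched")
    record["status"] = "awaiting_field_repair"
    return record


def can_close_event(alarm_cleared, dispatch_closed):
    """Close the event only when both the alarm and the dispatch are done."""
    return alarm_cleared and dispatch_closed
```

Keeping the close-out check separate reflects the follow-up step: the event stays open until field operations have finished and the alarm has actually cleared.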
The win comes by proactively preventing outages, automatically dispatching tickets, reducing wasted time, and setting card failures and card switches up for success in the future, along with maintaining overall network health. Risk mitigation is just an added bonus of automation’s ability to take care of events, which also benefits risk and compliance teams.
Be Proactive Today to Minimize Incident Impacts Tomorrow
If you’re still reacting to incidents, then you’re doing it all wrong.
Even if lack of time seems like a valid excuse, IT can’t let it block them from getting proactive. The NOC is a 24/7 function, and so is the pressure it faces. Simply put, NOC teams don’t have time not to be proactive, so they have to figure out how to get there.
In the real-world NOC, teams often say they don’t have time to pull someone off the floor for training because of all the moving pieces of incident management and handling alarms. But what if that person called off work?
It’s critical to find a way, within your business, to level up your approach.
For example, mergers and acquisitions can easily multiply the number of network devices, which arrive with different configurations and different company standards. Service assurance technicians then have to understand and differentiate between devices and networks and their standards, as well as troubleshoot based on those standards. Let’s say that in this case, none of the above was documented.
Decisions in this example need to produce a single standard that applies to every device, ensuring that troubleshooting runs consistently across them. A process can be put in place to proactively gather the config information held in various software and check that provisioning aligns with the chosen company standard. From there, necessary changes are identified and scheduled for the most suitable time of day.
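A check like this could start as a comparison of each device’s settings against the chosen standard. In this sketch the setting names and device names are made up for illustration, and configs are assumed to have been collected already as simple key/value dictionaries.

```python
def find_config_drift(standard, device_configs):
    """Compare each device's config against the company standard.

    standard:       {setting: expected_value}
    device_configs: {device_name: {setting: actual_value}}
    Returns {device_name: {setting: (expected, actual)}} for deviations only;
    a missing setting is reported with an actual value of None.
    """
    drift = {}
    for device, config in device_configs.items():
        diffs = {
            key: (expected, config.get(key))
            for key, expected in standard.items()
            if config.get(key) != expected
        }
        if diffs:
            drift[device] = diffs
    return drift
```

The output is effectively the change list: each entry is a device and setting to bring in line, ready to be scheduled for a suitable maintenance window.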
The more actions you take to get proactive, the more time you free up for your technicians to start thinking proactively, too. Sure, you’ll identify new work to do, and there will always be something up next, but starting to remediate and automate simple, noisy tasks will open up more time to think about the next step before you get to it.
Transforming your incident management with automation will work wonders for any NOC and prepare it for future success – no matter the pressure. For more, watch our on-demand webinar, “Incident Management Reinvented: 5 Ways to Pioneer NOC Success through Automation.”