
Scaling Up to Keep Costs Down: Automation for Web Application Incident Management  

Written By Derek Pascarella
Aug 8, 2023

Any organization that’s keeping up with today’s sharp rise in business demands (or better yet, getting ahead of the game) is doing so by getting innovative and jumping at the chance to do things differently.  

They’re not relying on the old ways or trying to use their existing toolbox. Instead, organizations are looking to the newest technologies and means of adding efficiency to as many day-to-day functions as possible.  

IT is a critical department that’s especially feeling the heat. IT teams are expected to handle so many tasks that the to-do list can be unreasonable, if not impossible at times. Want proof? Take a look at IT incident response, a primary area of responsibility for these teams that, in all reality, cannot be done by human hands alone, no matter how many folks are on deck.

In our recent blog, we talked about automation’s impact on IT incident response, specifically when remediating low disk space. An unmanageable amount of data processing requires the help of analytics, machine learning (ML), and more.  

A closely related incident response use case for automation is a web application that is down.

Mitigating the Risk of Costly Downtime: Only One Way Out 

Downtime is expensive; there’s no doubt about that. In fact, it runs about $5,600 (closer to $9,000 for a large organization) for every minute that a digital service is down, according to a Gartner study from 2014. Per hour, downtime can cost between $145,000 and $450,000, depending on the size of the company and on business and environment characteristics like vertical and risk tolerance.
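To put the per-minute figure in perspective, here is a minimal Python sketch that turns it into a cost for an outage of a given length. The function name `downtime_cost` and the flat-rate assumption are ours, for illustration only; real costs vary with the factors mentioned above.

```python
def downtime_cost(minutes: float, cost_per_minute: float = 5_600.0) -> float:
    """Estimate outage cost at a flat per-minute rate (a simplifying assumption)."""
    if minutes < 0:
        raise ValueError("minutes must be non-negative")
    return minutes * cost_per_minute


# A 15-minute outage at the cited average rate of $5,600 per minute.
print(f"${downtime_cost(15):,.0f}")  # $84,000

# A full hour at the same rate.
print(f"${downtime_cost(60):,.0f}")  # $336,000
```

Even a quarter-hour outage lands in the tens of thousands of dollars, which is why speed of diagnosis and repair matters so much.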

In the past, having good incident resolution meant having a skilled workforce that followed a standard procedure. Perhaps a crew of 20 or so would work 24/7 shifts to make sure digital services stayed up and running properly. And just maybe, those days seem like a nightmare to IT professionals who ran shop during those times, even with well-defined IT procedures and tested playbooks.  

But now, a 24/7 human team just doesn’t scale, and a key priority is figuring out how to mitigate the hefty downtime cost. After all, you can’t completely avoid some degree (even a tiny one) of human error.  

The potential of losing just one minute is something many businesses can’t afford. Whatever other steps are taken, organizations must change their approach toward these IT procedures. Automation is the only way to make it work, because it lets incident response scale with the business.

Speed is the name of the game. How soon can an issue be diagnosed, and how soon can it be fixed? 

4 Metrics for Measuring Incident Management Effectiveness 

On the business side of things, there are four KPIs IT teams should focus on and track to make sure their incident management is successful.

System Uptime and SLAs: System uptime, together with the service levels and quality delivered, indicates performance. From an application and business perspective, it is one of the most important KPIs.

Mean Time to Identify (MTTI): Before an organization can repair an issue, it must identify the problem, which makes “putting a label” on what’s wrong a highly important part of incident management. After all, a question can’t be answered before knowing what, exactly, is being asked.

Mean Time to Resolution (MTTR): Once the problem has been detected and named, it can be mitigated. This metric is critical because it indicates how well your organization troubleshoots a problem and analyzes its root cause.

Cost and Efficiency: Organizations want to see a reduction in MTTR over time, along with a decrease in incidents and tickets. A drop in the volume of work handled manually, together with a push for automation, will most likely result in fewer incidents and much more efficiency, and as time goes on, costs will decrease, too.
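As a rough illustration of how the first three metrics can be tracked, here is a small Python sketch that computes MTTI, MTTR, and uptime from incident timestamps. The `Incident` record, the sample data, and the choice to measure both MTTI and MTTR from the start of the outage are assumptions made for this example; your ticketing or monitoring system may define them differently.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from statistics import mean


@dataclass
class Incident:
    started: datetime     # when the outage began
    identified: datetime  # when the problem was identified
    resolved: datetime    # when service was restored


def _minutes(delta: timedelta) -> float:
    return delta.total_seconds() / 60


def mtti_minutes(incidents):
    """Mean time to identify, measured from outage start (assumption)."""
    return mean([_minutes(i.identified - i.started) for i in incidents])


def mttr_minutes(incidents):
    """Mean time to resolution, measured from outage start (assumption)."""
    return mean([_minutes(i.resolved - i.started) for i in incidents])


def uptime_percent(incidents, period: timedelta) -> float:
    """Share of the reporting period during which the service was up."""
    down = sum(_minutes(i.resolved - i.started) for i in incidents)
    return 100 * (1 - down / _minutes(period))


# Hypothetical sample data: two outages in a 30-day reporting period.
incidents = [
    Incident(datetime(2023, 8, 1, 10, 0), datetime(2023, 8, 1, 10, 10),
             datetime(2023, 8, 1, 10, 40)),
    Incident(datetime(2023, 8, 15, 22, 0), datetime(2023, 8, 15, 22, 20),
             datetime(2023, 8, 15, 23, 20)),
]

print(mtti_minutes(incidents))  # 15.0
print(mttr_minutes(incidents))  # 60.0
print(round(uptime_percent(incidents, timedelta(days=30)), 3))  # 99.722
```

Tracking these numbers period over period is what makes the fourth metric, cost and efficiency, visible: if automation is working, MTTI and MTTR should trend down while uptime trends up.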

Automating tasks that were once manual opens doors for new productivity – and doing more with less. Organizations can gain a tangible way to think about improvement and create a strategy to reach new goals.  

The SOPs Most Suitable for Automation  

A very important term in the world of automation, a standard operating procedure (SOP) is a set of step-by-step instructions created by an organization, oftentimes within IT. SOPs are the tasks, procedures, and checklist items that are carried out on a regular basis.

An SOP aims to create efficiency, reduce failures (like those associated with human error), and produce a predictable, desired outcome. Organizations benefit from unifying elements of their SOPs and, at the end of the day, will be set up for success in meeting company- and industry-specific compliance rules and regulations. SOPs must be defined before automation can be created.

It’s infeasible to automate procedures that are not standardized and regularly tested. Because automation amplifies whatever it is applied to, applying it to a sound SOP can be of great benefit to today’s businesses over time. On the other hand, it’s important to avoid automating procedures that might be a bit out of the box. Simply put, automating chaos will just make it more chaotic.

Organizations must keep certain SOPs in place, but also remember that a procedure that works when done manually doesn’t necessarily work the same way when done automatically. It’s important to take care when designing and testing a specific set of SOPs inside an automated solution, and to make sure each one works in an automated fashion the way the organization originally intended.

Remediating a Web App Down, In Real Life  

Let’s say an alert comes through stating that one of your organization’s on-premises web hosting applications is unresponsive.  

Resolve can intercept this alert in real time and, early on, rule out the possibility of a false positive.

The outage alert starts a process that, by integrating with an IT operations monitoring system, posts bi-directional updates to that system about each task the automation performs along the way. Then, a ticket is created with pertinent information learned from the IT operations monitoring system, like the alert ID, the affected host, the type of alert, and even the proper configuration item.

This collection of information allows for confirmation that the application is in fact unresponsive and that, for instance, IIS isn’t running on the server. In that case, Resolve starts it up. The application pool is also checked, and in this example, it’s in a recycled state from which Resolve brings it back online. The final step in this example is to check for a normal response from the web application. Upon confirmation, Resolve automatically closes out the ticket, adds some closure notes in the proper information field, and calls it a day!

In just a few diagnostic steps, Resolve automatically makes sure two things happen: the information is tracked and documented in a ticketing tool, and ultimately, a resolution is reached.
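To make the sequence concrete, here is a simplified Python sketch of the same flow: rule out a false positive, check the web server, check the application pool, verify the response, and close the ticket. This is not Resolve’s implementation or API. The `Ticket` record, the `remediate_web_app` function, the `FakeHost` stand-in, and the escalation branch are all hypothetical, and the checks are passed in as plain functions so the logic can run without a real server.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class Ticket:
    alert_id: str
    host: str
    alert_type: str
    notes: List[str] = field(default_factory=list)
    status: str = "open"


def remediate_web_app(
    ticket: Ticket,
    app_responds: Callable[[], bool],
    service_running: Callable[[], bool],
    start_service: Callable[[], None],
    pool_running: Callable[[], bool],
    start_pool: Callable[[], None],
) -> Ticket:
    # Step 1: rule out a false positive before touching anything.
    if app_responds():
        ticket.notes.append("False positive: application responded normally.")
        ticket.status = "closed"
        return ticket

    # Step 2: make sure the web server service is running.
    if not service_running():
        start_service()
        ticket.notes.append("Web server service was not running; started it.")

    # Step 3: make sure the application pool is online.
    if not pool_running():
        start_pool()
        ticket.notes.append("Application pool was not running; started it.")

    # Step 4: verify, then close the ticket or hand off to a human.
    # The escalation branch is an assumption, not part of the example above.
    if app_responds():
        ticket.notes.append("Application responding normally; closing ticket.")
        ticket.status = "closed"
    else:
        ticket.notes.append("Still unresponsive after remediation; escalating.")
        ticket.status = "escalated"
    return ticket


class FakeHost:
    """In-memory stand-in for a server so the sketch runs anywhere."""

    def __init__(self) -> None:
        self.service_up = False
        self.pool_up = False

    def responds(self) -> bool:
        return self.service_up and self.pool_up

    def start_service(self) -> None:
        self.service_up = True

    def start_pool(self) -> None:
        self.pool_up = True


host = FakeHost()
ticket = remediate_web_app(
    Ticket(alert_id="ALERT-1", host="web01", alert_type="web_app_down"),
    app_responds=host.responds,
    service_running=lambda: host.service_up,
    start_service=host.start_service,
    pool_running=lambda: host.pool_up,
    start_pool=host.start_pool,
)
print(ticket.status)  # closed
for note in ticket.notes:
    print("-", note)
```

In a real environment, each injected function would wrap an actual health check or service command, and every note appended here would also be posted back to the monitoring and ticketing systems.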

Dig deeper into remediation of web application outages by watching this brief LinkedIn Live video and product demo replay on Resolve’s YouTube Channel! 

Adding automation to the mix allows organizations to scale up as business demands put mounting pressure on IT teams. Especially when it comes to nearly a half-million dollars being lost after just one hour of downtime, it’s no longer a question of whether or not to automate. Organizations that rely on a human-based crew to manually address web applications being down simply won’t survive without the help of automation.  

Don’t let your IT team fall behind when a digital service drops. Request a demo to learn how to lead during web application outages.  

This blog is the sixth part of our “The 7 IT Automations for Highly Effective Organizations” series, with a new blog dropping every Tuesday this summer. Inspired by Stephen R. Covey’s bestseller, The 7 Habits of Highly Effective People, we believe the seven automations we write about will help transform IT and businesses for the better – sustaining lasting success through upgraded and improved capabilities.  


About the author, Derek Pascarella:

Global Director of Sales Engineering

Derek Pascarella, Senior Sales Engineer at Resolve Systems, is an experienced and well-rounded IT professional with a diverse technical skill set, emphasizing problem-solving and group collaboration. His expertise, combined with strategic thinking, puts him in an optimal position to execute thorough, clear solutions to problems. Derek is also seasoned in stepping outside of his role to work in and manage cross-functional initiatives.