In ancient times the task of threshing, separating edible grains of wheat from chaff, was performed with an agricultural tool called a flail. Modern farmers accomplish the same undertaking at significant scale using huge mechanized vehicles called combines. When it comes to threshing the real from the imagined about AIOps, many IT executive decision-makers feel like they’re flailing about. Everywhere they look, vast undulating fields of marketing buzz extend over the horizon, making inflated claims about what AIOps can do for them. If only there were a metaphorical combine to sift facts from hype, expectations regarding the practical capabilities of AIOps would be more realistic, leaving the disappointment of unrealized objectives to flutter away like chaff in a light breeze.
Since no such apparatus exists, metaphorical or otherwise, we turn instead to John Gentry, CTO & Sr. VP of Virtana, a leading AIOps provider. John’s trenchant perspective on the current state of AIOps & its near-term prospects leaves listeners with a clear-eyed, pragmatic view of how this transformative technology is reshaping the data center. Along the way we’ll learn what differentiates General Purpose AIOps from Domain-Specific AIOps, how AIOps helped one major company grow their transaction volume 700% while only scaling their infrastructure 300% with no head count increase, and why geography might be the key to attracting data scientists who perform the critical work AIOps relies upon.
Guy Nadivi: Welcome, everyone. My name is Guy Nadivi, and I’m the host of Intelligent Automation Radio. Our guest on today’s episode is John Gentry, Chief Technology Officer and Senior Vice President at Virtana, a leading AIOps platform for digital transformation. The job of monitoring and managing IT operations continues to grow progressively more complex and increasingly more mission-critical as enterprises move more of their processes to the digital realm. And AIOps is the solution many senior IT professionals are counting on to help their organizations keep up. But can the application of AI, machine learning, data science, automation, and other technologies provide IT with what it needs? John Gentry is a forthright advocate for the innovation and positive disruption AIOps can bring to bear on enterprise infrastructures. So we’ve brought him onto the show today to help our audience understand how real AIOps is today and what its likely prospects are for the future. John, welcome to Intelligent Automation Radio.
John Gentry: Thank you very much, Guy. I appreciate you having me. It’s a pleasure to be here and I really look forward to sharing at least my experience with customers and the state of AIOps in the real world.
Guy Nadivi: And that’s exactly what we’re about to dive into. AIOps is very much a market in rapid evolution. Even the acronym, AIOps, itself represents something different today than it did when a Gartner analyst named Colin Fletcher coined it in 2016. Originally, Colin meant the term to refer to Algorithmic IT Operations, but since then it’s evolved to refer to Artificial Intelligence for IT Operations. John, for the people in our audience struggling to synthesize all the marketing messages they hear about AIOps, how do you define what AIOps actually does?
John Gentry: That’s a great question and certainly something that needs clarification and will probably continue to need clarification and refinement for quite some time. I know that it’s even evolved since the rebrand to artificial intelligence where now they’re categorizing two subcategories. One called General Purpose AIOps, which is applying data science and AI and machine learning to general sets of data for things like pattern recognition, and anomaly detection, maybe deduplication. And then they’ve entered a new term called Domain-Specific AIOps, which is really applied to just that, a specific domain, whether that’s a certain component of infrastructure operations or the application of AI to application performance monitoring. So I think it’s going to be continuing to refine, I think filtering out the noise. I often say to customers asking questions around how to effectively leverage it, that they peel back the onion. There’s a lot of marketing claiming that there’s artificial intelligence or machine learning, two very distinct things, being applied to their platforms. And if you look under the covers, there may be some fairly simple statistical regression or pattern matching, but there’s really no deep machine learning or other applications of advanced science. So when I look at how you define AIOps and what it actually does, I think you need to look at applying very specific mathematical or data science approaches to very specific sets of data that have context to reach purposeful or predictive outcomes. And so that can really introduce a wide spectrum, but I definitely recommend peel the onion back, look for the true application of various forms of data science to very specific sets of data to solve very defined problems.
Guy Nadivi: Okay, that’s the kind of specificity the market needs. Now speaking of the market, there have been some high-profile transactions in the AIOps space this year with ServiceNow acquiring Loom Systems, LogicMonitor acquiring Unomaly, and VMware acquiring Nyansa. Virtana has made a few acquisitions itself, but still remains an independent AIOps vendor. If an organization is considering AIOps for their environment, why go with an independent vendor rather than a larger software vendor who includes AIOps as part of a suite of offerings?
John Gentry: That’s again a very good question because there are definite trade-offs in the two approaches or in the selection of one versus the other. In some cases it may be a matter of actually leveraging both. When you look at some of these acquisitions, I think there was a realization in some of the traditional approaches that companies like ServiceNow or VMware have taken historically to IT operations that were organization of information and visualization of information. And they were potentially behind the curve in terms of natively developing AI as an intrinsic part of that process. Something unique about Virtana, we recognized because of the sheer scale of the data that we collect, the need to start applying data science and applying analytics back in the 2013 timeframe. Several years of experience, not only with understanding the data, but with applying the right data science to get to a specific outcome, I think those big providers, they’re going to be challenged to leverage or layer in the intellectual property they have acquired effectively. That would be a caution I would put around looking to a more of a generic solution provider that’s embedded AIOps versus an independent. At the same time, they do have one aspect of the approach that I fundamentally agree with, which is AIOps is one part of a larger puzzle in terms of driving towards things you mentioned in the opening like automation and automation & governance. So, no, I think even your independents need to start considering AIOps as one capability that they bring to bear, but not the definition of everything that they do. I think AIOps is really one aspect of the overall sort of continuum of IT operations when you start to look at everything from the rise of DevOps and the blurring of the lines between App Dev and Operations. 
And certainly when you look at the rise of public and now what’s presumably going to be hybrid cloud environments, and where can you effectively leverage AIOps to help with decision making around what’s the right place to place workloads? That’s a very different application typically of things like Monte Carlo simulation or advanced modeling versus AIOps with statistical regression analysis or pattern matching to find recurring problems and predict them and prevent them. I think you’ve really, going back to the first question, got to understand what am I solving for, and then am I best solving for that by looking at an additional capability of an existing vendor, like a ServiceNow or VMware, or do I need to find something more purpose-built for the specific challenge I’m trying to overcome, be that problem avoidance in a mission critical system or hybrid cloud transformation in a more bleeding edge sense.
Guy Nadivi: At the time of this podcast recording, the world finds itself in the midst of a global health emergency due to the coronavirus. So John, I think a lot of our listeners would be very interested to hear what contributing role can AIOps play in the current COVID-19 pandemic to help facilitate the sudden surge in people working from home?
John Gentry: Well, that’s one of the things I can say with some certainty is that certainly our solution, which understands seasonality and trends for customers like online trading and transaction processing platforms, most of what is normal, according to the machine learning, is no longer normal, right? So I think one of the easiest or most immediate applications was very quickly identifying just how significant a deviation from normal, the changing in work patterns actually represented, and being able to adjust or accommodate more quickly. I can tell you from personal experience, going back to that online trading company, the fact that they had years of leveraging AIOps and the application to understanding that the traffic patterns associated with things like market open and market close and aftermarket for trading and futures analysis meant that when those patterns all changed, they could very quickly identify where they needed to shift resources to accommodate new and different patterns. We actually reached out to them early in the pandemic when the markets were going significantly haywire, just to make sure they had everything under control. And they actually commented back that it was because of Virtana and the systems that were under management through that solution, that they weren’t having issues in that particular aspect of their business. Now, there were issues in other aspects just because it’s hard to manage everything at once, but that was encouraging to see them, real world application of what they knew about their business to accommodate very quickly identified changes. At the same time we had another customer that if you can believe it or not, a very traditional financial institution, was 100% in-office workforce. They had to very quickly transition to 100% remote. Being a financial institution they had to do so with as little disruption to their customers as possible. 
They actually used their understanding of predicted capacity forecasting to reallocate resources from traditional applications that were quite frankly no longer being used in any kind of meaningful way, from a volume perspective, and immediately redeploy those resources to support a VDI environment and scale that VDI environment to get their workforce back online. Actually managing the data center capacity or compute and storage capacity to deliver on that transformation was the easy part in their opinion. The harder part was the sheer shipping and logistics of getting all their employees set up with laptops or desktops and monitors and phone lines, but they managed to do it. Thousands of employees went from in the office to fully remote in 10 working days, which I think is a testament to the efficiency of their organization. Certainly it’s a lot more than just IT or AIOps. It’s a lot of really good people and process, but those are two very different examples. One is the ability to accommodate massive change by understanding it. The other is the ability to reallocate resources to new applications because of the level of visibility they had ongoing in their environment.
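The first example above, spotting that "normal is no longer normal," rests on comparing current load against a learned seasonal baseline. The following is a minimal sketch of that idea only, not Virtana's method: the function names, the four-slot cycle, the sample numbers, and the three-standard-deviation threshold are all assumptions made for illustration.

```python
from statistics import mean, stdev

def seasonal_baseline(history, period):
    """Group past observations by position in the cycle (e.g. hour of day)
    and return a (mean, standard deviation) pair for each slot."""
    slots = [[] for _ in range(period)]
    for i, value in enumerate(history):
        slots[i % period].append(value)
    return [(mean(s), stdev(s)) for s in slots]

def deviations(baseline, observed, threshold=3.0):
    """Return (slot, z_score) for every observation that sits more than
    `threshold` standard deviations away from that slot's historical mean."""
    flagged = []
    for slot, value in enumerate(observed):
        mu, sigma = baseline[slot % len(baseline)]
        # A slot with no historical variance is skipped rather than dividing by zero.
        z = (value - mu) / sigma if sigma > 0 else 0.0
        if abs(z) > threshold:
            flagged.append((slot, round(z, 1)))
    return flagged

# Hypothetical history: four cycles of four slots, with a peak in slot 1
# (think "market open"). The numbers are invented for this sketch.
history = [100, 400, 150, 120,
           110, 420, 140, 130,
           90, 390, 160, 110,
           105, 410, 155, 125]
baseline = seasonal_baseline(history, period=4)

# A day on which the usual peak disappears and load moves to the last slot.
today = [100, 150, 150, 400]
print(deviations(baseline, today))
```

With this made-up data, slots 1 and 3 are flagged: the expected peak is missing and an unexpected one has appeared, which is the kind of signal that tells an operator where to shift resources.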
Guy Nadivi: Okay, let’s dive a bit deeper on specific results of AIOps. Can you provide some details on two or three of the most successful AIOps use cases you’ve been involved with? And if possible quantify the impact they’ve had.
John Gentry: Absolutely. I’ll actually talk about two specific ones that are really very different in terms of the business impact, in the very nature of moving or the promise of moving from traditionally reactive to ultimately proactive and even predictive operations. One is a fairly significant customer. They’ve been with us over nine years and their online transaction processing business, in those nine years, they’ve seen their transaction volume grow by somewhere on the order of 700%. You can imagine going from 24 million transactions or 24 million users to 400 million users over the course of nearly a decade. They’ve leveraged AIOps to understand how to gain efficiencies in both the infrastructure and the operations supporting that growth. And so while they’ve seen a 7x increase in their transaction volume, they’ve only had to scale their infrastructure by about 3x, right? So that’s just eking every bit of performance and capacity out of what they’re building to support their business. And that’s absolutely the result of the application of AIOps to understand those business patterns and cyclical patterns that mean I can move resources around versus just building excess capacity to accommodate a burst. And at the same time, they’ve been able to do that and maintain that growth with the same staff. They have the same number of people managing what is a 3x infrastructure and a 7x transaction volume as they did nine plus years ago. So when you think about their business, the ability to transact at a higher volume with less infrastructure, and to do so at ever increasing efficiency per head count, those both go directly to the bottom line. They go directly to the margin associated with every transaction. If it costs me less in infrastructure and it costs me less in people per transaction, that means that I’m affecting the bottom line from a cost perspective and driving profitability.
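The margin argument above can be made concrete with a little arithmetic. This sketch uses only the multiples quoted in the interview (7x transaction volume, 3x infrastructure, unchanged head count) and assumes, purely for illustration, that cost scales linearly with each resource. The function name is hypothetical.

```python
def relative_unit_cost(resource_growth, volume_growth):
    """Cost per transaction relative to the starting point, assuming cost
    scales linearly with the resource (an assumption of this sketch)."""
    return resource_growth / volume_growth

# Multiples quoted in the interview: 3x infrastructure, 1x staff, 7x volume.
infra = relative_unit_cost(resource_growth=3, volume_growth=7)
staff = relative_unit_cost(resource_growth=1, volume_growth=7)

print(f"infrastructure per transaction: {infra:.0%} of original")
print(f"head count per transaction: {staff:.0%} of original")
```

Under that linear assumption, each transaction carries roughly 43% of its original infrastructure cost and roughly 14% of its original staffing cost.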
So that’s one area: gleaning those efficiencies, that promise of AIOps to automate the mundane and optimize resource utilization and move from reactive to proactive problem prevention so you don’t need a bunch of head count doing firefighting. And so that’s one example. Another example is actually a more recent customer, also a SaaS provider. Interestingly enough, we seem to have a strong market presence in software as a service, likely because the software is the business and the infrastructure to support it is mission critical. But in this particular case, it was leveraging AIOps to drive competitive advantage and actually drive time to market. So this particular customer, they make their money through new feature, new capability introduction, right? They’re a large real estate property management firm. They’re the leader in SaaS-based property management. And when they introduce a new capability, like say risk assessment based on geography, or rent price elasticity analysis based on economics, that’s a new service that their customers can subscribe to and glean benefit from. What they actually did as a company, they are using Virtana in conjunction with one of our partners in the APM space to really drive a single point of view from early phases of application development all the way through production deployment to ongoing sustaining. And what it’s allowed them to do is one, eliminate any finger pointing between App Dev and IT Ops to say, “Well, you wrote bad code.” “No, you got bad infrastructure.” And say, “No, actually, we know exactly where the efficiencies are between the two,” but more importantly what they realized was they could actually build infrastructure that was more efficient at supporting the code, right? Or they could actually write code that better leveraged the underlying infrastructure.
I think about my background in storage, the move to all flash, and the fact that all flash is really, really good at processing a very specific block size in terms of a request. And if you tune your query from your database to actually request that block size, you’re going to get a lot more performance for a lot less capacity in that environment. And so what they were able to do was drive efficiencies between code and infrastructure, and that actually accelerated cycle time from Dev to DevOps, Dev Test, to actual production release, to go live. And so if they can now take what used to be a 12 week process for building and releasing a feature and cut that to six weeks, they just gained six weeks of market time for that to be in market. That’s competitive advantage from a time to market. It’s competitive advantage from a differentiation and at the end of the day, because they’re doing it with a more highly optimized code to infrastructure relationship, it’s competitive advantage in terms of profitability as well. I’ve seen applications of AIOps, when done in a sophisticated manner (and again, it takes people and process behind it to get to those outcomes), drive cost down and profit up, as well as really convert or transform that IT organization into a competitive advantage and differentiator for the business.
Guy Nadivi: John, one criticism I’ve heard about AIOps is that it tends to lack a risk analysis capability. In other words, it can tell you when an incident or issue needs your attention, and it can suggest to you what kind of attention that incident or issue needs, but it doesn’t provide a human operator any background about the risk of taking any of those suggested actions, perhaps with some historical context added about what broke the last time one of those particular actions was executed. One Gartner analyst told me that lots of AIOps vendors are claiming they can do this, but it only seems to work in the operating system called PowerPoint. How soon do you think we’ll start seeing risk analysis capabilities in AIOps?
John Gentry: That is a foundational question from my perspective, certainly given that the nature of our customer base being really blue chip and mission critical environments. I think I would answer that in a couple of ways. I mentioned before that reliance not just on technology, but on good people & process. And so I think there is an approach that I know I recommend and discuss with certainly my more conservative customers, this concept of automation with governance, right? And I think first you have to draw a distinction between AIOps as a learning problem identification and remediation platform where I can identify a problem. If I have a known resolution, I can make a recommendation and separating that from automating taking that action. Right? And I look at AIOps as being sort of the intelligence behind automation, the decision support, if you will. But then to introduce that change through a typical change management or typical governance frameworks. You mentioned ServiceNow earlier. That’s an obvious choice to say I’m going to identify the issue proactively and make a recommendation, push that recommendation along with the underlying root cause analysis to a change management request. And then I am going to introduce that human operator, that judgment call to approve that, to do a manual risk assessment and then execute. I think the next step from there, and certainly where we are in that journey, is having the closed loop visibility to say once I’ve executed that and I have an outcome that I’ve observed, and that outcome was as anticipated or what was predicted in the recommendation engine, as once I do that X number of times, pick your level of risk tolerance, go ahead and automate it, but still introduce or note the change in my change management system so I have an audit trail if for some reason the first 10 times it went great and the 11th it didn’t. That’s really where that machine learning underneath the surface becomes important. 
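The progression John outlines here (recommend, route through human approval, observe the outcome, and automate only after enough clean runs while still logging every change) can be sketched as a small gate. This is an illustration under stated assumptions, not any vendor's implementation: the class name, the action name, and the choice to reset trust after one unexpected outcome are all hypothetical.

```python
from dataclasses import dataclass, field

@dataclass
class GovernedAutomation:
    """Require human approval for an action until it has matched its predicted
    outcome `required_successes` times in a row; log every execution."""
    required_successes: int = 10  # hypothetical risk-tolerance setting
    streak: dict = field(default_factory=dict)
    audit_log: list = field(default_factory=list)

    def needs_approval(self, action: str) -> bool:
        return self.streak.get(action, 0) < self.required_successes

    def record(self, action: str, matched_prediction: bool,
               approved_by: str = "auto") -> None:
        # Every execution is logged, automated or not, so an audit trail exists.
        self.audit_log.append((action, approved_by, matched_prediction))
        if matched_prediction:
            self.streak[action] = self.streak.get(action, 0) + 1
        else:
            # Assumed policy: an unexpected outcome sends the action back to review.
            self.streak[action] = 0

gate = GovernedAutomation(required_successes=3)
for _ in range(3):
    # While trust is being earned, an operator approves each change request.
    gate.record("rebalance-storage-path", matched_prediction=True,
                approved_by="operator")
print(gate.needs_approval("rebalance-storage-path"))  # False after 3 clean runs

# The automated run misbehaves, so the action drops back to manual approval.
gate.record("rebalance-storage-path", matched_prediction=False)
print(gate.needs_approval("rebalance-storage-path"))  # True again
print(len(gate.audit_log))  # 4 entries: the audit trail covers every run
```

In practice the audit log would be the change management system itself; the in-memory list stands in for it here.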
So I think AIOps or automation with governance is key. And I think it is going to take an ecosystem, not just a single provider, because again that domain specific knowledge is critical. I’m not going to make certain recommendations around things like firewall settings, certainly, because I’m not a security expert and my platform knows nothing about security, right? So I think that we really want to take a measured look. In terms of when it gets into actual product offerings, I do think it’s mostly PowerPoint today. I think it’s again, an application of AI in a certain environment. The one distinction I’ll make is starting to introduce something called … that is very much about measured risk. And so you mentioned that we had made some acquisitions earlier. Actually it was late last year, and we acquired a company called Metricly and that’s now our Cloud Wisdom offering, and Metricly, one of their core pieces of IP that was intriguing to us was their very advanced anomaly detection and the way in which they had leveraged that along with very specific heuristics to introduce the ability to manage risk. And so Cloud Wisdom is a SaaS-based cloud monitoring tool. It’s really an acquisition to take us to hybrid cloud and cloud cost management because our customers said, one, my cloud costs are out of control, but two, I can’t just slash and burn. I need to understand the impacts on things like performance and availability or risk tolerance of my business. And that was one of the unique aspects that they actually introduced in the product, which was the ability to dial in very specific thresholds for how much performance is required and how much cost is important to me and how much risk tolerance do I have around this particular workload. And then run that through advanced heuristics to say here’s the optimal balance of reserved capacity versus on demand capacity to meet the optimal point of maximizing or optimizing performance, cost, and risk.
And that was one of the first applications I had seen really calling out risk as a variable or risk as an input in those decisions. So it is starting to come. We’re rolling that capability into our broader hybrid cloud management offering, trying to bring that same type of analysis back on-prem, and then let customers manage the balance between what should live on-prem, what can live in the public cloud, and how might I arbitrate that workload placement going forward with a keen understanding of risk. I mean, we have one customer that is a very large SaaS provider and they said flat out in looking at the platform, “We’re going to dial the risk tolerance all the way to the tightest. We have zero tolerance for risk and we’re willing to pay a little extra or even a lot extra as a result.” So they actually appreciated that they could mitigate risk by dialing that knob in a certain direction versus another customer that’s only looking at cloud for noncritical systems and said, “I’m willing to make the risk factor a nine and save as much money as possible in the process.” So I do think it’s starting to enter the conversation, but I do think Gartner actually had this one fairly right in terms of its maturity. It’s mostly in concept with a few real world applications.
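To make the "risk dial" idea tangible, here is a small sketch of trading reserved against on-demand capacity under a risk setting. It is not the heuristic used in any real product. The 0-to-9 scale echoes the "risk factor a nine" remark, but the mapping from dial to coverage, the prices, and the demand samples are all invented, and the search is a plain brute-force scan.

```python
import math

def plan_capacity(demand, risk, reserved_rate=0.6, on_demand_rate=1.0):
    """Pick a reserved capacity level for hourly `demand` samples (integer units).

    risk: 0 (serve every observed peak) .. 9 (accept the most shortfall).
    The rates and the risk-to-coverage mapping are illustrative assumptions.
    """
    ordered = sorted(demand)
    # Hypothetical mapping: each risk step gives up 2% of the busiest hours.
    coverage = 1.0 - 0.02 * risk
    cap = ordered[max(0, math.ceil(coverage * len(ordered)) - 1)]
    served = [min(d, cap) for d in demand]  # demand above the cap goes unserved

    def cost(reserved):
        # Reserved units are paid for every hour; overflow is bought on demand.
        overflow = sum(max(0, s - reserved) for s in served)
        return reserved * reserved_rate * len(served) + overflow * on_demand_rate

    best = min(range(cap + 1), key=cost)  # brute-force scan of reserved levels
    return {"reserved": best, "peak_served": cap, "cost": round(cost(best), 2)}

# Invented demand profile: mostly flat, some busy hours, two large spikes.
demand = [10] * 40 + [20] * 8 + [50] * 2

cautious = plan_capacity(demand, risk=0)  # zero tolerance: serve the spikes
thrifty = plan_capacity(demand, risk=9)   # cost first: let the spikes go
print(cautious)
print(thrifty)
```

With these numbers the zero-tolerance plan serves the full peak and costs more, while the risk-nine plan caps what it serves and spends less, which mirrors the two customers described above.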
Guy Nadivi: Well, let’s discuss a different type of risk. The entire value proposition of AI and machine learning being able to get you to the point where you can predict failures before they happen and mitigate them in advance is heavily predicated on one particular skill, data science. If you don’t have the data scientist to build the algorithms to generate the predictions, then you can’t leverage AIOps. And right now there is a very big shortage of data scientists. The August 2018 LinkedIn workforce report stated that there was a nationwide deficit of over 150,000 data scientists. How will companies like Virtana overcome this staggering talent shortage?
John Gentry: Oh, it is absolutely challenging. I think it brings up an interesting sort of trade-off that I’m certainly seeing in my customer base. We have a little bit of advantage because we did have a head start. We started building out that capability before there was a massive shortage. We were a little bit ahead of the curve. I’ve seen customers trying to balance the decision of build versus buy, right? There were a lot of companies that were naturally in the data science business. Think about insurance firms that have large teams doing multivariate analysis for risk assessment and setting policy pricing and so forth. I saw a lot of them early on saying, “Well, we’re just going to pump everything into a massive data lake and apply our data scientists to it. And we’ll build a solution that will solve world hunger in our data centers, but we’ll do it on our own.” I think that is less and less of a reality because the thing that’s lost in just genericizing data science is there’s a level of applied knowledge that has to exist for the data scientist to actually pick the right math and know what outcome they’re solving for algorithmically. And so I think you’re going to see more companies looking to buy the capabilities out of the box as opposed to build them themselves, which should help to some extent by letting data scientists be leveraged across customers. If I can build a capability into a product and multiple entities can leverage that product, I don’t need to have a data scientist in all those entities. Now that’s really the trade-off between the customer and the vendor in terms of the build versus buy. In terms of the competition in the vendor ecosystem for those data scientists, it’s forcing people to look very hard at geography. Interestingly enough, we made the very conscious decision to open a development office in Bend, Oregon, because frankly, the competition for talent in the Valley is extremely fierce.
A lot of the next generation of data scientists, they want to be in a beautiful part of the world that’s not overly urban, that offers them lifestyle choices that allow them to do more than just their job. And we’ve had great success with that location and actually really happy employees there that tend to write really good code. I think you’re going to see that kind of creative sourcing and creative appeal. Not just money and marquee name of company, but also whole lifestyle offerings that attract those data scientists. And then you’re going to see hopefully a lot more students and a lot more diversity in the workforce going into the field of data science, and maybe even certainly some reskilling of the labor force. I think this is such a wave of transformation and the promise is significant enough that I think you’re going to see some retooling to fill that gap. Not just new grads or existing statisticians, but it’s certainly a challenge. It’s forced us to change our organizational structure and our geo-locality. I’m sure it’s having similar effect on other organizations. I know I’m definitely pushing my son to consider a career in data science. That’s for certain.
Guy Nadivi: I think Bend, Oregon, also happens to be the location of the world’s last Blockbuster video. So maybe that might be another big draw for your recruiting efforts down there. John, I’m going to ask you to fire up your crystal ball and speculate about what kinds of features and benefits AIOps tools will provide three to five years from now.
John Gentry: Yeah, well in the state of the world right now and the rate of change that’s happening, I’m pretty guaranteed to get this at least mostly wrong, but I’ll certainly give it a shot. I think from a features perspective, the promise of AIOps has always been automate the mundane, the repetitive, the painful parts of IT operations to free up the people, the intelligence, the creativity to go innovate. And so I think you’ll see tighter coupling of the AI engine and the automation platforms and more closed loop approaches that can continue to self-learn or self-heal. I know we’ve talked about the self-healing data center. I think as you look three to five years out, they’re going to have to natively understand it is a hybrid cloud world. I think just applying AIOps to a private cloud data center, or just leveraging it for public cloud management is going to fall short of customer requirements to really manage workloads on-prem, off-prem, and transitory between the two locations. If I can proactively understand, based on business patterns, when it’s more cost efficient to run a massive, say, business close process on-prem, but during normal operations that’s better done in the cloud, I’m going to want to be able to automate that migration and that arbitration between workload placement between the two. So I think an inherent understanding of the trade-offs between on-prem and public cloud is going to be something that has to be there in the next three to five years. I think increasingly, as I mentioned before, cost is going to be an inherent input to that. I think the promise of cloud being cheaper has been fully debunked. And so now we really need to take a hard look at cost management and how that plays into the arbitration between workload placement. I think more and more simulation test capabilities. AIOps today is very much a reactive type of platform.
I think leveraging more advanced mathematics, things like I mentioned earlier, like Monte Carlo simulation to look at all the permutations of a configuration or workload placement to pick the optimal one based on a specific set of inputs, and then to go and test that theory to simulate that, and then validate before taking action. I think those are all capabilities that are going to be required, particularly as you start to move more and more toward humanless automation, right? So if I can actually now leverage AI to understand a pattern, to make a proactive recommendation around workload placement, and then actually go simulate that placement and validate the result, validate my AI, and then execute based on validation and then close loop monitor the outcome, now that’s a very sophisticated full stack, full life cycle approach that I think is going to be absolutely what you’re seeing in the winners out there and their capabilities in the next three to five years.
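The idea of using Monte Carlo simulation to compare every permutation of workload placement before acting can be sketched in a few lines. Everything here is hypothetical: the workload names, their demand distributions, the on-prem capacity, the cloud rate, and the rule that a placement is acceptable only if on-prem overloads in at most 1% of simulated trials. It illustrates the technique and nothing more.

```python
import itertools
import random

# Hypothetical workloads: (mean demand, standard deviation) in arbitrary units.
WORKLOADS = {"billing": (40, 5), "reporting": (30, 15), "web": (50, 10)}
ON_PREM_CAPACITY = 100  # units available on-prem, treated as already paid for
CLOUD_RATE = 1.0        # assumed cost per unit of demand served in the cloud

def evaluate(placement, trials=5000, seed=42):
    """Sample demand repeatedly and return
    (mean cloud cost per trial, fraction of trials in which on-prem overloads)."""
    rng = random.Random(seed)  # fixed seed keeps the sketch reproducible
    total_cost, overloads = 0.0, 0
    for _ in range(trials):
        on_prem_load = 0.0
        for name, site in placement.items():
            mu, sigma = WORKLOADS[name]
            demand = max(0.0, rng.gauss(mu, sigma))  # demand cannot be negative
            if site == "on_prem":
                on_prem_load += demand
            else:
                total_cost += demand * CLOUD_RATE
        if on_prem_load > ON_PREM_CAPACITY:
            overloads += 1
    return total_cost / trials, overloads / trials

def best_placement(max_overload=0.01):
    """Enumerate every on-prem/cloud assignment, discard those that overload
    too often, and return (cost, placement) for the cheapest survivor."""
    names = list(WORKLOADS)
    candidates = []
    for sites in itertools.product(["on_prem", "cloud"], repeat=len(names)):
        placement = dict(zip(names, sites))
        cost, overload = evaluate(placement)
        if overload <= max_overload:
            candidates.append((cost, placement))
    return min(candidates, key=lambda c: c[0])

cost, placement = best_placement()
print(placement, round(cost, 1))
```

With these invented inputs, the strict 1% limit keeps only the web workload on-prem. Loosening it to `best_placement(max_overload=0.05)` moves billing and reporting on-prem instead, because that cheaper combination now passes the risk check. A closed-loop system would then compare the simulated outcome with what actually happens after the move.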
Guy Nadivi: Interesting. John, for the CIOs, CTOs, and other IT executives listening in, what is the one big must-have piece of advice you’d like them to take away from our discussion with regards to implementing AIOps at their enterprise?
John Gentry: Well, I guess the first thing I would say would be don’t believe the hype. A lot of the marketing sounds the same. There’s an incredible perceived overlap in various providers and their capabilities. There are a lot of promises. For better or worse in IT, we’re used to vendors overpromising and underdelivering, and there’s that acceptability rate of somewhere between 70 and 80% actually being delivered. With AI, that’s not an acceptable ratio, right? Because now you’re talking about really automating and applying intelligence and such to your business. So really peel back the onion, get to the meat, ask the hard questions about exactly what data science is being applied and when and how. I was talking with my CEO about actually building a Hobbesian pyramid of AIOps to explain how the various mathematical approaches need to be layered on each other to get the various incremental returns. So I think really digging into the true capabilities beyond the marketing would be the big piece of advice. If I had a secondary piece of advice it would be really look at what problem you’re trying to solve. One of the most powerful questions I’ve had throughout my career, one that one of my early mentors instilled in me, is to always ask, what are you solving for? I think people get enamored by technology and they go out and throw technology at a problem that may not be the problem that needs to be solved. So, I would say, don’t believe the hype, do your homework, but also sit back and think hard about what is it you’re solving for, and then look for the approaches or the technologies, or even the vendors that seem to resonate in that being the outcome that they’re driving for as well.
Guy Nadivi: Excellent advice. Alright, looks like that’s all the time we have for this episode of Intelligent Automation Radio. John, your reputation as a fervent and articulate advocate for AIOps preceded you, and you definitely did not disappoint. I’ve really enjoyed absorbing your boots on the ground perspective today on the state of the market. Thank you very much for coming onto the podcast and sharing your views with us.
John Gentry: Well, always a pleasure, Guy. Really enjoyed your questions. Very insightful. They helped get to the root of the issue. So I’m very hopeful that your audience has benefited from this, and I look forward to collaborating with you again in the future.
Guy Nadivi: John Gentry, Chief Technology Officer and Senior Vice President at Virtana. Thank you for listening, everyone. And remember, don’t hesitate, automate.