In my previous blog, AIOps: Separating the Hype from Reality, I pointed out that AIOps is a recent term but not a new idea. I also suggested that the power of AIOps should not be dismissed just because some products branded as AIOps platforms contain little intelligence.

Undeniably useful

In the Gartner report “Market Guide to AIOps Platforms” (Prasad & Byrne, 2021), the authors state that “There is no future of IT Operations that does not include AIOps. This is due to the rapid growth in data volumes and the rate of change … that cannot wait on humans to derive insights”. The question of how vendors bring their solutions to market, in what one cynic has termed “the same tools with a layer of pixie dust over it”, does not invalidate the power or capability that innovative new AIOps solutions bring to IT Operations. Whether they are point solutions that address only a few use cases or broad platform-agnostic solutions that attempt to address most use cases, AIOps solutions have advanced our understanding of the IT environment and allowed us to resolve issues more quickly and intelligently, with reduced impact on our IT personnel.

While running the risk of repeating what many have already stated, it might be useful to summarise what AIOps is. AI is useful in many fields, as the current conversations about ChatGPT illustrate, and AI or machine-learning systems have the potential to play a major role in Infrastructure and Operations solutions. The reason is that AI can leverage the mass of institutional technology knowledge that humans already apply, but can reach conclusions faster by extending the scope of assessment to much larger datasets and many more domains. In short, AI can consume data at a rate that humans can’t and make sense of it at a speed that humans can’t. It also retains the knowledge and does not have to be retrained every time the specialist guru moves to another company or retires.

Consider the diagram below, which represents where AI can be, and is, applied in the IT Operations cycle. The basic cycle is to Measure, Engage, and Act. This cycle applies not only to IT: any effective governance process, whether in government, business, or civil society, contains these steps.

[Image: AIOps cycle]

Phase 1 - Measure

Monitoring takes many forms: police on the beat, police intelligence, measuring the profitability of a business. Even something as mundane as cooking a meal or losing weight involves monitoring. You measure the existing state and look for anomalies.

The police look for suspicious activity. They do this by considering the location, the people moving through it, and the risks associated with it. If, for example, a specific area is known for a high incidence of pickpocketing, the location is important. If a serious traffic offence such as a hit-and-run takes place, the police need to understand the layout of the streets to determine the best route of pursuit or where roads can be blocked. The parallel in the IT Ops world is monitoring: placing tools in the environment that measure specific metrics, collect logs, and discover the topology, to provide context for understanding the potential and impact of anomalies.

The police do not normally act unless somebody breaks the law or might break the law. They know what breaking the law looks like because they know the laws that govern their country, county, or city. The parallel is event generation: once a threshold has been set, an alert or event is generated whenever it is crossed, shouting out that a “law” has been broken.
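To make the "law" analogy concrete, here is a minimal sketch of threshold-based event generation in Python. The metric names, threshold values, and the `Event` type are all hypothetical, chosen purely for illustration; real monitoring tools define and configure these in their own ways.

```python
from dataclasses import dataclass


@dataclass
class Event:
    metric: str
    value: float
    threshold: float


# Hypothetical "laws": the limits a metric may not exceed.
THRESHOLDS = {"cpu_percent": 90.0, "disk_used_percent": 85.0}


def generate_events(sample: dict[str, float]) -> list[Event]:
    """Return one event for every metric whose value exceeds its threshold."""
    events = []
    for metric, value in sample.items():
        limit = THRESHOLDS.get(metric)
        # Metrics without a configured threshold are ignored.
        if limit is not None and value > limit:
            events.append(Event(metric, value, limit))
    return events


# One sample of measurements: only the CPU reading breaks its "law".
events = generate_events({"cpu_percent": 97.2, "disk_used_percent": 40.0})
```

A static threshold like this only reports a breach after it has happened, which is why it corresponds to the reactive side of policing.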

They also deploy police intelligence: people who listen out for signs that someone intends to break the law. This is normally part of crime prevention, a proactive measure in which police attempt to identify that a law might be broken. It requires an understanding of the context, the location, and the interrelationships between people and organisations to get to the root cause of the problem.

In the IT Operations world, this would be proactive analysis of the metrics: the metrics are studied against known behaviour to determine, ahead of time, that there is a probability that something will go wrong without intervention.

In the “Measure” phase of the IT Operations lifecycle, AI has been and is being used in several innovative ways across various products. One example is using past behaviour to determine what is normal and what is abnormal. In my previous blog I mentioned the development of ProactiveNet a decade ago, and many current monitoring toolsets use this technique to proactively identify potential outages. Topology mapping can also utilise AI to identify probable relationships between Configuration Items based on an understanding of existing deployed services.
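As a rough illustration of learning "normal" from past behaviour, the sketch below builds a baseline from historical samples and flags a new value that sits too many standard deviations away from it. This is a generic statistical approach, not a description of how ProactiveNet or any other product works; the function name and the default of three deviations are assumptions made for the example.

```python
from statistics import mean, stdev


def is_abnormal(history: list[float], value: float, max_deviations: float = 3.0) -> bool:
    """Flag a value that is far from the baseline learned from history.

    history needs at least two samples so that a spread can be computed.
    """
    baseline = mean(history)
    spread = stdev(history)
    if spread == 0:
        # History never varied, so any different value is unusual.
        return value != baseline
    return abs(value - baseline) / spread > max_deviations


# Hypothetical response times in milliseconds observed during normal operation.
history = [100.0, 102.0, 98.0, 101.0, 99.0, 100.0, 103.0, 97.0]
slightly_high = is_abnormal(history, 104.0)
far_too_high = is_abnormal(history, 150.0)
```

Unlike a fixed threshold, the limit here is derived from the data itself, so the same code adapts to a service that normally responds in 100 ms and one that normally responds in 2 seconds.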

In conclusion, the Measure phase of the IT Operations cycle is critical in identifying potential anomalies and their impact. With the help of AI, we can monitor the environment and proactively prevent potential outages faster and more intelligently. And remember, just because some products are branded as AIOps platforms doesn't mean they contain much intelligence. It's important to separate the hype from reality and evaluate each product carefully. That's all for now. Look out for my next blog, where we discuss the Engage phase and how AI can benefit IT Operations.