Whatever mixed messages we get from cloud computing providers, we now have data suggesting that public cloud outages are getting worse. The Uptime Institute recently released its 2022 Outage Analysis report, which included such findings as "high outage rates remain an issue." Indeed, one in five organizations reported a "serious" or "severe" outage that resulted in significant financial losses, reputational damage, compliance breaches, or, in some severe cases, loss of life. The report concludes that there has been a slight upward trend in the prevalence of major outages over the past three years.
I’m usually not one to bust out the quotes, but this statement by Andy Lawrence of the Uptime Institute is worth mentioning: “The lack of improvement in overall outage rates is partly the result of the immensity of recent investment in digital infrastructure and all the associated complexity that operators face as they transition to hybrid, distributed architectures.”
Complexity is not a new challenge for IT. However, we recently created much more of it through rapid digital transformations and the wild rush to cloud and multicloud in response to the pandemic. These factors sharply increased the number of systems that support businesses. Many enterprises report that where they once supported about 500 cloud services for the entire enterprise, they now support about 3,000 services across a multicloud deployment.
These numbers suggest that the technology itself doesn't cause the outages; the problem is how the technology is used and how much of it is in use. As the report states, nearly 40% of organizations have suffered a major outage caused by human error, and 85% of those incidents stem from staff failing to follow procedures or from flaws in the processes and procedures themselves.
The root causes of complexity are well understood. There are many more moving parts to oversee in multicloud and cloud architectures and not enough money to quadruple operations staff. Cause, meet effect.
Why does this complexity happen in the first place? Much better tools are now available, from AIOps platforms to cross-cloud monitoring solutions, and they allow developers and innovators to leverage best-of-breed technologies to build and deploy business-changing systems. Developers can deploy the optimal choices for storage systems, AI services, compute, databases, etc., which may come from one or (more likely) many cloud providers.
The result is a complex and highly heterogeneous multicloud deployment that requires staff with specialized skills to operate effectively and limit the number of outages. Ironically, most IT organizations can't get approval for an increased ops budget because cloud computing promised to make operations less expensive.
What’s the solution?
As I’ve stated here a few times, abstraction and automation layers remove humans (and human errors) from the front and center of all operations processes. These layers also include tools for ops planning or replanning to optimize multicloud operations, which can take your operations game to the next level.
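To make the automation argument concrete, here is a minimal sketch (every name here is hypothetical, not tied to any vendor or product) of what an automation layer does at its simplest: it encodes the runbook a human would otherwise follow by hand, so a skipped or misremembered step, the root cause in most of those human-error outages, becomes impossible.

```python
from dataclasses import dataclass

@dataclass
class Service:
    """Stand-in for one service in a multicloud deployment (hypothetical)."""
    name: str
    provider: str
    healthy: bool = True
    restarts: int = 0

    def restart(self) -> None:
        # In a real system this would call the provider's management API.
        self.restarts += 1
        self.healthy = True

def remediate(services: list[Service]) -> list[str]:
    """Codified runbook: check every service, restart any unhealthy ones.

    Because the procedure is written down once as code, it runs the same
    way every time, with no on-call engineer to forget a step at 3 a.m.
    """
    actions = []
    for svc in services:
        if not svc.healthy:
            svc.restart()
            actions.append(f"restarted {svc.name} on {svc.provider}")
    return actions

fleet = [
    Service("billing-db", "cloud-a", healthy=False),
    Service("auth-api", "cloud-b"),
    Service("search-index", "cloud-c", healthy=False),
]
print(remediate(fleet))
```

The point isn't the restart logic; it's that the procedure lives in one reviewable, testable place instead of in someone's head.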
That brings us back to the original problem. Rebooting cloud and multicloud operations to incorporate abstraction and automation layers translates into more money and skills. Until enterprises reach a tipping point where the complexity costs more to manage than it does to directly address, we’ll see more outages.
It’s too bad that we must do damage just to understand how to avoid doing damage. Sadly, we’ve been here many times before.