Editor's note: This article originally appeared in the Wall Street Journal.
Organizations around the world recently had a stark reminder of what happens when computers and cloud providers stop working, sometimes at the same time. The outage occurred when a cybersecurity software provider’s update caused a leading operating system to crash, leaving airlines, health care providers and financial services companies among those that had to urgently find workarounds. This was followed by several other prominent cloud provider service disruptions over the next few weeks.
These disruptions should give corporate leaders pause for thought, especially where cloud services that rely on constant connectivity are concerned.
Cloud services have become the backbone of operations for many companies. As reliance on these services grows, so does the potential impact of operational disruption from outages or security breaches. Some business leaders are delaying moving their most important applications and operations to the cloud due to concerns about downtime and the unknown consequences of being offline.
Rather than miss the agility gains and significant cost savings that a shift to the cloud can deliver, senior IT decision-makers should focus on “what if” scenarios and business continuity planning before selecting a provider to host mission-critical applications in the cloud.
The real differentiator
“Reliability isn’t the first thing most cloud buyers think about,” says Misha Kuperman, Zscaler’s chief reliability officer. “Most cloud service buyers are primarily concerned with feature sets, but the real differentiator lies in what happens when things go wrong. This is not something that is usually advertised or touted by the sellers.”
For businesses, the stakes are high. An acceptable level of uptime for mission-critical services now often exceeds 99.999%, equating to roughly five minutes of downtime each year. Achieving this requires robust infrastructure as well as an unwavering focus on resilience in both culture and development priorities. Resilience, the ability to return to normal operations after a brief disruption, can be increased through preparation and planning, but leaders must also consider critical failure and disaster scenarios where operations are disrupted for days, weeks or longer.
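For context, the arithmetic behind these uptime figures is straightforward. A minimal sketch, assuming a non-leap year (the variable names are illustrative, not any provider's metric):

    # Convert an availability target into the downtime it permits per year.
    MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes in a non-leap year

    for availability in (0.999, 0.9999, 0.99999):
        allowed_downtime = MINUTES_PER_YEAR * (1 - availability)
        print(f"{availability:.3%} uptime allows about "
              f"{allowed_downtime:.1f} minutes of downtime per year")

At 99.999%, the budget works out to about 5.3 minutes a year; dropping a single nine, to 99.99%, multiplies it tenfold, to roughly 53 minutes.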
“Disaster recovery is what your customers do when you are no longer there,” Kuperman says. “It’s kind of like planning your own funeral. You’re no longer going to be around to help.” Recovering from a disaster requires preplanning and coordination between customer and provider, with the customer aware of the provider’s tools and positioned to use them. It is a shared responsibility model in which each party understands its role and knows exactly what to do if a force majeure event were to occur, including who declares the start of the event.
“The ability to declare a disaster, including the timing, the impact and who should be involved, is table stakes for any business,” Kuperman says. “Additionally, the regulatory environment around the world is starting to mandate resilience and readiness for such events.”
Reliability is key for DORA compliance
Starting in January 2025, financial entities and critical third-party technology service providers will be subject to the European Union’s Digital Operational Resilience Act (DORA), which requires resilience testing and oversight of providers. Under DORA, vendor risk assessments and security certifications alone are not enough.
For Ignacy Prystupa, IT service manager for zero trust at Raiffeisen Bank International (RBI), which has over 18 million customers across Central and Eastern Europe, testing is essential to ensure that none of the bank’s hundreds of third-party products, including Zscaler’s zero trust internet solutions, becomes a single point of failure.
“As a critical infrastructure provider, we face significant financial, regulatory and reputational risks if services go down,” Prystupa says. “Not just for us, but also for the customers we serve.” RBI regularly runs exercises on its critical infrastructure, including restore tests: deleting applications or services from test environments, measuring how long it takes to return to normal operations from backups, and assessing how much data loss the business can tolerate in an outage. “A lot of regulators are asking about this now,” he says.
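In practice, a restore test of this kind reduces to two measurements: how long recovery takes (the recovery time objective, or RTO) and how much recently written data is lost (the recovery point objective, or RPO). A minimal sketch of the bookkeeping, with hypothetical timestamps and thresholds rather than RBI's actual tooling:

    from datetime import datetime, timedelta

    # Illustrative thresholds; real values come from business-impact analysis.
    RTO = timedelta(hours=4)     # max acceptable time to restore service
    RPO = timedelta(minutes=15)  # max acceptable window of lost data

    last_backup      = datetime(2024, 9, 1, 11, 50)  # most recent good backup
    outage_start     = datetime(2024, 9, 1, 12, 0)   # service deleted from test environment
    service_restored = datetime(2024, 9, 1, 15, 30)  # normal operations confirmed

    recovery_time = service_restored - outage_start  # how long the restore took
    data_loss     = outage_start - last_backup       # data written since the last backup

    print(f"Recovery took {recovery_time}; RTO met: {recovery_time <= RTO}")
    print(f"Data-loss window {data_loss}; RPO met: {data_loss <= RPO}")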
“At the end of the day something’s going to go wrong,” Prystupa says. “It’s not a question of how or where, it’s a question of when. You really have to go down the rabbit hole to the absolute worst-case scenario.” Organizations need to plan not only for individual scenarios, but also for multiple scenarios unfolding at the same time.
Minimizing downtime
Kuperman says Zscaler’s top development priorities are security and reliability. “That goes over features, that goes over everything else.” As part of that effort, he says, the company has invested in building its own cloud and owning the infrastructure for its core offerings, so it controls all the underlying components, a degree of control that is not possible when building on someone else’s platform.
Zscaler builds enough scale into its operations to be unaffected by an outage at any single data center, and offers tools and automation that help customers minimize the downtime they are likely to experience. Kuperman says many of the company’s 7,500 customers see 100% uptime because they have implemented the resilience and reliability best practices, integrations and tools Zscaler provides.
All Zscaler updates go through rigorous testing and validation to avoid the kind of issue at the heart of the recent global outage, and are then released in a staggered manner. Customers control rollout policies, allowing them to test updates at their own pace.
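A staggered release of this kind is often implemented as rollout rings, where each wider ring receives an update only after the previous one has run it cleanly for a soak period. A minimal sketch of the pattern; the ring names, soak period and health check are hypothetical, not Zscaler's actual mechanism:

    import time

    ROLLOUT_RINGS = ["internal", "early-adopter customers", "general availability"]
    SOAK_SECONDS = 2  # illustrative; real soak periods run hours or days

    def healthy(ring: str) -> bool:
        # Placeholder: a real check would watch crash rates and error telemetry.
        return True

    def roll_out(update: str) -> None:
        for ring in ROLLOUT_RINGS:
            print(f"Deploying {update} to ring: {ring}")
            time.sleep(SOAK_SECONDS)  # let the update soak before widening
            if not healthy(ring):
                print(f"Regression detected in {ring}; halting rollout")
                return
        print(f"{update} fully released")

    roll_out("update-2024.09")

Gating each ring on the health of the one before it is what keeps a defective update from reaching every customer at once.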
Asking the hard questions
Kuperman recommends leaders ask themselves hard questions ahead of buying a cloud service. “Executives must look beyond uptime statistics; they should inquire about disaster recovery capabilities, compliance frameworks and the provider’s ability to scale rapidly in response to natural and human-made disasters as well as cyberattacks,” he says. Regular disaster recovery exercises that rehearse roles, responsibilities and communication plans, ideally run twice a year, are also essential.
“The true measure of a cloud provider isn’t just in their performance on sunny days,” Kuperman says, “but in their resilience when storms hit.”