After a Year of Cloud Outages, IT Resilience Is a Boardroom Problem

“The October failures at AWS and Azure were not really about the cloud breaking. They were about how long it took everyone to recover.”

For roughly fifteen hours on October 20, 2025, a large slice of the internet simply stopped working. A DNS fault in a single Amazon Web Services region cascaded outward, and banking apps, airlines, retailers, and streaming services went dark across more than 60 countries. Outage trackers logged the scale in real time. Downdetector recorded over 17 million reports. Nine days later, on October 29, Microsoft Azure did much the same thing, with Heathrow, Vodafone, and Alaska Airlines among the names knocked offline. Two failures, nine days apart. Almost none of the companies caught in them had done anything wrong.

Here is what has actually changed. An outage used to be the IT department’s bad day. Now it is a line item the board has to account for. Splunk’s Hidden Costs of Downtime study, produced with Oxford Economics, puts the aggregate annual cost of unplanned downtime for the world’s 2,000 largest companies at 600 billion dollars, a 50 percent jump in two years. That comes to around 15,000 dollars a minute, and an average of 95 million dollars in lost revenue per company each year. Shareholders feel it too. The same research found that stock value drops an average of 3.4 percent after a single downtime event.

So the reaction after October was predictable. Boards asked whether they had grown too dependent on one cloud provider. It is a fair question, and the concentration of so much traffic in a single AWS region is a real design flaw. But it is the wrong place to start, because it treats the outage itself as the failure. The outage was never really the failure. The recovery time was.

What Actually Took So Long

Look at how those hours played out. The systems went down fast. What dragged was everything after. Status pages stayed green while customers flooded support lines. Engineers learned their own service was broken from social media before their monitoring told them. Independent monitoring firms documented the same pattern across the big 2025 outages: dashboards that lagged reality by 30 to 90 minutes, and teams that heard about failures from users first. Detection, the thing most monitoring tools are sold on, was only part of the bottleneck. The handoff from spotting a problem to resolving it was the rest.

That gap is where the money burns, and most of it is self-inflicted. A typical mid-sized IT team runs one tool for metrics, another for logs, a third for network flows, and a separate ticketing system bolted on top. When something breaks at 3 a.m., an engineer is alt-tabbing across five screens, trying to assemble a story the tools should have assembled already. Every minute spent correlating by hand is a minute the business keeps bleeding at that 15,000-dollar rate. You cannot stop someone else’s cloud from failing. How quickly your team sees it, understands it, and routes it to the right person is entirely yours to fix.

The Two Halves Nobody Connects

This is the case for unified observability, where metrics, logs, traces, and network flows live in one correlated view instead of five disconnected ones. A team that can see cause and effect on a single screen does not waste the first thirty minutes of an incident just working out what broke.
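
To make the idea of one correlated view concrete, here is a minimal Python sketch that merges separate event streams into a single time-ordered timeline. The event data, the source names, and the unified_timeline helper are all hypothetical illustrations, not any vendor's API, and the sketch assumes each stream is already sorted by timestamp.

```python
import heapq
from typing import Iterable, List, Tuple

# (ISO 8601 timestamp, source, message). All sample events are hypothetical.
Event = Tuple[str, str, str]

metrics: List[Event] = [
    ("2025-10-20T07:00:05", "metrics", "p99 latency above threshold"),
    ("2025-10-20T07:01:30", "metrics", "error rate above threshold"),
]
logs: List[Event] = [
    ("2025-10-20T07:00:02", "logs", "DNS lookup failed for db endpoint"),
]
flows: List[Event] = [
    ("2025-10-20T07:00:10", "flows", "traffic to region dropped sharply"),
]


def unified_timeline(*sources: Iterable[Event]) -> List[Event]:
    """Merge time-sorted event streams into one ordered timeline.

    Same-format ISO timestamps sort chronologically as plain strings,
    so tuple comparison orders the events by time.
    """
    return list(heapq.merge(*sources))


timeline = unified_timeline(metrics, logs, flows)
for timestamp, source, message in timeline:
    print(timestamp, source, message)
```

Read top to bottom, the merged timeline puts the failed DNS lookup ahead of the latency and traffic symptoms, which is the cause-and-effect ordering an engineer would otherwise rebuild by hand across several tools.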

The harder half, and the part most resilience conversations skip entirely, is what happens after the alert fires. An alert that lands in an inbox nobody is watching is barely better than no alert at all. The work of closing the loop, so that a detected anomaly opens the right ticket, reaches the on-call engineer who can act, and triggers a known remediation, is where recovery time actually gets compressed. Detection and resolution get treated as two separate purchases by two separate teams. The outage does not care about that boundary. The clock runs straight through it.
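
As a rough illustration of closing that loop, the Python sketch below turns one alert into a ticket, an owner, and a remediation step. Everything in it is an assumption made for the example: the service names, the on-call rota, the runbook entry, and the close_the_loop function are hypothetical and do not describe any particular product.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class Alert:
    service: str  # hypothetical service name
    symptom: str  # hypothetical symptom key


@dataclass
class Incident:
    alert: Alert
    ticket_id: str
    assignee: str
    actions: List[str] = field(default_factory=list)


# Hypothetical on-call rota: service -> engineer currently on call.
ON_CALL: Dict[str, str] = {"checkout-api": "priya", "auth": "marcus"}
FALLBACK_ON_CALL = "duty-manager"

# Hypothetical runbook: symptom -> function describing a known remediation.
RUNBOOK: Dict[str, Callable[[Alert], str]] = {
    "dns_resolution_failure": lambda a: f"fail over {a.service} to secondary resolver",
}


def close_the_loop(alert: Alert, next_ticket_number: int) -> Incident:
    """Turn one alert into a ticket, an owner, and a next action."""
    incident = Incident(
        alert=alert,
        ticket_id=f"INC-{next_ticket_number}",
        # Route to whoever is on call, never to an unwatched inbox.
        assignee=ON_CALL.get(alert.service, FALLBACK_ON_CALL),
    )
    remediation = RUNBOOK.get(alert.symptom)
    if remediation is not None:
        incident.actions.append(remediation(alert))
    else:
        incident.actions.append("escalate for manual diagnosis")
    return incident


incident = close_the_loop(Alert("checkout-api", "dns_resolution_failure"), 1042)
print(incident.ticket_id, incident.assignee, incident.actions)
```

The point of the sketch is the shape, not the details: detection, ticketing, routing, and remediation sit in one path, so no step waits for a person to copy information from one tool into another.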

What Regulators Already Understand

There is a second reason this has moved up to the board, and it is sharper in regulated industries. In banking, telecom, and healthcare, a prolonged outage is no longer just lost revenue. It is increasingly a reportable operational-risk event, with regulators across Europe, India, and the Gulf treating service availability as a supervised obligation rather than an internal IT metric. A retailer that goes dark for three hours loses a day of sales. A bank that goes dark for three hours may owe an explanation to a regulator, and sometimes a penalty. The stakes are not symmetrical, and the boards in those sectors already know it.

The Honest Limit

None of this prevents the next AWS region from going down. It will. Outages at this scale are now a predictable feature of cloud infrastructure, not a freak event, and no amount of tooling on your side changes what happens inside someone else’s data center. What you can change is the blast radius and the clock. The companies that come back in minutes rather than hours are not the ones with the most dashboards. Splunk’s researchers call them resilience leaders, and what separates them is faster recovery, not fewer incidents.

The CrowdStrike update in July 2024 took down an estimated 8.5 million Windows machines in an afternoon and, by industry estimates, cost Fortune 500 companies around 5.4 billion dollars. October 2025 showed the pattern had not shifted. The lesson the board should carry out of all of it is not which cloud to bet on. It is to treat recovery speed as a number the business measures, the way it measures revenue or churn. An outage you recover from in eight minutes is an operational footnote. The same outage at three hours is a quarter you have to explain. The difference between the two is a choice about how you build, and it is worth making before the next region goes dark.
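
One way to treat recovery speed as a measured number is sketched below in Python. It assumes each incident record carries three timestamps (started, detected, resolved) and applies the 15,000-dollars-a-minute figure cited above as a flat rate, which is a simplification. The timestamps and the recovery_report helper are hypothetical.

```python
from datetime import datetime

COST_PER_MINUTE_USD = 15_000  # per-minute figure cited in the article


def minutes_between(start: str, end: str) -> float:
    """Minutes elapsed between two ISO 8601 timestamps."""
    delta = datetime.fromisoformat(end) - datetime.fromisoformat(start)
    return delta.total_seconds() / 60


def recovery_report(started: str, detected: str, resolved: str) -> dict:
    """Summarize one incident as detection time, recovery time, and rough cost."""
    minutes_to_recover = minutes_between(started, resolved)
    return {
        "minutes_to_detect": minutes_between(started, detected),
        "minutes_to_recover": minutes_to_recover,
        # Flat-rate estimate; real losses vary by business and time of day.
        "estimated_cost_usd": minutes_to_recover * COST_PER_MINUTE_USD,
    }


# Hypothetical incidents: the same outage recovered in 8 minutes and in 3 hours.
fast = recovery_report("2025-10-20T07:00:00", "2025-10-20T07:02:00", "2025-10-20T07:08:00")
slow = recovery_report("2025-10-20T07:00:00", "2025-10-20T07:45:00", "2025-10-20T10:00:00")
print(fast)
print(slow)
```

At that flat rate the eight-minute recovery comes to 120,000 dollars and the three-hour one to 2.7 million dollars, a 22.5-fold difference for the same underlying outage.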

Source: FG Newswire
