A Retrospective of the Major Cloud Outages in 2025
A deep-dive analysis of the global cloud disruptions in October and November 2025. What really happened, and how can you prevent future impact?
Published on: Jan 23, 2026
Last updated on: Feb 10, 2026
Introduction
In the last few months of 2025, we saw several key cloud service providers fail with global impact. These very public failures highlighted how important these providers have become, but also the risks for organisations that rely on them. This is a retrospective on the three major outages that occurred between October and November 2025. We will look at the details of all three outages before exploring how organisations can be better prepared in the future.
AWS US-East-1 Regional Outage (October 20th 2025)
The incident was a latent DNS race condition that triggered a massive cascading failure across core services. The important thing to note here is that DNS got blamed, and that was what everyone latched on to initially. But blaming DNS in this instance is like blaming your house for burning down. The problem was related to DNS, yes, but it wasn’t DNS’s fault.
The outage lasted about 15 hours of core impact, but it took around 24 hours for everything to be fully restored. This was due to the long tail of services being systematically brought back: much of the time services spent down wasn't because of the underlying issue itself, but because they were being slowly and safely brought back online after the issue had been fixed.
113 AWS services were affected in the US-East-1 region, including DynamoDB, EC2, Lambda, and SQS. US-East-1 is AWS's oldest region and the primary home for many of its global services.
The basic root cause was a DNS race condition that accidentally deleted all DynamoDB endpoint addresses. It literally wiped out all the IP addresses in DNS related to DynamoDB. DynamoDB is a very scalable database. A number of AWS internal services use DynamoDB to store stateful information, and the moment DynamoDB was unavailable, all those other services started to experience issues as well.
Diving a little deeper into the actual root cause analysis, there were two separate systems: a DNS planner and a DNS enactor. Together they managed the IP addresses across an entire fleet of load balancers, the load balancers sitting behind the global DNS entry for DynamoDB.
Under normal circumstances, when updating IP addresses, the planner produces a plan that is passed over to the enactor, which picks it up and applies the IP address updates in that plan across all of the various load balancers. Given this is AWS, there is a lot of scale involved, which is why the two roles are separated. What happened in this instance is that one plan being enacted was very large and therefore very slow, and it didn't get processed quickly enough. At the same time, with exactly the same timestamp to the millisecond, another plan was enabled, and because the slow plan hadn't registered as succeeded or completed, this new plan got picked up as the active plan. The second, faster plan raced ahead, enabling and updating all of those IP addresses. When the slow plan finally finished, it reached its clean-up step, went 'ah, actually I need to get rid of all the unneeded plan entries', and registered the second, now-active plan as obsolete. It then deleted all of the IP addresses and removed them from the load balancers, which caused the massive cascading issue throughout all of the dependent services.
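To make the race a little more concrete, here is a minimal, hypothetical sketch in Python (not AWS's actual implementation) of how a stale plan's clean-up step can wipe out the entries a newer, already-active plan has installed:

```python
# Hypothetical sketch of the planner/enactor race, not AWS's real code.
# A newer plan applies its IP addresses first; the older, slower plan then
# finishes and its clean-up step deletes everything it doesn't know about.

dns_records = {}   # endpoint name -> set of IP addresses

def enact(plan: dict[str, set[str]]) -> None:
    # Apply the plan's IP addresses to the shared record set.
    for name, ips in plan.items():
        dns_records[name] = ips

def cleanup(plan: dict[str, set[str]]) -> None:
    # The flaw: clean-up trusts its *own* (stale) plan rather than the plan
    # that is currently active, so entries installed later look "obsolete".
    for name in list(dns_records):
        if name not in plan:
            del dns_records[name]

# The newer, faster plan wins the race and becomes active.
enact({"dynamodb.us-east-1": {"10.0.0.2", "10.0.0.3"}})

# The older, slower plan finally completes and runs its clean-up; it never
# contained the new endpoint entry, so it removes it.
cleanup({})

print(dns_records)   # {} -> the DynamoDB endpoint is left with no IP addresses
```

The real systems are of course far more sophisticated; the point is simply that a clean-up path which doesn't check what is currently active can undo newer work.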
The resulting issue was compounded by the fact that, with no IP addresses left in the system, the normal update path couldn't simply put them back; AWS had to rebuild everything from scratch, which extended the recovery. And because US-East-1 hosts a number of AWS's global control plane services, including internal ones, the impact spanned multiple other regions as well.
Azure Global Connectivity Failures (October 29th 2025)
The Azure incident was a global connectivity failure where the edge nodes refused incoming connections. The duration was a very specific 8 hours and 24 minutes.
Services affected were Azure Portal, Teams (which no one minded), Outlook (which again no one minded), Xbox Live (which people did mind) and Azure AD (which a lot of tech people minded).
The general root cause was that a configuration change to Azure Front Door exposed a latent bug that bypassed all their safety validation. Azure Front Door is Microsoft’s global content delivery network and application acceleration platform, essentially the front door to their services. The configuration change caused AFD nodes across their global fleet to fail to load properly, leading to widespread connection timeouts and errors.
Delving a little deeper, what happened is that an inadvertent configuration change was deployed to Azure Front Door. This change caused a significant number of AFD nodes to fail when they attempted to load the new configuration, but the catch for Azure was that the bad payload didn't cause an immediate failure: the failure only surfaced around 5 minutes after loading. The way it works within Azure, and a number of other providers, is that automatic deployments are processed in rings. Following a deployment, if everything works and is deemed healthy for a period of time, the change then rolls out incrementally through the rings.
Azure, as you'd imagine, pushes a lot of changes a day, so the window it waits before registering a change as healthy is relatively small, smaller than 5 minutes.
What happened is that this change got rolled out and distributed through the rings globally before the issue surfaced, and the failures then cascaded all the way through to the edge. To compound the issue, when it came to recovery, the last known "good" backup had also been corrupted by this config change, because, again, the change had been registered as good. That meant they couldn't fail back automatically as easily, which expanded the outage window we saw.
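To illustrate why this is dangerous, here is a simplified, hypothetical sketch of a ring-based rollout gate (not Azure's actual pipeline); it simulates time rather than sleeping, and the ring names and timings are made up:

```python
# Hypothetical sketch of a ring-based rollout gate, not Azure's real pipeline.
# It simulates time to show why a health "bake time" shorter than a config's
# failure delay lets a bad change be promoted through every ring.

BAKE_TIME = 120       # seconds each ring is observed before promotion
FAILURE_DELAY = 300   # seconds the bad config takes to start failing after load

def rollout(rings):
    clock = 0
    for ring in rings:
        deployed_at = clock
        clock += BAKE_TIME                           # wait out the bake window
        failing = clock - deployed_at >= FAILURE_DELAY
        if failing:
            print(f"{ring}: unhealthy, halting rollout")
            return
        print(f"{ring}: looked healthy after {BAKE_TIME}s, promoting")

rollout(["canary", "ring-1", "ring-2", "global-edge"])
# Every ring reports healthy, because each one is only observed for 120s while
# the config doesn't misbehave until 300s in; the bad change reaches the edge.
```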
Whilst DNS issues were visible during the incident, as AFD nodes experienced DNS resolution problems when they crashed, the root cause was the configuration change to Azure Front Door itself. This demonstrates how symptoms can often be mistaken for root causes during complex outages.
Cloudflare Bot Management and Core Proxy Services (November 18th 2025)
The Cloudflare incident occurred due to connectivity issues characterised by oscillating availability. You would have noticed during the incident that pages were going up and down every five minutes. If you had an alert system, you would have seen this and got massive alert fatigue because you were just getting spammed by your services going up and down.
This took about six hours in total, with three hours of major impact. The services affected were CDN sites, Workers, and all of the main ingress services for Cloudflare. In the current IT landscape, Cloudflare sits in front of a lot of other infrastructure. You can have infrastructure hosted on AWS, on Azure, or on prem, but a lot of people put Cloudflare in front of it because it offers very cost-effective denial-of-service protection, and it has become a bit of a standard component to place in front of whatever architecture sits behind it. This widespread adoption is why we saw such a major spread in the impact.
The root cause was a database permission change that doubled the config size.
What Cloudflare attempted was a routine security improvement to database permissions, but it had an unexpected side effect that changed what an existing query returned.
As part of their bot-detection system, a particular file is generated every 5 minutes containing a bunch of metadata. Upon this file being generated, it is fed into their bot management systems, allowing them to detect shifts, trends, and attacks from bots.
What happened with this changed query was that it essentially doubled the size of this file. The file now exceeded a hard-coded limit of 200 features, a limit that had been set comfortably above what they ever expected the file to contain. The oversized file was then distributed across their systems, causing the software that runs the ML-based bot protection to crash repeatedly. With that module down, incoming connections couldn't be validated against the bot model, so they weren't allowed through.
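A simplified analogue of that failure mode in Python (Cloudflare's proxy is not written in Python, and the feature counts below are purely illustrative) looks something like this:

```python
# Simplified analogue of a pre-allocated feature limit; not Cloudflare's
# actual proxy code, and the feature counts are illustrative only.
MAX_FEATURES = 200   # hard-coded capacity, sized above the expected file size

def load_feature_file(features: list[str]) -> list[str]:
    if len(features) > MAX_FEATURES:
        # In the real incident the equivalent error path made the bot
        # management module crash, taking connection validation with it.
        raise RuntimeError(
            f"feature file has {len(features)} entries, limit is {MAX_FEATURES}"
        )
    return features

normal = [f"feature_{i}" for i in range(150)]   # the expected file
duplicated = normal * 2                         # duplicate rows double the file

load_feature_file(normal)        # loads fine
load_feature_file(duplicated)    # raises, and keeps raising on every 5-minute reload
```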
The oscillation came from the fact that it was a rolling database update. If the automatic 5-minute cron job happened to run against a database node that had already been updated, it would send out a broken config; if, on the next 5-minute increment, it happened to hit one of the not-yet-updated nodes, it would send out a good config and the systems would come back up. This really confused the diagnosis, because the failures arrived in waves and it was hard to determine whether it was an attack.
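A tiny, hypothetical simulation of that oscillation (the replica layout and timings are made up, and the sketch cycles through replicas deterministically where the real routing was effectively random):

```python
# Hypothetical simulation of the oscillation; replica layout and timings are
# made up, and the job cycles through replicas deterministically here where
# the real routing was effectively random.
replicas = ["updated", "not-updated", "updated", "not-updated", "not-updated"]

for cycle in range(8):                         # eight 5-minute generation cycles
    source = replicas[cycle % len(replicas)]   # which database node the job hits
    config_is_good = source == "not-updated"
    state = "proxy healthy" if config_is_good else "proxy crashing"
    print(f"t={cycle * 5:>2}min  config built from {source:<12} -> {state}")

# The fleet flips between healthy and crashing across cycles, which is
# exactly the up-and-down pattern people saw (and were paged for).
```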
When Cloudflare realised what was happening and that the timing was due to the doubling of this file, they were able to roll back the query change, which then slowly started to roll out to the services.
How can your Organisation be better Prepared?
What can be done before an outage?
One of the key things we would advise organisations to look at is hidden dependencies. What took down a number of systems, a number of sites, and a number of corporations wasn’t the public infrastructure. The issues faced were with the infrastructure that supports their infrastructure.
For instance, Docker Hub was down, preventing teams from pulling container images for new deployments. There were issues with alerting systems: the very tools set up for you to monitor and respond when you encounter issues were down because they sat behind Cloudflare. And many third-party systems that normally tie into your CI/CD pipelines and supply chains were unavailable.
So, we encourage you to ask your supplier whether the service you're buying is regional or global. Should you expect it to remain available during a regional outage? What exactly does the SLA cover, and what availability does it actually promise? The next key topic to examine as an organisation is static versus dynamic failover. We all know that multi-cloud resilience costs money, so the question to ask execs and sponsors is: what is the business impact of a 4-hour outage, and is investing in resiliency cheaper than that impact?
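As a purely illustrative back-of-the-envelope calculation (every number below is hypothetical; plug in your own revenue, outage frequency and resilience costs):

```python
# Back-of-the-envelope comparison with entirely hypothetical numbers; replace
# them with your own revenue, expected outage frequency and resilience costs.
revenue_per_hour = 50_000          # what an hour of downtime costs the business
outage_hours = 4                   # the outage length you are planning for
expected_outages_per_year = 2      # how often you expect an event of this size

expected_annual_impact = revenue_per_hour * outage_hours * expected_outages_per_year
annual_cost_of_failover = 150_000  # e.g. a warm standby or light static site

print(f"Expected annual outage impact: {expected_annual_impact:,}")
print(f"Annual cost of failover option: {annual_cost_of_failover:,}")
print("Resiliency pays for itself" if annual_cost_of_failover < expected_annual_impact
      else "Accepting the outage is cheaper")
```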
Another alternative to consider is a static or low-fidelity light failover site. When we worked at Comic Relief, one of the primary things we put in place was a light version of the main Comic Relief page for the TV night. Unfortunately, they moved to a new provider and, on the televised performance, there were issues, so we had to switch to the light page. It was hosted on completely different infrastructure, with a completely different pathway, but most importantly, it still did 80% of the jobs deemed necessary for that night.
We would strongly encourage organisations to look at what their equivalent of that would be. If you're an organisation with only a static page, you can survive the 4 hours. If you run services, however, what is the minimum set of services you need to run on an alternative provider to keep your revenue streams going?
What can be done during an outage?
The actions taken during an outage are critical. One of the key things we see again and again is this idea of "time to decide". You've designed, implemented and paid for a designated failover system, and your team has expended a lot of time and effort, but you don't fail over because you believe "they're going to fix it soon", "it's going to be resolved in a minute", "they're promising me it's going to be done in 30 minutes", "why would I fail over, because then I take on the problem and the work involved in coming back".
You need to take the decision out of human hands. The decision needs to be pre-made, and you do this by agreeing a pre-defined time-to-decide metric for your business. When the next event happens and that metric is hit during the downtime, that's when you fail over. The failover happens regardless of the promised recovery time, and regardless of what "person A" decides on the day.
For example, if you have 15 minutes of a 25% failure rate, then you fail over. It’s very important to clearly define it as a business metric. It’s not an individual’s decision anymore, it instead becomes a business decision that’s signed off in advance.
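As a minimal sketch of what such a pre-agreed trigger could look like (the 25% threshold, the 15-minute window and the monitoring/failover hooks are placeholders, not a prescribed implementation):

```python
# Minimal sketch of a pre-agreed "time to decide" trigger; the threshold,
# window and hooks are placeholders for your own business metric and tooling.
from collections import deque
import time

FAILURE_RATE_THRESHOLD = 0.25      # 25% of requests failing...
WINDOW_SECONDS = 15 * 60           # ...sustained for 15 minutes

samples: deque[tuple[float, float]] = deque()   # (timestamp, failure_rate)

def should_fail_over(failure_rate: float, now: float | None = None) -> bool:
    """Record the latest failure rate; return True once the metric is breached."""
    now = time.time() if now is None else now
    samples.append((now, failure_rate))
    while samples and samples[0][0] < now - WINDOW_SECONDS:
        samples.popleft()                        # drop samples outside the window
    window_covered = now - samples[0][0] >= WINDOW_SECONDS - 60
    # The decision is pre-made: if every sample across the full window breached
    # the threshold, fail over regardless of the vendor's promised fix time.
    return window_covered and all(r >= FAILURE_RATE_THRESHOLD for _, r in samples)

# e.g. called once a minute from your monitoring:
#   if should_fail_over(current_failure_rate):
#       trigger_failover()   # hypothetical hook into your DNS/traffic tooling
```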
The second point to take note of during the outage is to avoid retry storms. The decision has been made to fail over, but how do you get back safely? A number of the issues we've seen come from people enabling these static or light sites without a way of gracefully bringing traffic back, which invites a retry storm. We strongly advise that you invest in systems which allow percentage-based failover. There are many patterns you can use to achieve this, but they will be very specific to your setting, so do the relevant research and make the necessary investment.
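The right mechanism depends on your stack (weighted DNS, a load balancer, a service mesh), but as a hypothetical illustration of ramping traffic back in percentage steps rather than all at once:

```python
# Hypothetical illustration of percentage-based fail-back; the health check and
# routing hook are placeholders for your own monitoring and weighted DNS,
# load balancer, or service mesh API.
import time

RAMP_STEPS = [5, 10, 25, 50, 100]   # percent of traffic returned to the primary
OBSERVATION_SECONDS = 300           # watch each step before increasing it

def primary_is_healthy() -> bool:
    return True                     # placeholder: query your own error-rate metrics

def set_primary_weight(percent: int) -> None:
    # Placeholder: call your DNS provider, load balancer or mesh API here.
    print(f"routing {percent}% of traffic to primary, {100 - percent}% to failover")

def ramp_back() -> None:
    for percent in RAMP_STEPS:
        set_primary_weight(percent)
        time.sleep(OBSERVATION_SECONDS)   # let retries and caches settle
        if not primary_is_healthy():
            set_primary_weight(0)         # back off rather than storm the primary
            return
    print("fully failed back")

ramp_back()
```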
What can be done after an outage?
One of the things that we’ve spoken about again and again is this idea of options. We’ve previously been on panels and presented the idea of a multi-cloud strategy. For some people, they’ve assumed that we’re saying you must always run your software across multiple clouds, multiple providers and multiple services 100% of the time. That’s not the case.
At LiveWyer, we push the architectural principle that you should give yourself the option to do these things. If you engineer things according to this principle, then you allow yourself options further down the line.
Ask yourselves the following questions: is it cloud-native or cloud-agnostic? Do you have an option to move to alternative services? Is your workload tightly coupled to the underlying cloud services being used? What we tell our customers is to invest in developing to open standards and to abstract where possible, wherever a provider offers it.
One of the really big wins here is OpenTelemetry. A number of organisations had hard-coded all of their telemetry, monitoring, and log data into one very specific telemetry provider. In that scenario, when it comes to end-of-year negotiations, they know they don't really have a leg to stand on. Whilst a number of organisations pushed back hard on the introduction of open telemetry standards, those standards let them store and transmit data, and interact with services, in an open way, so that they can shift their data to other providers if needed.
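As a small example of what that looks like in practice, the OpenTelemetry SDKs let an application export traces over the vendor-neutral OTLP protocol, so moving backends is largely a matter of repointing an endpoint; the collector address and service name below are placeholders:

```python
# Minimal OpenTelemetry tracing setup exporting over vendor-neutral OTLP.
# The collector endpoint and service name are placeholders; point the endpoint
# at whichever backend you choose, without changing application code.
from opentelemetry import trace
from opentelemetry.sdk.resources import Resource
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor
from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import OTLPSpanExporter

provider = TracerProvider(resource=Resource.create({"service.name": "checkout"}))
provider.add_span_processor(
    BatchSpanProcessor(OTLPSpanExporter(endpoint="http://otel-collector:4317", insecure=True))
)
trace.set_tracer_provider(provider)

tracer = trace.get_tracer(__name__)
with tracer.start_as_current_span("process-order"):
    pass  # your business logic; spans go wherever the OTLP endpoint points
```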
There are many examples like this, not just with observability, and we think companies should consider them going forward.
We often talk to companies that insist they are not moving away from AWS, and that is fine. However, what about the third-party providers that did let you down? What about some of the services which let you down? Can you move away from them? Are you tied to them? That is the question we will leave you with, and one we encourage you to keep asking going forward.
Need help making your infrastructure more resilient?
At LiveWyer, we have guided numerous organisations through complex Kubernetes infrastructure transitions, helping them evaluate options, plan migration strategies, and execute changes without service disruption.
Contact us to learn how we can help you navigate strengthening your infrastructure for the future.
