
When the Cloud Goes Dark: How to Survive a Major Cloud Vendor Outage

by Gary Worthington, More Than Monkeys

On 20 October 2025, a large chunk of the internet went offline when AWS had a serious outage in its US-East-1 region. What began as a fault in DNS resolution and internal control systems quickly cascaded, taking out everything from social apps and gaming services to banking platforms and IoT devices. The incident reminded us that even the biggest cloud vendors can suffer catastrophic failure.

So the real question is not if this will happen again but how ready you are when it does.

Here are ten practical strategies to reduce the blast radius of a cloud-vendor outage and keep your business running.

1. Know what you actually depend on

Most engineering teams don’t have a full map of which regions, availability zones (AZs), vendor services and managed components their system depends on. Many start out in the vendor’s default region (US-East-1 for AWS) and only later realise they’ve built themselves into a single point of failure.

You should:

  • Inventory every subsystem and map it to its region and AZ. For example: “our authentication service runs in us-east-1a, uses DynamoDB in us-east-1, and our message processing service uses SQS in us-east-1”.
  • Identify the managed vendor services your business cannot function without. For example: IAM control-plane, global DNS, managed DB services.
  • Ask: “if this region vanished for 24 hours, what breaks?” If you cannot answer that with confidence, you have work ahead.
  • Recognise that some services appear “global” but in fact rely on a control-plane or endpoint in US-East-1. For example, in AWS the control-plane for IAM and Organizations is hosted in US-East-1.

This clarity gives you a starting point to rationalise risk: you can only protect what you know you’ve got.
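
If you want that inventory to be more than a stale spreadsheet, you can script a first pass against your account. Below is a minimal sketch using boto3; the region list and the handful of services queried are assumptions you would adapt to your own estate.

    # dependency_inventory.py - rough per-region resource inventory (sketch).
    # Assumes boto3 is installed and AWS credentials are configured; the regions
    # and services listed here are illustrative, not exhaustive.
    import boto3

    REGIONS = ["us-east-1", "eu-west-1"]  # adapt to the regions you believe you use

    def inventory(region):
        """Count a few common resource types in one region."""
        ec2 = boto3.client("ec2", region_name=region)
        dynamodb = boto3.client("dynamodb", region_name=region)
        sqs = boto3.client("sqs", region_name=region)

        instances = sum(
            len(r["Instances"]) for r in ec2.describe_instances()["Reservations"]
        )
        return {
            "region": region,
            "ec2_instances": instances,
            "dynamodb_tables": dynamodb.list_tables().get("TableNames", []),
            "sqs_queues": sqs.list_queues().get("QueueUrls", []),
        }

    if __name__ == "__main__":
        for region in REGIONS:
            print(inventory(region))

Even a crude script like this makes it obvious when everything you own turns out to be sitting in a single region.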

2. Build for more than one region

Having multiple AZs in one region is good. Having multiple regions is much better. If you only run in one region and that region fails, you have no fallback. In October’s outage, many services went down because they had no real cross-region replicas or failover paths out of US-East-1.

Here’s what to think about:

  • Identify the business-critical services (e.g., payment systems, authentication, customer API) and deploy them across at least two different geographic regions (for example us-east-1 + eu-west-1).
  • Use cross-region replication so when region A goes dark, region B has up-to-date data. But you’ll need to balance cost, latency and eventual consistency trade-offs.
  • Actually test your fail-over path and make sure it works, including DNS switching, traffic redirection and handling of replication lag. Planning it on paper is not enough.
  • For managed vendor services, check whether they support region fail-over and whether you’re using them in a region that supports fail-over. Some are region-bound and may not be seamlessly portable.

By designing for multi-region resilience you substantially reduce the chance that a single region failure becomes a full outage.
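
One concrete piece of that fail-over path is DNS. The sketch below sets up active-passive failover records in Route 53 with boto3; the hosted zone ID, domain names and health check ID are placeholders, and the specifics will differ if you use alias records or another DNS provider.

    # route53_failover.py - active-passive DNS failover (sketch, placeholder IDs).
    import boto3

    route53 = boto3.client("route53")

    HOSTED_ZONE_ID = "Z0000000000EXAMPLE"            # placeholder
    PRIMARY_HEALTH_CHECK_ID = "1111-2222-3333-4444"  # placeholder

    def change(identifier, role, target, health_check_id=None):
        record = {
            "Name": "api.example.com",
            "Type": "CNAME",
            "SetIdentifier": identifier,
            "Failover": role,             # "PRIMARY" or "SECONDARY"
            "TTL": 60,                    # keep TTL low so failover happens quickly
            "ResourceRecords": [{"Value": target}],
        }
        if health_check_id:
            record["HealthCheckId"] = health_check_id
        return {"Action": "UPSERT", "ResourceRecordSet": record}

    route53.change_resource_record_sets(
        HostedZoneId=HOSTED_ZONE_ID,
        ChangeBatch={"Changes": [
            change("use1", "PRIMARY", "api-us-east-1.example.com", PRIMARY_HEALTH_CHECK_ID),
            change("euw1", "SECONDARY", "api-eu-west-1.example.com"),
        ]},
    )

The design choice here is active-passive: the secondary only takes traffic when the primary’s health check fails, which is simpler and cheaper than active-active but means the secondary must be kept warm enough to absorb the load.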

3. Stop locking yourself to one vendor

Going “all-in” on a vendor’s managed services is tempting: they’re easy, integrated and fast to launch. But that convenience comes at the cost of vendor dependency and reduced flexibility during a failure. In the October outage we saw the domino effect: one vendor’s regional outage hit many downstream systems.

Consider these approaches:

  • Identify your “must survive” services and evaluate whether you could run them in a second vendor (for example, dual-cloud with AWS and Azure or GCP) or alternatively have a self-hosted fallback.
  • Abstract vendor-specific APIs behind your own interface so you’re not locked in. That way, if vendor A has issues, you can switch to vendor B or self-hosted with minimal rework.
  • Recognise that even if you replicate across regions within the same vendor, you may still be vulnerable if the control-plane for a “global service” is in a specific region (e.g., IAM in US-East-1 for AWS).
  • Build your system such that switching or fallback isn’t a massive rewrite — you’ll thank yourself when the outage comes.

In short: spreading risk across architectures and vendors buys you optionality when things go wrong.
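
The point about abstracting vendor-specific APIs is easiest to see in code. Here is a small sketch of an object-storage interface with an S3-backed implementation and a local-disk fallback; the class and method names are ours, not any standard library.

    # storage.py - hiding object storage behind your own interface (sketch).
    from abc import ABC, abstractmethod
    from pathlib import Path
    import boto3

    class ObjectStore(ABC):
        @abstractmethod
        def put(self, key: str, data: bytes) -> None: ...

        @abstractmethod
        def get(self, key: str) -> bytes: ...

    class S3Store(ObjectStore):
        """Primary implementation; assumes boto3 and valid AWS credentials."""
        def __init__(self, bucket: str, region: str):
            self._s3 = boto3.client("s3", region_name=region)
            self._bucket = bucket

        def put(self, key: str, data: bytes) -> None:
            self._s3.put_object(Bucket=self._bucket, Key=key, Body=data)

        def get(self, key: str) -> bytes:
            return self._s3.get_object(Bucket=self._bucket, Key=key)["Body"].read()

    class LocalStore(ObjectStore):
        """Crude fallback: enough to keep core flows alive during an outage."""
        def __init__(self, root: str):
            self._root = Path(root)
            self._root.mkdir(parents=True, exist_ok=True)

        def put(self, key: str, data: bytes) -> None:
            (self._root / key).write_bytes(data)

        def get(self, key: str) -> bytes:
            return (self._root / key).read_bytes()

Application code depends only on ObjectStore, so swapping S3 for another vendor’s storage, or for the local fallback, becomes a configuration change rather than a rewrite.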

4. Design systems to fail gracefully

Let’s face it — failures will happen. The aim is not to pretend they won’t, but to ensure when they do they don’t ruin the user experience entirely. You want containment and graceful degradation, not full meltdown.

Here’s how that looks in practice:

  • Design your system so non-critical features can be disabled, while core features remain usable. For instance: switch to read-only mode rather than full write access when a DB is offline.
  • Use patterns like circuit-breaker and bulkheads: if service A is failing, it doesn’t drag down service B.
  • Implement caching and queuing: if the managed vendor DB is unresponsive, queue requests or serve stale data while you recover.
  • Provide user messaging: if you know some part of the workflow is affected by the vendor outage, tell users: “We’re currently operating in degraded mode due to an external vendor issue; some functions are unavailable.” That transparency builds trust.
  • Test the degraded mode frequently — so when it’s needed you’re not fumbling.

When you expect failure and plan for it, you’ll feel far less surprised when it happens.
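
To make the circuit-breaker idea concrete, here is a deliberately minimal sketch; the thresholds and timings are arbitrary, and in production you would more likely reach for an established library than roll your own.

    # circuit_breaker.py - minimal circuit breaker (sketch, arbitrary thresholds).
    import time

    class CircuitOpenError(Exception):
        """Raised when the breaker is open and the dependency should not be called."""

    class CircuitBreaker:
        def __init__(self, failure_threshold=5, reset_timeout=30.0):
            self.failure_threshold = failure_threshold
            self.reset_timeout = reset_timeout
            self.failures = 0
            self.opened_at = None

        def call(self, func, *args, **kwargs):
            # While open, refuse calls until the reset timeout has passed.
            if self.opened_at is not None:
                if time.monotonic() - self.opened_at < self.reset_timeout:
                    raise CircuitOpenError("dependency unavailable, use fallback")
                self.opened_at = None  # half-open: allow a single trial call

            try:
                result = func(*args, **kwargs)
            except Exception:
                self.failures += 1
                if self.failures >= self.failure_threshold:
                    self.opened_at = time.monotonic()
                raise
            self.failures = 0
            return result

Callers wrap the vendor call in breaker.call(...) and catch CircuitOpenError to serve cached data or a read-only view instead of hammering a dependency that is already down.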

5. Monitor everything and communicate early

You cannot fix what you can’t detect. During a vendor-region outage you may lose the usual metrics or dashboards themselves — so your monitoring plan needs to anticipate that. Also your communication strategy must be ready for external vendor failure, not just your own code bugs.

Key pieces:

  • Monitor vendor status pages (e.g., AWS Service Health Dashboard), but rely more on your own user-facing telemetry: error rates, latency, throughput, traffic patterns.
  • Instrument your application at multiple layers: vendor control-plane APIs, your own compute resources, and end-user symptoms.
  • When an incident hits, act quickly: internal war-room, business stakeholders, customer updates. Acknowledge the issue, explain what you’re doing, don’t wait until you have full answers.
  • After the incident is done, hold a post-mortem: what happened, why did it take so long, what improvement actions come out of it? Document and update your playbooks accordingly.
  • Communicate externally too: if the reason for slowness is a vendor outage, tell customers what you know and when you expect recovery. Silence or obfuscation harms trust.

In a vendor-outage scenario you are not just an engineer — you’re a responder, a communicator and a business protector.
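
Vendor status pages tend to lag reality, so your own user-facing probes matter. Below is a hedged sketch of a synthetic check; the endpoint URLs and the alert() hook are placeholders for whatever your stack actually uses.

    # synthetic_check.py - external availability probe (sketch, placeholder URLs).
    import urllib.error
    import urllib.request

    ENDPOINTS = [
        "https://api.example.com/health",
        "https://app.example.com/login",
    ]

    def alert(message):
        # Placeholder: push to PagerDuty, Slack, SMS, whatever you actually use.
        print(f"ALERT: {message}")

    def probe(url, timeout=5.0):
        try:
            with urllib.request.urlopen(url, timeout=timeout) as resp:
                return 200 <= resp.status < 300
        except (urllib.error.URLError, TimeoutError):
            return False

    if __name__ == "__main__":
        for url in ENDPOINTS:
            if not probe(url):
                alert(f"{url} failing from external probe; check your own telemetry and the vendor status page")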

6. Know what your SLA really covers

It’s tempting to assume a cloud vendor’s SLA will cover you fully. But the credit you receive for downtime is often a small consolation after revenue loss, brand damage or regulatory impact. The October AWS outage highlighted this risk in a big way.

What to look for:

  • Read the fine print: What does “availability” mean for that service? Are they excluding regional failures or only single-AZ issues?
  • Check the vendor’s liability limits: often they cap compensation to service credits, not business loss.
  • Reflect on your own costs: lost revenue, customer churn, reputational risk, regulatory fines — those may dwarf any vendor credit.
  • If you’re building on top of someone else’s platform, remember your SLA to your customers is only as strong as that platform’s reliability.
  • Consider business continuity insurance for cloud-vendor outages if the value at risk is high.

This is about turning vendor marketing promises into a realistic risk-mitigation posture.
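
A quick back-of-the-envelope calculation shows why a service credit is cold comfort. The figures below are invented purely for illustration; plug in your own bill, outage duration and hourly revenue.

    # sla_math.py - illustrative numbers only.
    monthly_cloud_bill = 20_000   # hypothetical monthly spend
    sla_credit_rate = 0.10        # typical credit tier for a breached availability target
    outage_hours = 8
    revenue_per_hour = 5_000      # hypothetical revenue at risk per hour

    credit = monthly_cloud_bill * sla_credit_rate
    lost_revenue = outage_hours * revenue_per_hour

    print(f"SLA credit:   {credit:,.0f}")        # 2,000
    print(f"Lost revenue: {lost_revenue:,.0f}")  # 40,000, before churn and reputational cost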

7. Practise for failure before it happens

You don’t want your first “big test” to be a live disaster. You need to practise your worst-case scenario. Industry leaders frequently run chaos-engineering drills, fault-injection exercises and war games.

Here’s how to build the muscle:

  • Simulate a region failure: deliberately shut down resources or take a region offline (in a non-production environment) and test fail-over, data replication and recovery.
  • Kill dependencies: block access to a managed vendor service (for example, DynamoDB in one region) and validate your fallback logic.
  • Restore backups: practise retrieving data from cold backups, alternate regions or a second vendor. Know your time-to-recover.
  • Table-top exercise: get your ops, engineering, communications and business teams together and walk through the scenario: “US-East-1 region lost for 12 hours.” Who does what?
  • Update your contingency playbooks based on what you learn during practice.

When you practise failure, it becomes far less frightening when reality hits.
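
One lightweight way to practise “kill dependencies” is in your test suite: stub the vendor client so it fails and assert that the fallback path takes over. The sketch below uses pytest’s monkeypatch fixture and assumes a hypothetical profiles module whose get_profile() is supposed to fall back to a local cache when DynamoDB is unreachable.

    # test_fallback.py - dependency-kill test (sketch; the profiles module is hypothetical).
    from botocore.exceptions import EndpointConnectionError

    import profiles  # hypothetical app code: reads DynamoDB, falls back to a cache

    def test_profile_served_from_cache_when_dynamodb_is_down(monkeypatch):
        # Seed the fallback cache so degraded mode has something to serve.
        profiles.cache["user-123"] = {"name": "Ada"}

        def broken_dynamodb_get(*args, **kwargs):
            raise EndpointConnectionError(
                endpoint_url="https://dynamodb.us-east-1.amazonaws.com"
            )

        # Simulate the regional outage: every DynamoDB read now fails.
        monkeypatch.setattr(profiles, "dynamodb_get_item", broken_dynamodb_get)

        assert profiles.get_profile("user-123") == {"name": "Ada"}  # degraded but correct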

8. Keep backups that can stand on their own

When your vendor has a major failure, you don’t want your backups, replication or control-plane stuck behind the same outage. You need backups that are independent and recoverable.

Consider:

  • Multi-region backups: store snapshots, exports, data in another region or even another vendor entirely.
  • Check recoverability: backups are only useful if you’ve tested restoring them. Verify encryption keys, IAM permissions, secrets and configuration are all intact.
  • Avoid the “single-region trap”: some vendor services (especially global control planes) are anchored in US-East-1 (e.g., IAM).
  • Document and practise your “cold start” scenario: how fast can you bring up a minimal version of the system in another region or vendor?
  • Make sure your backup plan covers all critical dependencies (networking, DNS, authentication, DB, queues).

A backup you cannot use is no better than no backup at all.
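
As one example of a backup that does not sit behind the same regional outage, the sketch below copies an RDS snapshot from us-east-1 into eu-west-1 with boto3. The snapshot identifiers and account ID are placeholders, and copying is only half the job: the restore still has to be rehearsed.

    # copy_snapshot.py - cross-region RDS snapshot copy (sketch, placeholder identifiers).
    import boto3

    SOURCE_REGION = "us-east-1"
    DEST_REGION = "eu-west-1"

    rds_dest = boto3.client("rds", region_name=DEST_REGION)

    response = rds_dest.copy_db_snapshot(
        SourceDBSnapshotIdentifier=(
            f"arn:aws:rds:{SOURCE_REGION}:123456789012:snapshot:orders-db-2025-10-20"
        ),
        TargetDBSnapshotIdentifier="orders-db-2025-10-20-euw1",
        SourceRegion=SOURCE_REGION,  # lets boto3 presign the cross-region copy request
        # Encrypted snapshots also need a KmsKeyId valid in the destination region.
    )
    print(response["DBSnapshot"]["Status"])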

9. Decide what really matters

Not every part of your system needs the same level of resilience. You must be explicit about what stays up at all costs, what degrades gracefully and what can wait until the system recovers.

Here’s how to structure that:

  • Tier your services by business impact (Tier 1 = customer-facing and revenue-critical, Tier 2 = internal but essential, Tier 3 = nice-to-have).
  • For Tier 1 services, invest in the highest resilience: cross-region, dual-vendor, fail-over tested.
  • For Tier 2 services, design them to degrade gracefully: maybe read-only, reduced functionality, less frequent backups.
  • For Tier 3 services, accept that in a major outage they might pause entirely — document that and set user expectations.
  • Ensure you map dependencies: if your Tier 1 service uses VendorServiceX that only runs in US-East-1, your “Tier 1 resilience” assumption is broken.
  • Communicate internally: make sure product, business and engineering teams understand what is “must survive” vs “nice to have”.

Being deliberate about where you spend your resilience budget means you survive the outage without going broke doing it.
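
The tiering exercise is more useful as data than as a wiki page. Below is a hypothetical service catalogue with a check that flags Tier 1 services that run, or depend on something that runs, in a single region; the service names and structure are illustrative.

    # service_tiers.py - hypothetical service catalogue with a single-region check.
    SERVICES = {
        "checkout-api":   {"tier": 1, "regions": ["us-east-1", "eu-west-1"], "depends_on": ["payments-db"]},
        "payments-db":    {"tier": 1, "regions": ["us-east-1"], "depends_on": []},
        "reporting-jobs": {"tier": 3, "regions": ["us-east-1"], "depends_on": []},
    }

    def single_region_risks(catalogue):
        """Flag Tier 1 services anchored, directly or via a dependency, to one region."""
        risks = []
        for name, svc in catalogue.items():
            if svc["tier"] != 1:
                continue
            anchored = [name] if len(svc["regions"]) < 2 else []
            anchored += [d for d in svc["depends_on"] if len(catalogue[d]["regions"]) < 2]
            if anchored:
                risks.append(f"{name}: single-region dependency on {', '.join(sorted(set(anchored)))}")
        return risks

    if __name__ == "__main__":
        for risk in single_region_risks(SERVICES):
            print(risk)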

10. Learn from every outage

No matter how resilient you are, some outage will surprise you. The question is: do you treat it as a one-off or as a learning opportunity? The AWS outage reminded us of the fragility of cloud infrastructure.

What to do:

  • Conduct a proper post-mortem: what failed, why did it take so long, what were the unknown dependencies?
  • Update your architecture diagrams, risk registry, dependency list and playbooks.
  • Re-assess vendor dependency: after the incident ask yourself if you still accept the same risk or whether you need more redundancy.
  • Make conscious trade-offs: If you decide not to invest further resilience, document the decision, the residual risk and get buy-in from business leadership.
  • Share the learning: ensure engineering, ops, product and leadership teams all understand what happened and what will change next.

In doing so, you move from reactive to proactive resilience.

Call-out: Services tied to US-East-1 control-planes

It’s worth flagging specifically that some “global” vendor services aren’t fully region-agnostic. For AWS, examples include:

  • AWS Identity and Access Management (IAM): the control-plane is located in US-East-1, even though the data-planes are regional.
  • AWS Organizations: similarly uses the US-East-1 region for its control plane.
  • Some AWS “global” endpoints: for example, the default global STS endpoint (sts.amazonaws.com) is served out of US-East-1 unless you configure a regional endpoint.

What this means: even if your workload is hosted in another region, you may still be depending on services that have a control-plane in US-East-1. That is a hidden dependency you need to treat as a single point of failure.
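
One practical mitigation is to pin SDK clients to regional endpoints rather than the global ones. As a sketch, this is what that looks like for STS with boto3; newer SDKs can achieve the same thing with the sts_regional_endpoints configuration setting.

    # regional_sts.py - avoid the global STS endpoint (sketch).
    import boto3

    # The global endpoint (sts.amazonaws.com) is served from US-East-1; a regional
    # endpoint keeps this dependency inside the region you actually run in.
    sts = boto3.client(
        "sts",
        region_name="eu-west-1",
        endpoint_url="https://sts.eu-west-1.amazonaws.com",
    )
    print(sts.get_caller_identity()["Account"])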

Closing thoughts

The AWS outage in October 2025 reminded everyone that “the cloud” is still, fundamentally, someone else’s computer. Yes it is powerful and scalable, but it is not magic. The architectures we build today may survive load, but many are still vulnerable to a single-region failure, or to dependencies we thought were global but are in fact regional.

Real resilience doesn’t come from ticking boxes or buying “multi-cloud” badges. It comes from clarity (knowing what you depend on), design (architecting for failure), practice (running drills and fail-overs), and reflection (learning from incidents). The next outage will happen. The only question is whether you will be scrambling or calmly executing your fallback.

Gary Worthington is a software engineer, delivery consultant, and fractional CTO who helps teams move fast, learn faster, and scale when it matters. Through his consultancy, More Than Monkeys, Gary helps startups and scale-ups improve how they build software — from tech strategy and agile delivery to product validation and team development.

Visit morethanmonkeys.co.uk to learn how we can help you build better, faster.
Follow Gary on LinkedIn for practical insights into engineering leadership, agile delivery, and team performance.