I&O Resilience Is Not Optional

Infrastructure failures are going to happen. I&O leaders need to understand how dependencies can cause outages of critical business processes, take steps to reduce their likelihood and impact, and engage with stakeholders and executives to agree on acceptable risk levels and mitigation strategies.

The last three months of 2025 have seen multiple major Internet outages resulting from issues at AWS, Azure, and Cloudflare. These events had a significant impact on services provided by banks, technology, media, and communications companies, and many other organizations. In some cases the outages resulted in a diminished level of service; in others, organizations were unable to conduct business until the issues were resolved.

In his post Why Technology Resilience Isn’t Optional Anymore, Chris Steffen (VP of Research at EMA) ties the lessons from these outages back to the need for technology resilience that enables critical business processes to satisfy customer expectations and regulatory requirements. Chris highlights the need for a strategic approach to building and maintaining systems that can continue to meet business needs during service outages.

Where to Start on Your Resiliency Journey
Tech resiliency is a continuous, ever-changing journey, not a one-and-done project. When embarking on this journey, consider these three things:

  1. Identify & Prioritize Critical Workloads and Infrastructure. You can’t protect everything equally. Pinpoint your mission-critical applications and data – the systems that must run for your business to operate. Define clear recovery time objectives (how fast you need to be back up) and recovery point objectives (how much data loss you can tolerate) for these crucial assets.
  2. Embrace Redundancy and Automation. Don't put all your eggs in one basket. This means more than just backups; think multi-region or multi-cloud deployments so a failure in one location doesn't take you offline. Leverage automation to detect issues and seamlessly fail over to a healthy system in minutes, not hours.
  3. Test, Test, and Test Again. A recovery plan that hasn't been tested is just a nice-sounding theory. Implement regular resilience testing. Consider "chaos engineering" in which you intentionally introduce small failures in a controlled environment to see how your systems (and teams) react. This helps you find weaknesses and build muscle memory before a real crisis hits.
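To make step 1 concrete, here is a minimal sketch of classifying workloads by their recovery objectives; the workload names and tier thresholds are hypothetical illustrations, not recommendations:

```python
from dataclasses import dataclass

# Hypothetical workload inventory; RTO/RPO values are illustrative only.
@dataclass
class Workload:
    name: str
    rto_minutes: int   # recovery time objective: how fast it must be back up
    rpo_minutes: int   # recovery point objective: how much data loss is tolerable

def tier(w: Workload) -> str:
    """Bucket workloads so the strictest objectives get the most protection."""
    if w.rto_minutes <= 15 and w.rpo_minutes <= 5:
        return "mission-critical"
    if w.rto_minutes <= 240:
        return "business-important"
    return "best-effort"

inventory = [
    Workload("payments-api", rto_minutes=10, rpo_minutes=1),
    Workload("reporting-batch", rto_minutes=1440, rpo_minutes=720),
]

for w in inventory:
    print(w.name, "->", tier(w))
```

Agreeing on the thresholds themselves is part of the stakeholder conversation; the point is that protection spend should follow the tiering, not be spread evenly.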
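Step 2's detect-and-fail-over idea can be sketched in a few lines; the region names and the health probe below are hypothetical stand-ins for a real monitoring endpoint:

```python
# Hypothetical two-region setup; in practice check_health would hit a real
# health endpoint with a short timeout rather than return a canned answer.
REGIONS = ["us-east", "eu-west"]

def check_health(region: str) -> bool:
    # Placeholder probe: simulate us-east being down and eu-west healthy.
    return region == "eu-west"

def pick_active(regions) -> str:
    """Return the first healthy region, failing over in order of preference."""
    for r in regions:
        if check_health(r):
            return r
    raise RuntimeError("no healthy region available")

active = pick_active(REGIONS)
print("routing traffic to", active)
```

Real deployments delegate this loop to a load balancer or DNS health check, but the logic is the same: automated detection plus an ordered failover list.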
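A toy version of the fault injection behind chaos engineering (step 3), assuming a simple retry-based caller; real tools such as Chaos Monkey inject failures at the infrastructure level rather than in application code:

```python
import random

def flaky(fn, failure_rate=0.3, rng=random.random):
    """Wrap a call so it fails a configurable fraction of the time,
    exercising the caller's error-handling and retry path."""
    def wrapped(*args, **kwargs):
        if rng() < failure_rate:
            raise ConnectionError("injected fault")
        return fn(*args, **kwargs)
    return wrapped

def with_retries(fn, attempts=5):
    """Retry on injected faults; surface a hard error if all attempts fail."""
    for _ in range(attempts):
        try:
            return fn()
        except ConnectionError:
            continue
    raise RuntimeError("all retries exhausted")

random.seed(0)  # deterministic for the demo
fetch = flaky(lambda: "ok", failure_rate=0.3)
print(with_retries(fetch))
```

Running this kind of wrapper against systems in a controlled environment reveals whether retries, timeouts, and failovers actually behave as the runbook assumes.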

I would extend Chris’s plan to include communications with the executive and business teams about possible sources of outages and what’s being done to limit negative impact.

  4. Discuss Risks and Mitigation Approaches with Stakeholders. Business leaders and C-level executives do not like being surprised by bad news. Assess the potential for business process impact caused by dependencies on external providers, along with possible mitigation strategies. Discuss outage likelihood and impact with stakeholders, come to agreement on what risk level is acceptable given the importance of the business process[1], and create a joint runbook for how the organization will operate if an outage occurs.

The major cloud service providers will continue to improve the reliability and resilience of the services that they provide, but there will always be new scenarios that cause outages. Understanding how the dependencies that you don’t control can impact critical business processes is an important part of leading an I&O organization.


  1. Getting to zero risk and no downtime can be very expensive. ↩︎