Stress Fracture: What the massive Rogers network collapse can teach us about stress testing in complex systems

Last week, most of Canada was rocked by a massive cellular and internet outage affecting millions of Rogers Communications customers from coast to coast. More than a mere nuisance, the network collapse affected many essential services, from credit card and e-banking to 911 emergency services. Reports of people being unable to reach ambulance support without access to a landline were among the most harrowing details to arise from this now week-long event.

Although a full explanation of how such a catastrophic failure could occur is still lacking, two concerning realities seem pretty clear: nobody saw this coming, and the system is ill-prepared to prevent similar outages in the future. This has everyone from governments to consumer advocacy groups calling for proactive steps to better protect the national telecom infrastructure. The irksome related question then arises: how do you prevent something that you didn’t see coming in the first place?

Stress Testing

Complex systems like a telecommunications network or a hospital are not designed to work well at maximum capacity. Toronto’s Pearson International Airport has been crippled of late by a perfect storm: a surge in passenger volume coupled with pandemic-related process delays and staffing shortages, leaving a hellish landscape of travel-related nightmares in its wake. Health care is particularly ill-equipped to manage large surges in user demand (we even have a word for it: a “code orange” or mass casualty event, where demands outstrip available resources). This doesn’t jibe well with the reality that most health care systems are already strained at or beyond capacity, with an associated slow (and at times dramatic) erosion in access to care.

It is surprising, then, that organizations that deal with emergent complexity spend so little time studying their systems at the extremes. Stress testing describes exactly that: the study and analysis of how a system performs at the brink of collapse. Just as a treadmill test pushes the heart beyond its aerobic capacity to detect illness, stress testing a system sheds light on its potential vulnerabilities and helps generate viable solutions.

The Pre-Mortem and Prospective Hindsight

Cognitive psychologist Gary Klein described the Pre-Mortem as a method for gaining insight (or, more accurately, prospective hindsight) into the reasons why a system, project or idea might fail. The process involves a group of stakeholders, briefed in advance that failure has occurred, brainstorming a list of possible reasons why things didn’t go as planned. The list of hypothetical explanations for the hypothetical catastrophe is tabulated and codified, and used to prospectively bolster the implementation plan. The Pre-Mortem is powerful because it harnesses the collective expertise of a project team, allowing the group to consider possible outcomes from future events before a decision is made, rather than scrambling after the fact to pick up and analyze the pieces. The process works best when all stakeholders, of all ranks and diversity of opinion, are given an equal voice; an effective Pre-Mortem is the antithesis of an echo chamber in which leaders bask in the warmth of their own untested ideas.

How Simulation Can Help

Simulation is a series of techniques used to re-create a situation or environment, allowing participants to experience a representation of a real event for the purpose of understanding systems or human actions. To quote simulationist and Advanced Performance co-founder Andrew Petrosoniak, simulation provides a free look at the future. The “free” in this case is not free of cost; when it comes to the design of complex systems there is no free lunch, only thoughtful investment in tools that promote safety, efficiency, and user-centred design. Rather, the “free” refers to gathering prospective hindsight without placing actual people, places or things in jeopardy.

Simulation provides a near-limitless range of possibilities for examining a system under stress, toggling variables in a controlled manner and examining the output. Stress testing using simulation can take many forms, from talk-through exercises and table-top discussions to full-scale, in-situ simulations that take place in the actual workplace. Imagine an emergency department bursting at the seams with patients, its leadership struggling to implement solutions ahead of the next wave of COVID-19. Waiting for the surge to arrive is as unethical as it is impractical. A better strategy could involve simulating a variety of possible failure scenarios and using the prospective hindsight to generate and implement data-informed solutions.

Simulation-informed system design builds on Klein’s notion of the Pre-Mortem by allowing stakeholders not just to talk through, but to physically test and experience a multitude of possible future scenarios. Rather than simply identifying problems, simulation provides an opportunity to test and vet possible solutions, with powerful implications for informing future decision making.

Christopher Hicks is an emergency physician, trauma team leader, and co-founder of Advanced Performance in Toronto.