My father, he of a military background, is the didactic sort, akin to the econ professor you didn’t want to get in Uni. He was forever forcing his favourite aphorisms down my adolescent throat, hoping, I suppose, to one day make wisdom pâté. One of his most notable was the “Six Ps”: prior planning prevents piss poor performance.
This summer, after seeing how cybersecurity firm CrowdStrike’s failed update to the Falcon Sensor impacted over 8 million computers running MS Windows, I found myself regurgitating Dad’s ‘Six Ps’. My macabre curiosity was piqued by the causes and impacts of what is now the largest IT incident in history. My job is to study and improve operational resilience, a topic about which my colleagues and I at RocketFin are geekily wonky.
Initially, millions of Windows machines crashing (whilst Linux and Mac users smugly tweeted their problem-free resilience) unleashed historic levels of BillGatesian, anti-MS scorn. My favourite meme had the Microsoft 365 logo, with an X over the “5” and a “4” scribbled on top. But it soon became apparent that a third-party cyber-threat vendor had triggered the Windows-specific crashes. For most firms, it is hard enough to control their own processes; when one outsources a capability, how can a responsible manager be sure those systems are resilient?
Of course, global regulators have spent the last decade wrestling with the operational impacts of modern business, particularly the uncertainties around sophisticated cyber-attacks and the increased concentration risk of centralised systems. The UK has rolled out its Operational Resilience Framework, and our EU neighbours have technocratically introduced the Digital Operational Resilience Act, affectionately known as DORA. As prudential stewards, the regulators are concerned with systemic risk (think Lehmans), financial risk (think Barings) or customer harm (think little old lady unable to get £10 from an ATM). Companies are still trying to bed their resilience strategies into BAU and integrate a holistic approach to resilience testing, monitoring, reporting and governance. And regulators, hot off supply-chain pressures during the COVID pandemic, have their sights firmly set on third-party resilience.
Now that the dust has settled somewhat, we can evaluate CrowdStrike’s initial post-mortem and Root Cause Analysis (RCA) report to determine lessons learned that can be applied to improve prevention and recoverability, as well as get a sense of the response from the authorities.
On July 19, CrowdStrike, a cybersecurity firm and MS vendor, released an update to their clients’ cloud cyber-capability sensors. The CS team had tested and stressed the update and had rolled out similar updates in H1 2024. Alas, the new content defined 21 sensor input fields while the sensor supplied only 20; the attempt to read the missing 21st field caused an out-of-bounds memory read, which forced a Windows system crash and the horrific blue screen of death. The BSOD causes an instant adrenalin panic in any IT manager, and the subsequent computer failures caused colossal loss of service across multiple industries and massive downtime across the globe.
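To make the mechanism concrete, here is a deliberately simplified sketch in Python. It is not CrowdStrike’s code (the Falcon sensor is a Windows kernel driver, where an out-of-bounds index is a raw memory read rather than a tidy exception); the function name, field values and counts are illustrative assumptions only.

```python
# Simplified illustration only -- NOT CrowdStrike's driver code. In a kernel
# component, reading past the end of a buffer touches raw memory and can crash
# the machine; Python merely raises IndexError.

def read_field(record: list, index: int):
    # The "content interpreter": looks up one input field by position.
    return record[index]

sensor_inputs = ["value"] * 20   # the sensor supplies 20 input fields
template_field_index = 20        # the new content references a 21st field (index 20)

try:
    read_field(sensor_inputs, template_field_index)
except IndexError:
    # Handled gracefully here; the kernel-level equivalent of this read past
    # the buffer is what produced the blue screen of death.
    print("out-of-bounds read: content expects 21 fields, sensor supplied 20")
```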
First, let’s look at the damage estimates:
Two major issues:
Communication
CS has hired third parties to perform code and process reviews and has already changed its software testing protocols. Most importantly, future updates will be released via ‘canary deployments’, wherein a gradual, tiered roll-out allows the company to limit systemic impact. This is the most obvious lesson learned; a sketch of the idea follows below.
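For readers less familiar with the term, here is a minimal sketch of how a tiered canary gate can work. The ring names, percentages and host-bucketing scheme are illustrative assumptions, not a description of CrowdStrike’s actual deployment tooling.

```python
# Minimal sketch of a tiered ("canary") roll-out gate. Ring names, percentages
# and the hashing scheme are illustrative assumptions only.
import hashlib

# Each ring releases to a larger slice of the fleet only after the previous
# ring has run cleanly for an agreed soak period.
ROLLOUT_RINGS = [
    ("canary",  0.01),   # 1% of hosts, typically internal machines first
    ("early",   0.10),   # 10%
    ("broad",   0.50),   # 50%
    ("general", 1.00),   # everyone
]

def host_bucket(host_id: str) -> float:
    """Map a host to a stable value in [0, 1) so it always lands in the same ring."""
    digest = hashlib.sha256(host_id.encode()).hexdigest()
    return int(digest[:8], 16) / 0x100000000

def should_receive_update(host_id: str, active_ring: str) -> bool:
    """A host gets the update only once the roll-out has reached its bucket."""
    coverage = dict(ROLLOUT_RINGS)[active_ring]
    return host_bucket(host_id) < coverage

# While the roll-out is still in the "canary" ring, roughly 99% of hosts are
# untouched, so a bad update is contained rather than global.
print(should_receive_update("LON-TRADING-042", "canary"))
```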
Most firms have strong business continuity processes (hat tip, 9/11) and adequate operational risk metrics.
Your firm should have a well-developed regulatory resilience narrative pack at the ready, one that delineates your continuity and resilience strategy, capabilities, and actions.
This pack should include:
Regulators are increasingly focused on proof of action a priori, and your team must be ready for the always-fun email requesting a regulator meeting with your boss in two days’ time.
CrowdStrike had been a Wall Street darling, yet a small field error caused widespread global impacts on real people. It was an existential-level event, and one worth studying further.
Because as Dad also preached, “Fool me once, shame on you. Fool me twice, ….”
If you want to know more, and continue the geek-dive, drop me a line at david.feltes@rocketfin.co
David has 20+ years of experience in leading exchange and trading operations and has held various senior roles in consultancy firms, specialising in Operational Resilience. David was instrumental in delivering the SIMEX22 market-wide exercise and most recently led the ORYX market-wide operational resilience exercise with UK Finance.