
Six Weeks In: Lessons Learned from the CrowdStrike Post-Mortem

Written by David Feltes | Aug 22, 2024 1:11:10 PM

 

My father, he of a military background, is the didactic sort, akin to the econ professor you didn’t want to get at Uni. He was forever forcing his favourite aphorisms down my adolescent throat, hoping, I suppose, to one day make wisdom pâté. One of his most notable was the “Six Ps”: prior planning prevents piss poor performance.

This summer, after watching cybersecurity firm CrowdStrike’s failed update of the Falcon Sensor crash more than 8 million computers running MS Windows, I found myself regurgitating Dad’s ‘Six Ps’. My macabre curiosity spiked over the causes and impacts of what is now the largest IT incident in history. My job is to study and improve operational resilience, a topic about which my colleagues and I at RocketFin are geekily wonky.

Initially, millions of Windows machines crashing (whilst Linux and Mac users smugly tweeted their problem-free resilience) drew historic BillGatesian and anti-MS scorn. My favourite meme had the Microsoft 365 logo with an X over the “5” and a “4” scribbled on top. But it soon became apparent that a 3rd party cyber-threat vendor had triggered the Windows-specific crashes. It is hard enough for most firms to control their own processes; when one out-sources a capability, how can a responsible manager be sure those systems are resilient?

Regulation to the Rescue

Of course, global regulators have spent the last decade wrestling with the operational impacts of modern business, particularly the uncertainties of sophisticated cyber-attacks and the increased concentration risk of centralised systems. The UK has rolled out its Operational Resilience Framework, and our EU neighbours have technocratically followed with the Digital Operational Resilience Act, affectionately known as DORA. As prudential stewards, the regulators are concerned with systemic risk (think Lehmans), financial risk (think Barings) and customer harm (think of the little old lady unable to get £10 from an ATM). Companies are still trying to bed their resilience strategies into BAU and to integrate a holistic approach to resilience testing, monitoring, reporting and governance. And regulators, hot off supply-chain pressures during the COVID pandemic, have their sights firmly set on 3rd party resilience.

Now that the dust has settled somewhat, we can evaluate the initial CS post-mortem and Root Cause Analysis (RCA) report to determine lessons learned that can be applied to improve prevention and recoverability, as well as get a sense of the response from the authorities.

What Happened?

On July 19, CrowdStrike, a cybersecurity firm whose cloud-delivered Falcon sensors protect millions of Windows machines, released an update to its clients’ sensors. The CS team had tested and stressed the update and had rolled out similar updates in H1 2024. Alas, the new content provided 21 sensor input fields when the sensor only expected 20, causing an out-of-bounds memory read that forced a Windows system crash and the horrific blue screen of death. The BSOD triggers an instant adrenalin panic in any IT manager, and the cascading computer failures caused colossal loss of service across multiple industries and massive down-time across the globe.
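
To make the mechanics concrete, here is a minimal sketch of the failure pattern, written in Python rather than the sensor’s own code and with every name invented for illustration: an interpreter built around a 20-field template meets an update carrying 21 fields.

```python
# Illustrative sketch only -- assumed names, not CrowdStrike's actual code.
# The interpreter below is built around a 20-field template, but the faulty
# update delivers 21 fields; indexing past the template is the managed-code
# analogue of the out-of-bounds memory read that crashed the Windows hosts.

EXPECTED_FIELDS = 20

def interpret(update_fields):
    """Map each incoming field onto a fixed 20-slot matcher template."""
    template = [f"matcher_{i}" for i in range(EXPECTED_FIELDS)]
    for i, value in enumerate(update_fields):
        matcher = template[i]   # field 21 (index 20) falls off the end here;
                                # unchecked kernel code reads past the buffer
                                # instead of raising a tidy exception.
        print(f"{matcher} <- {value}")

bad_update = [f"field_{i}" for i in range(21)]  # the faulty update carried 21 fields

try:
    interpret(bad_update)
except IndexError:
    print("Out-of-range field access -- the defect a bounds check should catch.")
```

The language is beside the point; the pattern is a single unchecked assumption about input shape, executed with high privileges on millions of machines at once.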

Fast Stats on Impact

First, let’s look at the damage estimates:

  • ~8.5 million Windows devices were hit (as per the CS 8-K filing)
  • Global losses estimated at $15B (as per Parametrix)
  • Full closure of US government agencies, like the Social Security and Education departments (did someone say systemic?)
  • Over 5,000 commercial airline flights were cancelled
  • Recovery time: while some IT teams were able to install the fix quickly, it took 5 days to get 97% of sensors back online
  • Concentration risk of suppliers: According to SecurityScorecard, 15 firms account for 62% of the market
  • CS stock dropped 30%, but has since partially recovered and is now down ~10%

What was the major screw-up?

Two major issues:

  • The CS testing and rollout program had defects of its own, meaning the field error was never caught before release.
  • The major error: performing a fork-lift upgrade on all clients simultaneously. Emboldened by the ease of earlier 2024 upgrades (and a desire to push out new tech quickly), CS became cocky and moved away from the industry best practice of a phased, or tiered, update process.

What did CS get right?

Communication

  • Messaging Out: After breaking it, the CrowdStrike team rapidly found the fix, communicated quickly with impacted firms, and helped them remediate and restart.
  • Messaging Up: As the issue lay in cyber software, management quickly issued a Post-Incident Review confirming the error was a failed upgrade, not a cyber incident. This allowed many a CISO to focus on recovery rather than quarantine.

What is CS doing to improve?

CS has hired 3rd parties to perform code and process reviews and has already changed its software testing protocols. Most importantly, future upgrades will be rolled out via “canary deployments”, wherein a gradual, tiered roll-out will allow the company to limit systemic impacts. This is the most obvious lesson learned.
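
As a thought experiment, the canary idea looks roughly like the sketch below; the ring names, fleet shares and health check are hypothetical, not CrowdStrike’s actual deployment pipeline.

```python
# A hedged sketch of a tiered ("canary") rollout: push an update ring by ring
# and halt at the first unhealthy signal. Ring names, fleet shares and the
# health check are illustrative assumptions.

RINGS = [
    ("internal",  0.01),   # the vendor's own hosts first
    ("early",     0.05),   # opted-in early adopters
    ("broad",     0.44),   # general population, in waves
    ("remaining", 0.50),
]

def rollout(update_id, is_healthy):
    """Deploy ring by ring; stop the moment telemetry looks wrong."""
    for ring, share in RINGS:
        print(f"Deploying {update_id} to ring '{ring}' ({share:.0%} of fleet)")
        if not is_healthy(ring):
            print(f"Telemetry regression in ring '{ring}' -- halting rollout")
            return False
    return True

# A defect caught in the first ring never reaches millions of machines.
rollout("falcon-content-update", is_healthy=lambda ring: ring != "internal")
```

The trade-off is speed versus blast radius: threat content takes longer to reach the whole fleet, but a defect caught in the first ring stays a contained incident rather than a global outage.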

What should you be doing in your firm?

Most firms have strong business continuity processes (hat tip, 9/11), and adequate operational risk metrics.

Your firm should have a well-developed regulatory resilience narrative pack at the ready that delineates your continuity and resilience strategy, capabilities, and actions.

This pack should include:

  • Strategy statement and resilience policy: A clear, short corporate resilience mission statement of your firm’s approach to recovery, written in plain, jargon-free language that Board members and regulators can understand.
  • Rapid Response Team: Who is the team that will enable recovery? Clear response teams and decision trees are essential.
  • Scenario playbooks: Clearly delineated playbooks for response teams to mitigate customer harm and diminish losses.
  • Exercise Scenario Library: 10-20 extreme but plausible events (e.g. power outage, pandemic, cyber-attack, …) that test your firm’s impact tolerance, including industry-wide events (SIMEX), 3rd party events (ORYX) and internal desktop exercises that show management and the regulators you have thought about resilience. A typical firm should be exercising at least 5 times per annum.
  • Continuous Monitoring KRIs: Ongoing monitoring can detect threats before they materialise into incidents. Does your firm have automated Key Risk Indicators (KRIs) that feed a resilience dashboard? (A minimal sketch follows this list.)
  • 3rd Party Risk Assessments: Every 3rd party (from your cloud provider to the digital app that connects to the atomic clock) should have its SLAs and legals in one place, with clear communication channels in place.
  • Threat Intelligence Integration: Utilising threat intelligence allows organisations to stay ahead of emerging threats.
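
On the KRI point above, here is a minimal sketch of what an automated indicator feed behind a resilience dashboard might look like; the metrics and thresholds are hypothetical examples, not a prescribed standard.

```python
# Hypothetical Key Risk Indicators feeding a simple resilience dashboard.
# Metric names and thresholds are illustrative assumptions only.

from dataclasses import dataclass

@dataclass
class KRI:
    name: str
    value: float
    amber: float   # early-warning threshold
    red: float     # breach of impact tolerance

    def status(self):
        if self.value >= self.red:
            return "RED"
        if self.value >= self.amber:
            return "AMBER"
        return "GREEN"

indicators = [
    KRI("3rd party patch failure rate (%)",       1.8, amber=1.0, red=5.0),
    KRI("Critical service recovery time (hours)", 3.5, amber=2.0, red=4.0),
    KRI("Single-vendor host concentration (%)",   62,  amber=40,  red=60),
]

for kri in indicators:
    print(f"{kri.status():>5}  {kri.name}: {kri.value}")
```

Anything trending amber is a conversation for the resilience forum; anything red should already be in a playbook.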

How can we help?

Regulators are increasingly focused on proof of action a priori, and your team must be ready for the always-fun email requesting a regulator meeting with your boss in two days.

CrowdStrike had been a Wall Street darling, yet a small field error caused widespread global impacts on real people. It was an existential-level event, and worth studying further.  

Because as Dad also preached, “Fool me once, shame on you. Fool me twice, ….”

If you want to know more, and continue the geek-dive, drop me a line at david.feltes@rocketfin.co

David has 20+ years of experience in leading exchange and trading operations and has held various senior roles in consultancy firms, specialising in Operational Resilience. David was instrumental in delivering the SIMEX22 market-wide exercise and most recently led the ORYX market-wide operational resilience exercise with UK Finance.