Thu, Aug 8, 2024

CrowdStrike Incident and Systemic Cyber Risk in the Context of EU Regulations

The recent CrowdStrike incident caused havoc across the planet. A routine update turned into a global issue that impacted approximately 8.5 million computers across virtually every industry, including banks, stock exchanges, hospitals and airlines. The associated costs for Fortune 500 companies alone are estimated to have reached $5.4 billion.

Paradoxically, a highly regarded and sophisticated security company, which has been successfully protecting organizations against business interruptions caused by cyber-attacks for over a decade, inadvertently created what many are calling the biggest IT outage in history.

In this article, we’ll look at the key factors that led to this incident, the controls that could have prevented it, and why new regulatory requirements, such as DORA and NIS2 in the EU and the Critical Third-Party (CTP) Regime and Operational Resilience Rules in the UK, are designed to reduce the associated risks. 

What Happened?

CrowdStrike is a market-leading security company that offers solutions and services to a wide spectrum of organizations. Its endpoint security monitoring software is used by some of the biggest and most mature companies in the world because it excels at what it does – detecting and preventing malware and other malicious activities in real time. As with other security vendors, its software is managed and monitored remotely, and relies upon automated updates sent to endpoint computers on a regular basis to ensure they’re kept up to date. This is crucial, as new threat intelligence – which CrowdStrike translates into detection signatures – must be deployed as quickly as possible to protect against the latest threats. The problem here was that one of these updates contained a bug that inadvertently caused Microsoft’s notorious Blue Screen of Death (BSOD). In short, it crashed the most popular operating system in the world.

The catastrophic business interruption that ensued was amplified by the widespread use of both Microsoft and CrowdStrike. The confluence of two commonly used technologies created a concentration of risk that is very difficult to avoid. Understanding and quantifying the resulting damage will take far longer than recovering from the outage itself, as will identifying the appropriate mechanisms that organizations can put in place to prevent a recurrence. While CrowdStrike quickly published a remediation guide that described the issue and the fix, each organization had to work out how to implement it as broadly and quickly as possible and, in many cases, ascertain the impact on its wider ecosystem. For example, even if an organization was not impacted directly, its third-party supply chain may have been, with potential knock-on effects including disruption to business services.

Why Did It Happen? 

A routine CrowdStrike update designed to gather telemetry and detect threats on Microsoft Windows systems contained a faulty file that caused the operating system (OS) to malfunction. The update was designed to bolster security but, as with any software update, it also had the potential to make inadvertent changes to other software installed on the computer on which it was deployed, including the OS itself.

This is why software developers and vendors typically put robust testing and quality control processes in place prior to the deployment of updates, be it for a widely used production tool or an application running in the background. These processes facilitate the detection and correction of bugs and interoperability issues before they are rolled out, and create problems, in production environments.

In this case, CrowdStrike has disclosed that one of its testing tools had a flaw which allowed the faulty code to pass validation. The undetected bug that impacted Microsoft’s OS (likely the most common OS CrowdStrike is used on) was therefore able to slip past the vendor’s checks and controls.

CrowdStrike has since confirmed that it has implemented a suite of additional checks and safeguards to prevent problematic updates from being deployed in the future. These include local developer testing, content update and rollback testing, as well as stress and stability testing. It has also implemented a staggered deployment strategy in which updates are first deployed to a small user base and then gradually to larger groups of endpoints.
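As an illustration only, the staggered deployment idea can be sketched in a few lines of Python. This is not CrowdStrike’s implementation; the names `plan_rings`, `staged_rollout`, `deploy` and `is_healthy`, the ring fractions and the failure threshold are all hypothetical assumptions chosen for the example. The sketch updates a small group of endpoints first, checks their health, and only proceeds to the next, larger group if the failure rate stays within the threshold.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Sequence


@dataclass
class RolloutResult:
    completed: bool                    # True if every ring was updated
    updated: List[str]                 # endpoints that received the update
    halted_ring: Optional[List[str]]   # ring that tripped the threshold, if any


def plan_rings(endpoints: Sequence[str], fractions: Sequence[float]) -> List[List[str]]:
    """Split endpoints into rings; each fraction is the cumulative share covered."""
    rings: List[List[str]] = []
    start = 0
    for fraction in fractions:
        if start >= len(endpoints):
            break
        # Every ring gets at least one endpoint, never more than remain.
        end = min(len(endpoints), max(start + 1, round(len(endpoints) * fraction)))
        rings.append(list(endpoints[start:end]))
        start = end
    if start < len(endpoints):
        # Fractions did not reach 100%: the remainder forms a final ring.
        rings.append(list(endpoints[start:]))
    return rings


def staged_rollout(
    endpoints: Sequence[str],
    deploy: Callable[[str], None],       # hypothetical: pushes the update to one endpoint
    is_healthy: Callable[[str], bool],   # hypothetical: post-update health check
    fractions: Sequence[float] = (0.01, 0.1, 0.5, 1.0),  # assumed ring sizes
    max_failure_rate: float = 0.0,       # assumed threshold: halt on any failure
) -> RolloutResult:
    """Deploy ring by ring, halting as soon as a ring exceeds the failure threshold."""
    updated: List[str] = []
    for ring in plan_rings(endpoints, fractions):
        for endpoint in ring:
            deploy(endpoint)
        updated.extend(ring)
        failures = sum(1 for endpoint in ring if not is_healthy(endpoint))
        if failures / len(ring) > max_failure_rate:
            return RolloutResult(completed=False, updated=updated, halted_ring=ring)
    return RolloutResult(completed=True, updated=updated, halted_ring=None)
```

Under these assumptions, a faulty update pushed to 100 endpoints would stop after the first ring of one machine rather than reaching all of them. A real rollout would also need a soak period between rings and a rollback path, both omitted here.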

Regulatory Focus on Resilience and Systemic Risk

This incident clearly highlights what regulators have been worried about - and warning about - for many years: the risks arising from the interconnected nature of modern IT infrastructure and the need for proactive resiliency planning and preparedness.

The associated systemic risk has grown in importance in the eyes of regulators over the last decade or so. Since the 2008 financial crisis, it has been acknowledged that tighter oversight is required for the financial industry, as well as for other critical industries and infrastructure. The scope and focus have thus broadened beyond financial risk to other types, including digital and cyber resiliency. This is not least because of the increase in large-scale, high-impact incidents, the geopolitical actors and motivations involved in many cases, and the reality of just how interconnected and vulnerable our digital and cybersecurity infrastructure can be, as this latest incident starkly reminded us.

We are seeing this manifest in regulations that focus on digital and cyber resilience. DORA and NIS2 (both EU-wide in scope) and the UK’s Operational Resilience Rules are three examples coming into force in Europe (in January 2025, October 2024 and March 2025 respectively). The EU regulations in particular are first-of-their-kind, large-scale initiatives to consolidate the different aspects of digital and cyber risk into broader resiliency-themed regulations. As with GDPR, it’s expected that similar requirements will be considered for adoption by other regulators around the world. Indeed, U.S. regulators are already engaging with their EU and UK counterparts and are keeping a close eye on how DORA and NIS2 are implemented. A key area of focus is incident reporting requirements, in part because regulators want to align, where appropriate, to reduce the burden on global, multi-jurisdictional companies that must comply with multiple competing regulations.

In the context of this specific incident, there are a few key areas of focus within these regulations that aim to mitigate the associated risks. Kroll strongly recommends that organizations review and enhance their capabilities across these domains:

  • Proactive risk management, including regular risk assessments, identification of critical functions and systems, and the development of appropriate treatment and mitigation strategies and frameworks.
  • Robust business continuity, disaster recovery and resilience planning, covering key internal and external systems and services, as well as backup and recovery solutions which are tested on a regular basis.
  • Incident response plans and processes which are clearly defined and tested via various disruption scenarios, including, where appropriate, critical third parties and systems.
  • Third-party and supply chain risk management, including continuous governance, clear communication channels and repeatable risk assessments and mitigation reviews to ensure third parties meet security standards.
  • Effective change management and development processes, ensuring rigorous planning, testing and quality assurance controls are embedded into change management and the software development lifecycle.

While DORA and NIS2 are currently getting a lot of attention, there is, of course, another EU regulation on the horizon: the EU Cyber Resilience Act, which will undergo a phased implementation culminating in full applicability by 2027. While its primary focus is on building robust security and vulnerability management mechanisms into vendors’ development and post-sale support processes for products with digital elements, it includes aspects relating to consumers’ rights to approve or opt out of automatic updates. It also requires vendors to conduct risk assessments across the entire development lifecycle. In the context of this incident, it’s important that this activity is seen not only as a measure to ensure security vulnerabilities are mitigated prior to release, but also as a means of ensuring updates are fully tested in various environments for interoperability issues. Firms need robust quality control safeguards and incremental rollout procedures in place to prevent such issues from recurring on such a global scale.
