This weekend’s global IT outage caused by a software update gone wrong highlights the interconnected and often fragile nature of modern IT infrastructure. It demonstrates how a single point of failure can have far-reaching consequences.
The outage was linked to a single update automatically rolled out to CrowdStrike Falcon, a ubiquitous cyber security tool used primarily by large organisations. This caused Microsoft Windows computers around the world to crash.
CrowdStrike has since fixed the problem on their end. While many organisations have been able to resume work now, it will take some time for IT teams to fully repair all the affected systems – some of that work has to be done manually.
How could this happen?
Many organisations rely on the same cloud providers and cyber security solutions. The result is a form of digital monoculture.
While this standardisation means computer systems can run efficiently and are widely compatible, it also means a problem can cascade across many industries and geographies. As we’ve now seen in the case of CrowdStrike, it can even cascade around the entire globe.
Modern IT infrastructure is highly interconnected and interdependent. If one component fails, it can trigger a chain reaction that impacts other parts of the system.
As software and the networks it operates in become more complex, the potential for unforeseen interactions and bugs increases. A minor update can have unintended consequences and spread rapidly throughout the network.
As we have now seen, entire systems can be brought to a grinding halt before the overseers can react to prevent it.
How was Microsoft involved?
When Windows computers everywhere started to crash with a “blue screen of death” message, early reports stated the IT outage was caused by Microsoft.
In fact, Microsoft confirmed it experienced a cloud services outage in the Central United States region, which began around 6pm Eastern Time on Thursday, July 18 2024.
This outage affected a subset of customers using various Azure services. Azure is Microsoft’s proprietary cloud services platform.
The Azure outage had far-reaching consequences, disrupting services across multiple sectors, including airlines, retail, banking and media, not only in the United States but also internationally, in countries such as Australia and New Zealand. It also impacted various Microsoft 365 services, including PowerBI, Microsoft Fabric and Teams.
As it has now turned out, the entire Azure outage could also be traced back to the CrowdStrike update. In this case it was affecting Microsoft’s virtual machines running Windows with Falcon installed.
Editor’s note: At the time of writing, reports suggested the Microsoft Azure outage was also caused by the CrowdStrike error. Microsoft has since confirmed these were unrelated events, and the Azure issue has “fully recovered”.