The Global Meltdown: A Look at the CrowdStrike Incident
Introduction
It was 10:33 AM Indian Standard Time (IST) on Friday, 19 July 2024 (05:03 AM UTC), when an innocuous-looking WhatsApp message landed in one of the cyber community groups I am part of. It simply asked, “Anyone seeing issue with Windows OS?”
Very soon, it became clear that this was a CrowdStrike issue causing Windows machines to crash with the Blue Screen of Death (BSOD). Panic set in, with many CrowdStrike users trying to figure out how to avoid the problem, but there was no escape.
Some reported that CrowdStrike had released a Tech Alert at 05:30 AM UTC informing customers that it was aware of the issue and working on it. At this stage, there was no guidance for customers on what to do.
As a result, community members started offering possible solutions. These included:
- Rename the windows\system32\drivers\crowdstrike folder and reboot
- Boot into recovery mode, rename csagent.sys and restart
- A few other variations.
As an unaffected bystander, I felt pain and empathy for those hit by this incident. I could see panic and helplessness rolled together, and yet a determination to somehow find a solution.
With so many untested solutions being touted, there were concerns about how they would affect the CrowdStrike service itself. Would the machines remain protected once they were booted up?
An hour later (06:30 AM UTC), CrowdStrike did publish a workaround: boot into Safe Mode and delete the C-00000291*.sys file in the C:\Windows\System32\drivers\CrowdStrike directory. However, there was no clarity on whether this would stop the service or leave the device unprotected. It is extremely risky to put servers and workstations on the Internet without a working EDR, and valid questions were being asked, but unfortunately no answers were available.
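To make the published workaround concrete, here is a minimal, illustrative Python sketch of the deletion step. It assumes the machine can already reach Safe Mode or the Windows Recovery Environment and that a Python interpreter happens to be available there, which was rarely the case; in practice most admins did this by hand or with a one-line command.

```python
# Minimal sketch of the deletion step from CrowdStrike's published workaround.
# Assumes the machine can boot into Safe Mode / WinRE and that a Python
# interpreter is available there -- often it is not, so treat this purely
# as an illustration of the manual steps.
from pathlib import Path

DRIVER_DIR = Path(r"C:\Windows\System32\drivers\CrowdStrike")

def delete_bad_channel_files(dry_run: bool = True) -> list[Path]:
    """List (and optionally delete) channel files matching C-00000291*.sys."""
    matches = list(DRIVER_DIR.glob("C-00000291*.sys"))
    for channel_file in matches:
        print(("Would delete" if dry_run else "Deleting"), channel_file)
        if not dry_run:
            channel_file.unlink()
    return matches

if __name__ == "__main__":
    # Run with dry_run=True first to confirm exactly what would be removed.
    delete_bad_channel_files(dry_run=True)
```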
During all this, another major outage was being reported in Microsoft services, which added to the confusion among community members.
The other issue was that the suggested workaround was manual, with one frustrated community member remarking:
“workaround is crap. You can’t do it on 10000 servers”.
And then there was the issue of workstations belonging to remote users, whose machines were stuck on the BSOD.
People were trying to come up with ways to apply the fix remotely, at scale, rather than manually working on every affected computer.
There was also discussion about which agent version one should be on: the latest, N-1 or N-2. Then came the dreaded realization that this outage was caused by a “channel file” and it did not matter which agent version you were on.
By this time, all hell was breaking loose, with the aviation sector, banking and many other industries paralyzed and BSOD screens everywhere.
@CyFi10 created a great thread on the disruption this incident caused.
As the incident response moved on, some realized that BitLocker-encrypted drives had their own challenges. Booting into Safe Mode requires the recovery key, and what if the Domain Controllers are also down? There were a few reported instances of standalone BitLocker-protected machines being impacted where the recovery key could not be traced.
Even 12 hours into the incident, many organizations were still trying to get their machines back online. A really bad day for IT and security folks, through no fault of their own.
CrowdStrike’s Response
As mentioned earlier, CrowdStrike informed its customers via a Tech Alert at 05:30 AM UTC that it was aware of the issue. This was updated an hour later to include workaround details, which involved booting into Safe Mode and deleting the C-00000291*.sys file.
There was no public announcement or acknowledgement. Many of their customers learned about these Tech Alerts through WhatsApp conversations.
Their X account, https://x.com/CrowdStrike, was totally silent throughout all of this. The first public response came at 09:45 AM UTC via a post from George Kurtz, the CEO, which was reposted by the CrowdStrike X handle.
The post could not have been blander; after reading it, one would not realize it was referring to a global meltdown. The replies to it are quite telling.
At 05:03 PM UTC, the apology came via a post from George Kurtz and was reposted by the CrowdStrike X handle.
For the first time, this post referred to a blog where CrowdStrike would provide further updates.
On July 20th, CrowdStrike published a blog post with technical details. It stated that “CrowdStrike released a sensor configuration update to Windows systems” on July 19, 2024 at 04:09 UTC, and that the “sensor configuration update that caused the system crash was remediated on Friday, July 19, 2024 05:27 UTC”.
As it turns out, those 78 minutes with one bad configuration file affected approximately 8.5 million Windows devices, according to Microsoft.
The CrowdStrike guidance page now has lots of good details and answers the questions that were being asked by the security community nearly a day earlier.
CrowdStrike knew about the issue by 05:27 AM UTC on July 19, 2024, and yet the only advice for its customers was a curt four-line manual resolution process, with no clarity on how it would affect the protection of their assets or when further updates would be issued.
This led to a lot of Fear, Uncertainty and Doubt (FUD), and the community sought guidance from each other. The community performed admirably: everyone tried to help with whatever information they had or whatever method they had tried and found working in their environment. A few memes were thrown in for good measure 😊.
In my opinion, this was a failure at multiple levels. CrowdStrike has promised that it is undertaking a root cause analysis (RCA) and will share why the problematic channel file “C-00000291*.sys” was released in the first place. Considering the widespread impact, with nearly every Windows machine running the Falcon sensor affected, a good quality check should have caught this before release. But we should wait for the RCA to understand where the failure occurred.
My biggest concern is with the response after the file was rolled out. Even after CrowdStrike knew about the problem at 05:27 AM UTC on July 19, 2024, it seemed to downplay it. Once the uproar over the global meltdown started, the first public statement came more than four hours later and seemed to downplay the incident further. The apology came nearly 12 hours later, and the technical blog was published after almost a day.
While CrowdStrike may claim it was not a cyber incident (for them), and technically that is right, it was a mega incident for their customers, since it adversely affected ‘Availability’.
Business Continuity
Many community members pondered what could be done to avoid this from a business continuity perspective.
Some said DR environments should run a different AV/EDR, while others rightly rejected the idea. There are so many security tools with deep hooks into the operating system; how many different configurations can one realistically maintain?
Some said we should stay on N-1 or N-2. However, in this case, that would not have helped, since all Falcon agent versions from 7.11 onwards were impacted.
A possible solution would be to give customers a bit more control to create a gradual rollout policy for updates, just as we typically do for security patches. Something like the feature in Microsoft Defender, as elaborated at https://learn.microsoft.com/en-us/defender-endpoint/manage-gradual-rollout and explained wonderfully by Fabian Bader in his blog at https://cloudbrothers.info/en/gradual-rollout-process-microsoft-defender/.
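To make the idea concrete, here is a rough Python sketch of what a ring-based (gradual) rollout policy could look like. This is not CrowdStrike's or Microsoft Defender's actual mechanism; the ring sizes, soak periods, health threshold and callback functions are all hypothetical placeholders.

```python
# Illustrative ring-based rollout: push an update to a small canary ring first,
# watch crash telemetry for a soak period, and only then promote it outward.
# None of this reflects any vendor's real implementation.
import time
from dataclasses import dataclass
from typing import Callable

@dataclass
class Ring:
    name: str
    fleet_share: float   # fraction of the fleet in this ring
    soak_hours: float    # observation window before promoting further

RINGS = [
    Ring("canary", 0.01, 4),            # internal / test machines first
    Ring("early_adopters", 0.10, 24),
    Ring("broad", 0.89, 0),
]

def roll_out(update_id: str,
             deploy: Callable[[str, Ring], None],
             crash_rate: Callable[[str, Ring], float],
             max_crash_rate: float = 0.001) -> None:
    """Deploy ring by ring, halting the rollout if crash telemetry spikes."""
    for ring in RINGS:
        deploy(update_id, ring)
        time.sleep(ring.soak_hours * 3600)   # simplified soak period
        if crash_rate(update_id, ring) > max_crash_rate:
            raise RuntimeError(
                f"Rollout of {update_id} halted at ring '{ring.name}': "
                "crash rate above threshold")

# Example wiring with dummy callbacks:
# roll_out("C-00000291", deploy=lambda u, r: print("deploy", u, "->", r.name),
#          crash_rate=lambda u, r: 0.0)
```

The point is not the code itself but the control it implies: a bad update should hit a small, observable slice of the fleet first, and customers should be able to choose which slice that is.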
Maybe other security vendors with deep hooks into operating systems should offer similar features, or maybe some already do. Customers should demand this going forward.
The Fallout
CrowdStrike as a tool or solution is awesome. It rightfully rates among the best EDR/XDR solutions, which is why it has such a large global footprint. But with just one mistake lasting 78 minutes, it impacted approximately 8.5 million devices and caused a global meltdown.
However unfortunate, this could happen to any vendor. For me, the letdown was the response, which could have been far more transparent and far better handled. I hope they have learnt the right lessons and change for the better.
I wish them all the best and hope they come through this. They must empathize with their customers, who suffered badly through no fault of their own.
For the affected customers, here are a few things to consider:
- If you bulk-exported BitLocker recovery keys, do you know everyone who has access to them? It is a good idea to bulk-rotate the keys after incident response is complete. One possible method is described at https://ourcloudnetwork.com/how-to-rotate-bitlocker-keys-with-microsoft-graph-powershell/ (see the rough sketch after this list).
- If you used local admin accounts to log in and delete the .sys file, make sure you rotate their passwords through LAPS or another tool.
- If you are still recovering machines, Microsoft has released a new recovery tool.
- Beware of attempts by malicious actors to cause further damage. CrowdStrike reported early attempts by eCrime actors and a list of fake domains created after the July 19th incident.
- After you are done with recovery and have rested, check out the memes about this issue on the Internet. Some of them are hilarious and quite enjoyable.
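On the BitLocker point above, the sketch below shows what bulk key rotation can look like via Microsoft Graph, in the spirit of the PowerShell approach linked in that bullet. It assumes the Intune rotateBitLockerKeys action (on the beta Graph endpoint at the time of writing) and the corresponding privileged device-management permission; verify both against the current Graph documentation before using anything like this.

```python
# Rough sketch: request a BitLocker key rotation for a set of Intune-managed
# devices via Microsoft Graph. The rotateBitLockerKeys action and the required
# permission (DeviceManagementManagedDevices.PrivilegedOperations.All) should
# be verified against current Microsoft Graph documentation -- treat this as
# an outline, not a tested tool.
import requests

GRAPH_BASE = "https://graph.microsoft.com/beta"

def rotate_bitlocker_keys(device_ids: list[str], access_token: str) -> None:
    """Trigger a BitLocker key rotation on each managed device."""
    headers = {"Authorization": f"Bearer {access_token}"}
    for device_id in device_ids:
        url = f"{GRAPH_BASE}/deviceManagement/managedDevices/{device_id}/rotateBitLockerKeys"
        response = requests.post(url, headers=headers)
        response.raise_for_status()
        print(f"Key rotation requested for device {device_id}")
```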
My salute to the IT and security teams who toiled through the day and night to get services back online. One such story is below.
Conclusion
July 19, 2024 will go down in history as perhaps the biggest IT outage for a long time to come. The NotPetya attack in 2017 was, until now, considered the most devastating, with more than $10 billion in damages; by some estimates, it impacted about 300,000 devices across 2,300 organizations in 100 countries. The CrowdStrike issue was not a cyber attack, yet it impacted 8.5 million devices. The direct and indirect losses are hard to imagine. Hopefully someone will calculate them in the future, and I suspect they will exceed the damages caused by NotPetya by a huge margin.
I feel this incident will shake up the industry, and critical service providers such as Microsoft and CrowdStrike will have to come up with better, more resilient solutions and processes. This is not just a CrowdStrike issue: all EDR/XDR solutions have deep hooks into the operating system, push out patches and updates frequently, and are susceptible to similar incidents.
The industry must come together to find a better way forward. I am hopeful that we will learn the right lessons from this incident and, as the post-mortems continue, identify ways to plug the weaknesses.
In the age of AI, it was the human spirit that won. I saw teams rising to the occasion, doing their best to restore machines and services while prioritizing the critical ones, and community members offering help to others. There was, as expected, a lot of noise on social media, plenty of incorrect information passed off as expert conclusions, and lots of memes, which was great.