Crowdstrike, or “How to Own the Planet”
I recently wrote about reliable software. I also usually write about cybersecurity and major incidents. Today’s story intertwines both, in a situation so far reaching that, if you tried to write it as the script of the next Bond movie with a villain scheming to cause worldwide chaos, it would fit perfectly.
Let’s look at Crowdstrike’s botched update, and the massive disruption it caused everywhere.
Vacations
It was about midway through summer, and I was due to leave on vacation that same day, when the Twitter-sphere exploded into a flurry of messages from IT people talking about BSODs and downed systems across multiple companies, simultaneously. This was massive, it was unexpected, and it caught most companies during the quiet period when a good chunk of staff is away on well-deserved time off.
As more reports came in, the full scale of the incident came into focus. It was affecting Windows systems, but the common factor was not Windows itself. It was an update to Crowdstrike’s EDR product that caused problems on every single system it was deployed on.
The actual issue
It turns out that, according to Crowdstrike’s information, the problem is a “defect found in a single content update for Windows hosts.”
Sysadmins are reading it as “a botched update just killed my entire Windows fleet and I am going to have to do extra hours to fix this mess.”
So, the silver lining here: if you’re running Linux, your vacation can go ahead as planned (unless your airline was impacted).
Reliable software expectations
The fact that an update to a security tool can break a system and leave it in a state where it can’t be (easily) recovered is a good example of how unreliable software has become. It also says a lot about the model of running drivers with so much privilege in an operating system that a problem in one of them can effectively kill the whole system.
Crowdstrike’s response tries to downplay the incident and spin the story as just one single update that had problems. Sure, but when the problems have consequences all over the world and impact the normal life of millions, downplaying the incident feels out of touch.
Software updates, regardless of what software they apply to, should never cause this. On top of that, since the only requirements for hitting the problem were running Windows and applying the update, it is hard to conclude anything other than that the update was not tested. It’s not an edge case, and it’s not something difficult to replicate: you just had to apply the update to see the problem. No testing done. At all. For a security tool running on millions of systems. And any patch released to “the wild” should at least go out in stages, with a very restricted group of systems receiving it first, validating it, and only then gradually rolling it out to everyone else. It’s a common-sense approach, and it’s obviously not what happened.
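To make the “stages” idea concrete, here is a minimal sketch in Python of what a staged rollout loop could look like. It is not how Crowdstrike (or any real vendor) ships updates; every name in it (plan_stages, staged_rollout, apply_update, is_healthy, and the 1% / 10% / 100% stage sizes) is a hypothetical choice made for illustration. The only point is the shape of the process: update a tiny group, check that it survived, and stop immediately if it didn’t.

```python
from dataclasses import dataclass
from typing import Callable, List, Sequence


@dataclass
class RolloutResult:
    completed: bool      # True only if every stage passed its health check
    updated: List[str]   # hosts that received the update before we stopped
    failed: List[str]    # hosts that failed the health check (empty on success)


def plan_stages(hosts: Sequence[str], percents: Sequence[int]) -> List[List[str]]:
    """Split hosts into stages using cumulative percentages, e.g. (1, 10, 100).

    Each stage gets at least one host while hosts remain. If the last
    percentage is below 100, the remaining hosts are simply not planned.
    """
    stages: List[List[str]] = []
    done = 0
    for percent in percents:
        target = max(done + 1, len(hosts) * percent // 100)
        target = min(target, len(hosts))
        if target > done:
            stages.append(list(hosts[done:target]))
            done = target
    return stages


def staged_rollout(
    hosts: Sequence[str],
    apply_update: Callable[[str], None],  # hypothetical: pushes the update to one host
    is_healthy: Callable[[str], bool],    # hypothetical: e.g. "did it boot and report in?"
    percents: Sequence[int] = (1, 10, 100),
) -> RolloutResult:
    """Update one stage at a time and halt at the first stage with an unhealthy host."""
    updated: List[str] = []
    for stage in plan_stages(hosts, percents):
        for host in stage:
            apply_update(host)
            updated.append(host)
        failed = [host for host in stage if not is_healthy(host)]
        if failed:
            # Stop here: later stages never receive the bad update.
            return RolloutResult(completed=False, updated=updated, failed=failed)
    return RolloutResult(completed=True, updated=updated, failed=[])
```

With a fleet of 100 hosts and an update that breaks everything it touches, this sketch stops after the first stage, so one machine goes down instead of one hundred. A real pipeline would also need waiting periods between stages and a rollback path, which are left out here.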
Impact for users
“Users” of Crowdstrike include airlines, telecom operators, governments, schools, hospitals, heavy industry, gas and oil, trains, communications, IT, …
So you can imagine the result. Flights were delayed or canceled, communications were down, schools couldn’t function normally, factories had to halt production, extraction of gas and oil had to be interrupted because control and safety systems were offline, and trains couldn’t run because they are computer driven. IT departments, the very ones that were supposed to be fixing the problem, were themselves hit and couldn’t operate until their own systems were brought back up.
In an unprecedented event, Sky News, a major UK television broadcaster, interrupted their transmission.
Stock markets opened late and suffered losses.
This had more impact than any previous malware infection ever did.
But the biggest blow was to trust. At the time I’m writing this, on the day it happened, Crowdstrike, the company, has had a 20% hit to their stock market valuation. For all affected companies that are themselves considered critical infrastructure, the decision to switch providers or change the architecture to no longer require this tool has to be on the table.
A lifetime to earn trust but just a moment to lose it? Sounds about right.
In a year already rife with high-impact IT incidents, this is yet another example of why we need a better approach to ensuring software reliability and system security. The old processes don’t cut it anymore. That much is obvious.
If you’re running Linux, then you’re probably feeling pretty smug about it now. And you should. But while it is more difficult for a Linux system to end up in a state like the one Crowdstrike caused, it’s not impossible. Better approaches to patching, like live patching, are easier to set up and test, can be deployed to different groups of systems to confirm the expected results, and are much easier to manage. So even if you’re running Linux, if you’re still doing things the way you did 20 years ago, there’s still room to improve.
Now back to those vacation plans I had…
[If you missed the reference in the title, “Stealing the Network: How to Own a Continent” is a hacking book that is an interesting, if controversial, read]