Everyone has seen the recent news of the outage caused by a bad update to the CrowdStrike agents. This is a good time for everyone to take stock of their own IT systems, disaster planning, and security configurations. First, if you follow Sher-Tech, you know that we generally don’t recommend companies use services like CrowdStrike, because of the danger of exactly this kind of incident and how catastrophic it can be. It will take a great deal of time and manual work for organizations to recover their full systems and bring every last endpoint back online. There is certainly a business case in which the risk of such a catastrophic outcome is outweighed by the more routine risks presented by remote workers, work-from-home arrangements, and everyday employee security configurations. Large enterprises with tens of thousands of endpoints need automated tools to help them manage systems at that scale. However, everyone should be prepared for an event in which a third-party system fails and causes a total loss of service, in this case of the servers and endpoints themselves. Recovering quickly from such a catastrophic event is the responsibility of every board member and executive manager. I strongly encourage everyone to review their own disaster planning and IT systems to identify single points of failure such as the one exposed by today’s CrowdStrike incident.
Secondly, I want to take a moment to highlight some of the ways something like this happens. To be clear, I have no insider knowledge of CrowdStrike and no specific knowledge of how this incident occurred. However, I will point out that this incident could likely have been prevented by simple, human testing in a sandbox environment. More and more, I see software companies moving to AI and other “automated” code review and testing practices before pushing development code into production and out to customers. This is inadequate: it fails to capture how software and hardware interact, and how different software installations interact with one another. Given that, as far as I am aware, every single CrowdStrike endpoint went offline, this is a software issue that should have been caught before it reached production. But that is not what I want to focus on. What I want to focus on is how a business, especially one as large as CrowdStrike, falls into these errors.
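To make the sandbox point concrete, here is a minimal sketch of what a staged-rollout gate can look like: an update goes to a small group of test machines first, and a wider release is only allowed if those canaries stay healthy for a soak period. The hostnames, port, timings, and health check below are hypothetical placeholders for illustration only, not CrowdStrike’s actual process or tooling.

```python
"""
Minimal sketch of a staged-rollout gate (hypothetical hosts and thresholds):
deploy an update to a small canary group, verify the machines stay reachable
for a soak window, and only then allow the wider release.
"""

import socket
import time

# Hypothetical sandbox/canary machines that receive every update first.
CANARY_HOSTS = ["sandbox-ep-01.example.test", "sandbox-ep-02.example.test"]
HEALTH_PORT = 443          # assumed: a port the endpoint should keep serving
SOAK_MINUTES = 30          # assumed: how long canaries must stay healthy
CHECK_INTERVAL_SECONDS = 60


def check_health(host: str, port: int, timeout: float = 5.0) -> bool:
    """Return True if the host accepts a TCP connection within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def canaries_healthy_for(minutes: int) -> bool:
    """Require every canary to pass every check for the full soak window."""
    deadline = time.monotonic() + minutes * 60
    while time.monotonic() < deadline:
        for host in CANARY_HOSTS:
            if not check_health(host, HEALTH_PORT):
                print(f"Canary {host} failed a health check; halting rollout.")
                return False
        time.sleep(CHECK_INTERVAL_SECONDS)
    return True


if __name__ == "__main__":
    if canaries_healthy_for(SOAK_MINUTES):
        print("Canaries stayed healthy for the soak window; widen the rollout.")
    else:
        print("Rollout blocked: fix the update and re-test in the sandbox.")
```

Even a gate this simple forces a human decision point and an observation window before an update touches the rest of the fleet, which is exactly the kind of check that automated review on its own does not provide.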
We have spoken at length on our podcast and in our blog about culture, and about how important corporate culture is to business security. Cutting costs on testing and sandboxing before delivery is an example of short-sighted culture. Whatever the immediate trigger of this specific incident, I am certain that the root cause is a failure of leadership from the top down. Someone prioritized speed and cost savings over good code, best practices, and a good product. I hope that rather than blaming some poor developer, senior management takes a good, hard look in the mirror at how they have fostered a culture that permitted this to happen. Your reputation is the single most important and valuable asset you have as a company. Protecting it should be your number one priority, no matter the cost and no matter the time, because once lost, it takes years to recover. The single most important takeaway from this incident is that if it can happen to a business like CrowdStrike, it can happen to you, and you may not have the resources CrowdStrike has to recover. Everyone should be taking a look at their corporate culture and ensuring they have the checks and balances in place to prevent, detect, and correct human error, to verify the quality of their products, and to foster strong teams that work together to protect the organization and its reputation.