
Why the Tech Industry Refuses to Learn From Disastrous Outages

It’s easier to identify a single point of failure than to get companies to diversify their IT away from it.

A different kind of blue screen of death: A passenger looks at a flight information board showing multiple delays and some cancellations in flight departures from Dulles International Airport. (Roberto Schmidt/Getty Images)

Pictures of stranded travelers at airports looking mournfully at rows of monitors frozen on the same Windows “blue screen of death” may look like a new level of tech dystopia. But the fallout from Friday’s worldwide outage, sparked by a botched update from the security provider CrowdStrike, is better viewed as a sequel to a movie we’ve seen before—which, like so many cinematic productions, is just the latest installment in an ongoing series.

Tech experts have warned for decades against relying too much on any one company’s software or services, lest their potential blast radius grow too large when attackers exploit a vulnerability or something else goes wrong.

Back in 2003, seven computer security experts wrote a paper titled “CyberInsecurity: The Cost of Monopoly”—commissioned by the Computer & Communications Industry Association, a Washington tech trade group whose members objected to Microsoft’s abuses of its market power—to outline that risk.

“Most of the world’s computers run Microsoft’s operating systems, thus most of the world’s computers are vulnerable to the same viruses and worms at the same time,” they wrote. “The only way to stop this is to avoid monoculture in computer operating systems, and for reasons just as reasonable and obvious as avoiding monoculture in farming.”

A decade ago, Columbia University law professor and tech policy expert Tim Wu wrote in these pages that the Heartbleed web vulnerability showed how tech monocultures were not only a big-company problem. That flaw in an open-source component that helps encrypt connections to web pages didn’t implicate Google or Microsoft and instead resulted from a distributed team of developers missing a bug for two years. “As we centralize more, and put more of our lives online and into consolidated accounts, the damage from being compromised is greater,” wrote Wu, an adviser to the Obama and Biden administrations.

And last month, a tech executive speaking at an event in Washington said the industry needed to stop pretending this risk doesn’t exist: “We can no longer tolerate solutions or architectures that risk crumbling from a single point of failure.” That speaker was CrowdStrike’s vice president and counsel for privacy and cyber policy Drew Bagley, who gave a talk sponsored by the Austin, Texas, company at a Washington Post “Securing Cyberspace” event on June 6.

The CrowdStrike crisis, however, flips the monoculture script a little: Instead of an overlooked vulnerability rendering entire companies or even industry sectors open to hacking—see, for example, contemporary ransomware attacks—a defensive system wound up attacking its hosts. In CrowdStrike’s case, a botched automatic update carried a bug that intersected catastrophically with the deep, kernel-level privileges Microsoft built into Windows for security tools, sending PCs and servers into loops of failed reboots.
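To picture the failure mode in miniature, here is a deliberately simplified sketch in Python. (CrowdStrike’s sensor is really a privileged Windows component, not a Python script, and every name, field, and file below is invented for illustration.) An agent that blindly trusts its content updates falls over on the first malformed file; one that validates them and keeps a last-known-good copy keeps running.

    import json

    # Hypothetical last update that is known to have worked.
    LAST_KNOWN_GOOD = {"version": 41, "rules": ["block-known-bad-domain"]}

    def load_update_naively(raw_bytes: bytes) -> dict:
        # The fragile pattern: trust the update blob completely. A malformed
        # file raises here; in privileged code, the equivalent failure takes
        # down the whole machine rather than a single process.
        update = json.loads(raw_bytes)
        return {"version": update["version"], "rules": update["rules"]}

    def load_update_defensively(raw_bytes: bytes) -> dict:
        # Validate first, and fall back to the last known-good rules instead
        # of failing outright.
        try:
            update = json.loads(raw_bytes)
            if not isinstance(update, dict) or not isinstance(update.get("rules"), list):
                raise ValueError("malformed update")
            return update
        except (ValueError, KeyError):
            return LAST_KNOWN_GOOD

    # A zero-filled blob, roughly what a truncated or corrupted push looks like.
    bad_blob = b"\x00\x00\x00\x00"
    print(load_update_defensively(bad_blob)["version"])  # prints 41: the old rules
    # load_update_naively(bad_blob) would raise instead, the user-space
    # analogue of a blue screen.

The particulars are invented, but the asymmetry is not: code granted that much access to the operating system does not get a second chance when it trips over bad data.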

CrowdStrike does not have close to a Microsoftesque lock on the market—it held only 18.5 percent of the endpoint-security market in the second quarter of 2023, per data from the market research firm Canalys. But that still represents a nontrivial chunk of the IT landscape, and experts are already calling Friday’s incident “the largest IT outage in history.”

“The incident involving CrowdStrike is a good reminder of how interconnected our technology is today with impacts extending beyond the digital realm,” wrote Brandon J. Pugh, director of cybersecurity and emerging threats at the R Street Institute, a Washington think tank. He called it one of “multiple examples of how intertwined some IT and cyber products have become across the globe and how a problem with one can lead to effects well beyond its core offering.”

And something similar could happen with another security vendor’s product, especially one that interacts with Microsoft’s near-ubiquitous Windows in similar ways. “It was a perfect storm of a faulty update presumed safe and deployed automatically at scale,” wrote Katie Moussouris, founder and CEO of Luta Security. “This could happen with any security content update from any vendor.”

She predicted that companies will respond by subjecting automatic updates from security vendors to the same testing as nonsecurity software revisions—“a whole new daily testing task for overburdened IT departments to prevent recurrence in the future.”
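In practice, that testing often takes the shape of a staged rollout: push each update to a small canary group first, confirm those machines stay healthy, and only then release it everywhere. A minimal sketch of the idea, with invented names, numbers, and thresholds:

    import random

    def fraction_healthy(machines: list) -> float:
        # Hypothetical health check: share of machines that took the update
        # and are still booting and reporting in.
        return sum(1 for m in machines if m.get("boots_ok", False)) / len(machines)

    def staged_rollout(update_id: str, fleet: list,
                       canary_fraction: float = 0.01,
                       required_health: float = 0.999) -> bool:
        # Install on a small canary ring first; promote to the rest of the
        # fleet only if the canaries stay healthy.
        canaries = random.sample(fleet, max(1, int(len(fleet) * canary_fraction)))
        for m in canaries:
            m["update"] = update_id                      # stand-in for installing it
            m["boots_ok"] = m.get("update_is_good", True)
        if fraction_healthy(canaries) < required_health:
            print(f"{update_id}: canaries unhealthy, halting rollout")
            return False
        for m in fleet:
            m["update"] = update_id
        print(f"{update_id}: promoted to all {len(fleet)} machines")
        return True

    # A bad update lands on roughly 100 canaries instead of all 10,000 endpoints.
    fleet = [{"update_is_good": False} for _ in range(10_000)]
    staged_rollout("content-update-42", fleet)

The cost is the one Moussouris points to: every extra testing ring means protections arrive a little later and someone has to watch the canaries.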

Then again, tech monoculture isn’t just a security problem. Google’s overwhelming share of the search and advertising markets—now the subject of multiple antitrust lawsuits—leaves websites and publications dangerously vulnerable to even minor changes to its systems. Meta’s outsize share of social media means that when its content-moderation systems misfire, they can silence people across multiple networks and render a page, or even an entire site, unsharable across its properties. And on an individual scale, an iPhone can unlock so many apps and services that thieves have become aggressively inventive at finding ways to steal not only a victim’s phone but also the passcode that unlocks it.

But while it can be easy to find examples of how wide use of any one tool can cause cascading problems, finding a fix for them is much harder.

There’s widespread agreement that resiliency is a worthy goal—as the White House’s acting national cyber director Kemba Walden said in a talk at the Black Hat security conference in Las Vegas last August, “We have to invest in the resilience of cyberspace.”

Inconveniently enough, resiliency not only tends to translate into inefficiency; it requires embracing inefficiency as a virtue.

“This drive for efficiency leads to brittle systems that function properly when everything is normal but break under stress,” security researcher Bruce Schneier, one of the authors of the 2003 paper, wrote a few months into the pandemic in 2020. “If we want to be secure against these crises and more, we need to add inefficiency back into our systems.”

Wu offered a similar prescription to TNR readers in his 2014 piece, endorsing “more diversity and more competition at every level, even among encryption standards.”

Pugh advised that reducing monoculture risk “requires having redundancies in place and actually testing and training on them should a disruption occur, regardless of what the cause might be.”

Selling that to shareholders can be a stretch, though.

“Individual companies or organizations will have a difficult time fighting the underlying economic and business operational forces that drive the level of IT concentration,” said Michael Daniel, president and CEO of the Cyber Threat Alliance, in an emailed statement. “The benefits that flow from interoperability, standardization, and scale are substantial and drive firms to utilize a small set of vendors.” Luta Security’s Moussouris went further, calling monoculture an “inevitable reality.” As she wrote: “There are only a few operating systems, so we’re effectively in a tiny gene pool of base software no matter what.”

A former cybersecurity adviser to Obama, Daniel endorsed having government set standards—something the Biden administration has attempted to do without the help of legislation through such workarounds as adding stronger security requirements to government IT contracts.

“The specific problem of concentration risk will likely require government action to address, given the underlying economic and business structures,” he wrote. “Such actions could include minimum interoperability standards, regional segmentation, and graceful functional degradation.”
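The last item on that list, graceful degradation, is the easiest to picture: when a dependency fails, fall back to a reduced but still-working mode rather than failing outright. A hypothetical sketch, with the provider, cache, and field names all invented:

    def fetch_live_threat_feed() -> list:
        # Stand-in for a call to a single upstream provider that can fail.
        raise TimeoutError("upstream provider unreachable")

    # Hypothetical locally cached copy, refreshed whenever the live feed works.
    CACHED_FEED = ["known-bad-domain.example", "another-bad-domain.example"]

    def get_threat_feed() -> tuple:
        # Prefer the live feed, but degrade to the cached copy instead of
        # taking the whole service down when the provider is unavailable.
        try:
            return fetch_live_threat_feed(), "live"
        except (TimeoutError, ConnectionError):
            return CACHED_FEED, "degraded (cached copy)"

    feed, mode = get_threat_feed()
    print(f"serving {len(feed)} indicators in {mode} mode")

None of this requires exotic technology; it requires deciding in advance what a system should still do when its most important supplier suddenly isn’t there.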

In other words: If you can’t have a backup for everything, have a plan to limit the damage when things do go sideways. Because if we’ve learned anything over the last couple of decades of software misfortunes, it’s that there’s always a next time.