Large parts of the internet were disrupted July 17, human error to blame
It’s not been the best of weeks as far as cybersecurity goes. A critical, wormable Windows Server vulnerability emerged, prompting the Department of Homeland Security to issue an emergency directive ordering federal agencies to update. Then there was the Twitter hack, of course. So when a whole bunch of popular websites go down at much the same time, many people will assume that a cyber-attack is underway.
That’s what happened on Friday, July 17, when access to sites such as League of Legends, Deliveroo, Discord, Feedly, GitLab, Medium, Patreon, Politico and Shopify was disrupted.
Twitter was suddenly full of users reporting that the internet was down and asking what the heck was going on.
It turned out not to be nation-state threat actors hitting some internet kill switch, but rather a problem with one of the largest companies providing secure domain name system (DNS) services and denial of service protection.
That company is Cloudflare, whose homepage proudly declares that it helps “keep thousands of business online.” So, what went wrong?
What went wrong at Cloudflare?
Cloudflare is used to defeating threat actors and regularly protects customers from massive distributed denial of service (DDoS) attacks. These attacks are increasingly sophisticated, often throwing large resource loads at Cloudflare’s routers and appliances to take sites down. So, perhaps unsurprisingly, the first thought for some was that an attacker had succeeded this time.
That was not the case.
The outage, which started at 9:12 p.m. UTC, was caused by human error. In a July 18 blog entry, Cloudflare CTO John Graham-Cumming said that the cause of the 50% drop in traffic across the network, and the subsequent internet outages, was “a configuration error in our backbone network.”
It appears this was, in effect, a bigger version of someone tripping over the server power supply plug and pulling it out.
In the admirably transparent posting, Graham-Cumming said the Cloudflare engineering team was working on an issue with a segment of the network backbone and updated a router configuration in Atlanta to, ironically, alleviate congestion.
“This configuration contained an error that caused all traffic across our backbone to be sent to Atlanta,” Graham-Cumming said. “This quickly overwhelmed the Atlanta router and caused Cloudflare network locations connected to the backbone to fail.”
As a result, connections in 19 locations across the world were affected: San Jose, Dallas, Seattle, Los Angeles, Chicago, Washington, DC, Richmond, Newark, Atlanta, London, Amsterdam, Frankfurt, Paris, Stockholm, Moscow, St. Petersburg, São Paulo, Curitiba, and Porto Alegre.
Cloudflare’s transparency to be applauded
Cloudflare CEO Matthew Prince, who recently became a billionaire, tweeted that the company had now “applied safeguards to ensure a mistake like this will not cause problems in the future,” while confirming that a simple, but costly, typo in that router configuration caused the problems.
Director of network engineering Jerome Fleury said that there were “lots of lessons learned” and invited people on Twitter to ask him hard questions.
The outage itself lasted only 27 minutes, an eternity to the average internet user, but the resulting core network congestion meant that disruption to services continued for almost an hour in total.
“We are sorry for the disruption to our customers and to all the users who were unable to access Internet properties while the outage was happening,” Graham-Cumming said. “We’ve already made changes to the backbone configuration to make sure that this cannot happen again, and further changes will resume on Monday.”
The consequences of this configuration error were global and will, no doubt, have dented Cloudflare’s reputation. The timely openness and transparency shown by Cloudflare executives on Twitter and the Cloudflare blog should help bash that dent out a bit.