Earlier this month, Fastly suffered an outage that affected 85% of its network. Turns out, computers are hard: one customer managed to bring down much of the internet with a valid change.
We’re going to take a break from the mundane barrage of cybersecurity news and information. Not only does tech news feel like a broken record with all of the cybersecurity issues, but the answer always seems to be the same: security can’t wait. Proper testing of apps and services is imperative prior to launch or deployment. Bring in an expert. Etc. Have you pictured Ben Stein yet? No, today we’re going to take a break from the monotony and remind ourselves that computers are hard. Earlier this month, the Content Delivery Network (CDN) Fastly suffered an outage that affected a large portion of the internet. It turns out that outage was caused by a single customer pushing a configuration change.
Yep, one person took down 85% of Fastly’s network. He merely triggered a bug in the latest software update by providing the exact conditions it needed to activate. And that customer wasn’t even doing anything wrong; it was a valid configuration. It just happened to cause a major problem. From Fastly’s blog post, written by Nick Rockwell, Senior Vice President of Engineering and Infrastructure:
“On May 12, we began a software deployment that introduced a bug that could be triggered by a specific customer configuration under specific circumstances.
Early June 8, a customer pushed a valid configuration change that included the specific circumstances that triggered the bug, which caused 85% of our network to return errors.”
According to the timeline posted, the problem was detected within one minute, and in less than an hour (49 minutes, to be exact), 95% of the network was operating as normal. Fastly has already created and deployed a patch, but that does not lessen the impact. Fastly’s customers include news media outlets like CNN, The New York Times, and the BBC, as well as sites like Twitch, Reddit, Spotify, and more. All of them saw downtime during this outage. Which means that, despite the fact that it happened before 6 a.m. EST, people noticed.
This is yet another instance that illustrates the difficulty of dealing with computers and code. This particular bug went undetected through all pre-deployment assessments, which really just means that the exact parameters needed to trigger it weren’t used in testing. It’s not that it should have been checked; it’s that it’s impossible to account for the infinite number of possibilities that can happen. The update Fastly sent out on May 12 seemed to be just fine until June 8, almost a month later, when someone finally gave the system exactly what it needed to shut down.
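To see how a perfectly valid input can lie in wait for weeks, here is a minimal, entirely hypothetical sketch (none of these names or config keys come from Fastly’s actual code): a latent bug that only fires when two individually valid settings show up in one specific combination that nobody happened to test.

```python
# Hypothetical illustration of a latent bug triggered by a valid config.
# The "shield" and "origin" keys are invented for this example.

def route_request(config: dict) -> str:
    """Pretend request router shipped with a bug hiding in one branch."""
    # Each setting below is valid on its own; pre-deployment tests covered
    # them individually, but never this exact combination.
    if config.get("shield") and config.get("origin") is None:
        # Latent bug: this branch wrongly assumes an origin is always set.
        raise RuntimeError("503: no origin resolved")
    return f"routed via {config.get('origin', 'default')}"

# Passes every test someone thought to write before deployment:
assert route_request({"origin": "eu-west"}) == "routed via eu-west"
assert route_request({"shield": True, "origin": "us-east"}) == "routed via us-east"

# Weeks later, a customer pushes a perfectly valid config nobody tested:
try:
    route_request({"shield": True})  # valid input, but it trips the bug
except RuntimeError as err:
    print(err)
```

The point isn’t the specific bug; it’s that the test suite above looks complete right up until one untested combination of valid values arrives.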
Computers are hard. As consumers, we expect everything we do on a computer, or with any technology really, to just work like it’s supposed to. We expect systems to do things they may not be designed to do, and then we blame the developer when something breaks because some vanishingly small possibility actually happened. We must remember that technology in any form is just a machine. It’s a tool. Today’s tools can certainly do a lot of work for us, and we rely on them daily, as we should. But every tool has a limit, and these particular tools are wildly unpredictable.
They are unpredictable because we are only on the cusp of a technological revolution. Some might say we are already in the middle of it, but given the rate at which we innovate, develop new tools, and learn new ways to advance the human race, the revolution has only begun. Humans are competitive: we want to be the best at what we do, and we want recognition for it. That, spurred by the desire to figure out the latest puzzle, will keep us on this path. But we must always remember that computers are hard. Sometimes, when you hammer a nail on the 3rd floor, something breaks on the 12th floor. The best way to tackle this is to understand that things are going to break and to have a backup plan in place for when they do.