Amazon experienced an outage affecting users on a global scale. The proof is in the pudding: computers are hard, even for the experts.
Technology is a beautiful and wonderful thing. That’s a pretty definitive statement, so let’s clarify: technology is a beautiful and wonderful thing when it works like it’s supposed to. The problem with technology, even though we humans created it, is that we still don’t fully understand what we have created and are creating. We don’t always know the implications a piece of technology will have, or whether it’s sustainable, until we integrate it and stress test it. Cloud outages are exceedingly rare, but the latest AWS outage proves that computers really are hard.
On the morning of December 7, AWS began to experience problems affecting multiple platforms reliant on its services. From their post-event message shared on their website:
“At 7:30 AM PST, an automated activity to scale capacity of one of the AWS services hosted in the main AWS network triggered an unexpected behavior from a large number of clients inside the internal network. This resulted in a large surge of connection activity that overwhelmed the networking devices between the internal network and the main AWS network, resulting in delays for communication between these networks. These delays increased latency and errors for services communicating between these networks, resulting in even more connection attempts and retries. This led to persistent congestion and performance issues on the devices connecting the two networks.”
Basically, a process that normally runs without causing any problems suddenly didn’t perform as expected. AWS has processes within its infrastructure to scale servers automatically as needed. This is the elasticity of the cloud, and it’s why the businesses that use it can scale their own brands. Again, it’s a very normal process that occurs as needed. But for whatever reason (an innocent code change, a hidden bug exposed by an edge case, an unexpected surge in use, etc.), this time it encountered a problem. That problem resulted in congestion, which was only made worse by users refreshing their Netflix, Disney+, and Roku streams and other platforms experiencing latency.
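That feedback loop, where failed connections trigger retries that pile even more load onto an already congested network, is exactly why well-behaved clients spread their retries out instead of hammering a struggling service in lockstep. As a minimal sketch (not AWS’s actual code; `fetch` here is a hypothetical stand-in for any network call), capped exponential backoff with jitter looks something like this:

```python
import random
import time

def fetch_with_backoff(fetch, max_attempts=5, base=0.5, cap=30.0):
    """Retry a flaky call with capped exponential backoff plus jitter.

    Naive immediate retries are what turn a brief slowdown into a
    congestion storm; spacing retries out, at random offsets, gives
    the network room to recover.
    """
    for attempt in range(max_attempts):
        try:
            return fetch()
        except ConnectionError:
            if attempt == max_attempts - 1:
                raise  # out of attempts; surface the failure
            # Full jitter: sleep a random amount up to the capped
            # exponential delay, so clients don't all retry at once.
            delay = random.uniform(0, min(cap, base * 2 ** attempt))
            time.sleep(delay)
```

The jitter is the important part: if every client backed off by the same fixed schedule, they would all return at the same moment and recreate the very surge they were avoiding.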
Compounding the problem was Amazon’s inability to figure out what caused it, which delayed the fix. The inner workings of AWS also suffered from this outage, impacting its own delivery drivers’ ability to scan packages as well as impairing its engineers’ ability to locate and fix the problem. The outage lasted several hours for many, even taking down a DeFi crypto exchange in the process.
We’ve been talking a lot about how computers are hard. Few things in this world compare to the unpredictability and complexity of technology. We often liken it to construction, because code acts as the bricks of software development. The biggest difference is that in construction, a lot of the time you can predict where the problems will occur: a weak joint in a structural beam, water damage, improperly run electrical lines or plumbing. Those are visible to the naked eye, things we can see and touch. Software development does not offer that. Adding a piece of code to code that is already in place is not the same as replacing a brick in a wall. It’s not just plug-and-play or copy/paste. Code can (and does) act in unpredictable ways. A patch of code is not the same as a patch of drywall; you can’t just sand it and make it look like it never happened. A drywall patch isn’t going to do anything other than patch a hole. But a patch of code on the first floor could blast a hole through a window on the 8th floor, with no apparent rhyme or reason as to why.
Amazon has shown us that even those considered at the top of their field, the employees on staff ensuring its systems stay online and operational, are subject to their own humanity. They cannot predict everything; they cannot account for every little thing, because no one can. There will always be glitches in technology; it will always be imperfect because it was created by imperfect humans. The sooner we understand and accept that technology will always be unpredictable, the more prepared we can be going forward.
As always, review your systems. Make sure everything is up-to-date and patched accordingly. Make sure you bring in an expert to do a review of your business, not just your security. Make sure you’re properly set up in the cloud, that the cloud services you are using are appropriate and cost-effective, that your alerts and monitoring systems are healthy and stable. It’s so easy to make a mistake or miss a problem when you look at it all day, so bring in someone with fresh eyes and let them make sure nothing is missed. But, above all else, make sure you have backups for all mission-critical systems and processes in the event your business is impacted by a cloud outage. They are rare, but they do happen, so make sure you’re prepared!