Meta’s platforms disappeared from the internet this past October during a routine process. Computers are hard, though, and the fix was not so routine.
On Monday, October 4th, Facebook, Instagram, WhatsApp and Oculus VR systems suddenly stopped working. The social media giant quite literally disappeared from the internet around 11:40 a.m. Eastern time. The problem appears to have been a misconfiguration introduced during a routine update, but the fix wasn’t quite so simple, taking about 6 hours to complete. Computers are hard, even for big tech companies.
The outage was Facebook’s largest and longest since 2019, when the platform was down for 24 hours, but the bigger story is the impact on small businesses around the world. Most of us in the U.S. have other methods of communicating with friends and family, but in many countries, the Meta family of platforms is the only method of communication people have. Without access to those platforms, businesses lost their sole means of selling to customers, communicating with clients, retrieving orders and more. Facebook’s share price dropped that day, reportedly cutting roughly $6 billion from Mark Zuckerberg’s net worth in a matter of hours, but Facebook will bounce back. The same may not be true for others.
According to their status page, “Our engineering teams have learned that configuration changes on the backbone routers that coordinate network traffic between our data centers caused issues that interrupted this communication. This disruption to network traffic had a cascading effect on the way our data centers communicate, bringing our services to a halt.”
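The outage was initially described as a DNS issue, but that turned out to be a symptom of the problem, which actually stemmed from a routine update that disrupted Facebook’s Border Gateway Protocol (BGP) routing. BGP is how networks tell one another which routes lead to which addresses, so it works like a road map between data centers and the rest of the internet. Basically, the update caused a misconfiguration that withdrew Facebook’s entries from this road map, so the rest of the internet could no longer reach the servers that translate Facebook’s domain names into addresses, and no device knew how to locate the domain.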
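To make the road map idea a little more concrete, here is a minimal Python sketch. It is a toy model, not real BGP and not Facebook’s actual setup: the class name, the router label, the host name and the addresses (taken from ranges reserved for documentation) are all assumptions made up for illustration. It only shows the general idea that once a route is withdrawn, a name lookup that depends on it has nowhere to go.

```python
import ipaddress


class ToyRoutingTable:
    """Toy stand-in for the routes a network has learned (not real BGP)."""

    def __init__(self):
        # Maps a network prefix to a label for the next hop.
        self.routes = {}

    def announce(self, prefix, next_hop):
        """Add a route, as if a neighbor had advertised this prefix."""
        self.routes[ipaddress.ip_network(prefix)] = next_hop

    def withdraw(self, prefix):
        """Remove a route, as if the advertisement had been taken back."""
        self.routes.pop(ipaddress.ip_network(prefix), None)

    def lookup(self, address):
        """Return the next hop for the most specific matching prefix, or None."""
        addr = ipaddress.ip_address(address)
        matches = [net for net in self.routes if addr in net]
        if not matches:
            return None
        best = max(matches, key=lambda net: net.prefixlen)
        return self.routes[best]


def resolve(table, nameserver_ip, records, hostname):
    """Pretend DNS lookup: it only works if the name server is reachable."""
    if table.lookup(nameserver_ip) is None:
        raise LookupError(f"no route to name server {nameserver_ip}")
    return records[hostname]


if __name__ == "__main__":
    # Hypothetical values: documentation-only addresses and a made-up host.
    table = ToyRoutingTable()
    table.announce("192.0.2.0/24", "example-edge-router")
    records = {"example.test": "198.51.100.7"}

    print(resolve(table, "192.0.2.53", records, "example.test"))

    # The "routine update" goes wrong and the route is withdrawn.
    table.withdraw("192.0.2.0/24")
    try:
        resolve(table, "192.0.2.53", records, "example.test")
    except LookupError as err:
        print("lookup failed:", err)
```

Notice that in this sketch the name records themselves are never deleted. The lookup fails only because there is no longer a path to the server holding them, which is roughly why the outage looked like a DNS problem at first.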
Complicating matters, Facebook uses its own tools to run its business, so many of its internal systems were nonfunctional as well, which left the company unable to fix the problem remotely. The problem had to be fixed at a physical data center in California, and the people on-site were not the people who knew what to do, according to a Reddit user. Once that problem was solved, Facebook was able to get its platforms back up and running, albeit 6 hours later.
Facebook is a social media giant, a big tech company that many of us rely on for a variety of uses every day. It likely has a knowledgeable, well-trained staff that is not prone to making mistakes, but the outage shows that events like this can happen to anyone. And it’s not always a simple fix, especially once something is already broken. Luckily for its billions of users around the world, Facebook managed to regain functionality.
No matter how long you’ve been in business, how big your company is or how amazing your employees are, sometimes things go wrong. Sometimes outcomes are nothing like we anticipate, and sometimes routine maintenance isn’t so routine. There are a lot of lessons to learn from the Facebook outage, but there are two major takeaways: Computers are hard, and this could happen to any business at any time. In this case, the problem was fixable, but that doesn’t mean the next problem will be, or that the next company to suffer will be able to say the same.