Facebook Provided Details About the Outage



On October 4, 2021, Facebook suddenly went down. The problem extended to Instagram and WhatsApp. In my opinion, this situation might be a good example of why allowing one giant company to continually purchase its competitors is a bad idea. If those services were independent of each other, the problem that made Facebook inaccessible would not have extended to Instagram and WhatsApp.

On the day of the outage, Facebook tweeted: “We’re aware that some people are having trouble accessing our apps and products. We’re working to get things back to normal as quickly as possible, and we apologize for any inconvenience.”

There is something amusing about Facebook having to resort to Twitter in order to connect to people who could no longer access Facebook’s products.

Yesterday, the Facebook Engineering blog posted an article titled: “More details about the October 4 outage”. It was written by Santosh Janardhan. Here are a few key paragraphs from the blog post:

“…This outage was triggered by the system that manages our global backbone network capacity. The backbone is the network Facebook has built to connect all our computing facilities together, which consists of tens of thousands of miles of fiber-optic cables crossing the globe and linking all our data centers.”

“… The data traffic between all these computing facilities is managed by routers, which figure out where to send all the incoming and outgoing data. And in the extensive day-to-day work of maintaining this infrastructure, our engineers often need to take part of the backbone offline for maintenance – perhaps repairing a fiber line, adding more capacity or updating the router itself.

“This was the source of yesterday’s outage. During one of these routine maintenance jobs, a command was issued with the intention to assess the availability of global backbone capacity, which unintentionally took down all the connections in our backbone network, effectively disconnecting Facebook data centers globally. Our systems are designed to audit commands like these to prevent mistakes like this, but a bug in that audit tool prevented it from properly stopping the command…”
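To make the idea of a command audit a little more concrete, here is a minimal sketch in Python. Facebook has not published how its audit tool works, so everything here is an assumption for illustration: the `Command` class, the `audit` and `run_if_safe` functions, and the link names are all hypothetical. The sketch only shows the general idea of checking what a command would disconnect before letting it run.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Command:
    """A hypothetical maintenance command and the backbone links it would take offline."""
    name: str
    links_taken_down: frozenset


def audit(command: Command, all_links: frozenset, min_links_up: int = 1) -> bool:
    """Return True only if enough backbone links would remain up after the command runs."""
    remaining = all_links - command.links_taken_down
    return len(remaining) >= min_links_up


def run_if_safe(command: Command, all_links: frozenset) -> str:
    """Run the command only when the audit approves it (execution is simulated here)."""
    if not audit(command, all_links):
        return f"blocked: {command.name}"
    return f"executed: {command.name}"


# Hypothetical backbone with three links.
backbone = frozenset({"link-a", "link-b", "link-c"})
routine = Command("repair link-a", frozenset({"link-a"}))
risky = Command("assess capacity", backbone)  # would disconnect every link

print(run_if_safe(routine, backbone))  # executed: repair link-a
print(run_if_safe(risky, backbone))    # blocked: assess capacity
```

In this toy version, a bug that made `audit` return `True` no matter what would let the risky command through, which is the general kind of failure the blog post describes.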

A “bug” in Facebook’s own audit tool failed to stop the command that crashed Facebook. This situation makes me think of the horror movies where someone is absolutely terrified and calls 911, only to learn the scary calls they had been receiving came from inside the house.