Facebook explains 'worst outage in over four years'

Error handling system spiralled out of control

Facebook engineering director Robert Johnson has explained what went wrong in what he called "the worst outage we've had in over four years."

The world's most popular web site was down from around 8.30pm until midnight British time, following a series of slow-downs and outages over the previous two days. The social networking giant came back up with an apology to its 500 million users and an explanation that an automated system was at fault.

"The key flaw that caused this outage to be so severe was an unfortunate handling of an error condition. An automated system for verifying configuration values ended up causing much more damage than it fixed," said Facebook engineering director Robert Johnson.

Johnson went into detail about the system, which was designed to rectify invalid "configuration values" in the cached version of the web site. Enormously popular web sites like Facebook rely on a network of local copies of content in order to deliver speedy web access. The software safeguard went haywire when the configuration in the upstream 'persistent store' was itself incorrect.

The software system entered a feedback loop in which an ever-escalating number of database queries quickly overwhelmed even the capacity of Facebook's formidable array of server hardware.
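
The loop Johnson describes can be illustrated with a toy simulation. This is a minimal sketch and not Facebook's code: the function names, the single "config" key and the validity rule are all hypothetical, and it assumes only what the article reports, namely that a client which finds an invalid cached value discards it and goes back to the persistent store.

```python
# Toy model of the feedback loop. All names are hypothetical illustrations.

def is_valid(value):
    # Assumption for illustration: a config value is valid if it is a positive integer.
    return isinstance(value, int) and value > 0


def simulate(persistent_value, num_requests):
    """Return how many database queries num_requests client reads cause."""
    cache = {}        # stands in for the cache tier
    db_queries = 0

    for _ in range(num_requests):
        value = cache.get("config")
        if value is None or not is_valid(value):
            # Cache miss, or the automated check rejected the cached value:
            # drop the key and go back to the persistent store.
            cache.pop("config", None)
            db_queries += 1
            value = persistent_value
            if is_valid(value):
                cache["config"] = value
            # If the persistent value is itself invalid, nothing usable is
            # ever cached, so the next request repeats the database query.
    return db_queries


healthy = simulate(persistent_value=10, num_requests=1000)
broken = simulate(persistent_value=-1, num_requests=1000)
print(healthy, broken)  # prints "1 1000"
```

With a valid persistent value, one query fills the cache and the remaining 999 reads never touch the database. With an invalid one, every read becomes a database query, which is the kind of amplification that can swamp a database cluster.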

"The way to stop the feedback cycle was quite painful - we had to stop all traffic to this database cluster, which meant turning off the site. Once the databases had recovered and the root cause had been fixed, we slowly allowed more people back onto the site."

Facebook has disabled the faulty software system and is looking at new designs which would "deal more gracefully with feedback loops and transient spikes," Johnson said. 
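
Johnson did not say what the new designs would look like. As a generic illustration only, one widely used way to keep a failing backend from being hammered is a circuit breaker, which stops sending requests after repeated failures and tries again only after a cooldown. The class below is a hypothetical sketch, not anything Facebook has described, and its parameters (`max_failures`, `cooldown`) are assumptions.

```python
class CircuitBreaker:
    """Blocks calls to a backend after too many consecutive failures (hypothetical sketch)."""

    def __init__(self, max_failures=3, cooldown=30.0):
        self.max_failures = max_failures
        self.cooldown = cooldown   # seconds to wait before trying the backend again
        self.failures = 0
        self.opened_at = None      # time the breaker last tripped, or None if closed

    def allow(self, now):
        """Return True if a call to the backend may be attempted at time `now`."""
        if self.opened_at is None:
            return True
        # Once the cooldown has elapsed, calls are allowed again; a further
        # failure re-opens the breaker immediately.
        return now - self.opened_at >= self.cooldown

    def record(self, success, now):
        """Record the outcome of a call made at time `now`."""
        if success:
            self.failures = 0
            self.opened_at = None
        else:
            self.failures += 1
            if self.failures >= self.max_failures:
                self.opened_at = now


# Usage: one request per simulated second against a backend that always fails.
breaker = CircuitBreaker(max_failures=3, cooldown=30.0)
sent = 0
for t in range(100):
    if breaker.allow(now=t):
        sent += 1
        breaker.record(success=False, now=t)
print(sent)  # prints 6
```

In this run only 6 of the 100 requests reach the failing backend: three before the breaker trips, then one trial call after each 30-second cooldown. The rest are refused locally, giving the backend room to recover.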

Johnson apologized for the site outage and said that Facebook takes performance and reliability "very seriously."
