Cloudy With a Chance of Outages: Amazon Web Services Disruptions Continue

Another day, another staggering mess for Amazon’s cloud-based web service, on the fritz since yesterday and causing chaos for sites like Foursquare, Hootsuite, Quora, and Reddit. As of this post, it looks like Reddit’s partly back, as is Box Office Mojo, but determining who’s up or not (or by how much) is like hunting for needles in a haystack.

We’re seeing a bit of the fallout ourselves here at Techland: Our activity-tracking service says it’s “still recovering from ongoing issues with Amazon Web Services” and that we might be missing historical data. Not a huge deal at the moment, but an indication of just how wide-reaching the outage is.

Unlike Sony, whose PlayStation Network has been down since Wednesday evening (the two events are unrelated), Amazon’s actually tried to explain in some detail what went haywire yesterday, and why.

According to Amazon, “a networking event” in its Northern Virginia data center caused a bunch of servers to start redundantly backing things up, which chewed through server space, and, well, you can imagine what happened next. Not that you need to if you’re one of the afflicted. Sites are coming back online, but it’s hit and miss.

And it’s not just bad news for customers: The egg on Amazon’s face is staggering. Amazon touted its service as fully redundant, claiming it was isolating servers in different “availability zones,” meaning an outage in one wouldn’t impact another. So much for that claim, or the notion that with AWS, customers could fire and forget.

According to FathomDB’s Justin Santa Barbara (via SFGate), it’s Amazon’s fault for not following its own rules: “Whether that happened through incompetence or dishonesty or something a lot more forgivable entirely, we simply don’t know at this point.”

Welcome to your “cloud”-based future, where handing control of your services to external providers (like Amazon) remains a risky proposition. The lesson here isn’t that cloud computing doesn’t work (it does, and events like this only ensure it’ll get better) but that cloud-computing clients still need disaster recovery plans in the event that error, incompetence, or fate kicks the legs out from under the table.