This past week I got a call in the middle of the night from my team that a major web site we operate had gone down. The reason: Amazon’s EC2 service was having issues.
This is the outage that famously interrupted access to web sites ordinarily visited by millions of people, knocked Reddit alternately offline or into emergency read-only mode for about a day (or more?), and was covered by the Wall Street Journal, MSNBC and other major news outlets.
In the Northern Virginia region where the outage occurred and where we were hosted, Amazon divides the EC2 service into four availability zones. We were unlucky enough to have the most recent copies of crucial data in exactly the wrong availability zone, which made an immediate, graceful fail-over to another zone nearly impossible because the data was not retrievable at the time. Nor could we immediately shift to another region, because our AMIs (Amazon Machine Images) were stuck in the crippled Northern Virginia region.
Procedures for migrating to another region were in the works but not yet established. Having some faith in Amazon’s engineering team, we decided to stand pat. Our belief was that by the time we took mitigating measures, Amazon’s services would be back to life anyway. And … that proved to be true to the extent that we needed.
The lessons learned are these:
(1) Replicate your data across multiple Amazon regions
(2) Do (1) with your machine images and configuration (a scripted sketch of both follows this list)
(3) For extra safety, do (1) and (2) with another cloud provider as well
(4) It’s probably a good idea to also keep an off-cloud backup
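For lessons (1) and (2), the replication can be scripted. Below is a minimal sketch using the boto3 SDK; the region names, resource IDs, and image name are placeholders for illustration, not our actual setup. It copies an EBS snapshot and an AMI from the affected region into a standby region.

```python
# Sketch: copy EBS snapshots and AMIs from the primary region to a
# standby region so that a regional outage cannot strand them.
# Region names and IDs are placeholders; assumes AWS credentials are
# already configured for boto3.
import boto3

PRIMARY = "us-east-1"   # Northern Virginia, the affected region
STANDBY = "us-west-1"   # any other region will do

# Cross-region copies are issued against the destination region.
dest = boto3.client("ec2", region_name=STANDBY)

def replicate_snapshot(snapshot_id: str) -> str:
    """Copy one EBS snapshot into the standby region."""
    resp = dest.copy_snapshot(
        SourceRegion=PRIMARY,
        SourceSnapshotId=snapshot_id,
        Description=f"DR copy of {snapshot_id} from {PRIMARY}",
    )
    return resp["SnapshotId"]

def replicate_image(image_id: str, name: str) -> str:
    """Copy one AMI (machine image) into the standby region."""
    resp = dest.copy_image(
        SourceRegion=PRIMARY,
        SourceImageId=image_id,
        Name=f"{name}-dr",
    )
    return resp["ImageId"]

if __name__ == "__main__":
    # Hypothetical IDs; in practice these would come from tags or an inventory.
    print(replicate_snapshot("snap-0123456789abcdef0"))
    print(replicate_image("ami-0123456789abcdef0", "web-frontend"))
```

Run on a schedule, a script along these lines keeps recent copies of both data and machine images outside the primary region, which is exactly what we were missing.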
Had we already done just (1) and (2), our downtime would have been measured in minutes, not hours, as one of our SAs flipped a few switches… all WHILE STAYING on Amazon systems. Notice how Amazon’s own shopping site never seemed to go down? I suspect they do this.
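One plausible form of that switch-flipping is a DNS update that repoints the public hostname at capacity already running in the standby region. Here is a minimal sketch using the boto3 Route 53 client; the hosted-zone ID, record name, and standby endpoint are hypothetical.

```python
# Sketch: "flip the switch" by repointing DNS at the standby region.
# The hosted-zone ID, record name, and endpoint below are placeholders.
import boto3

route53 = boto3.client("route53")

def fail_over(zone_id: str, record_name: str, standby_endpoint: str) -> None:
    """UPSERT the public CNAME so traffic flows to the standby region."""
    route53.change_resource_record_sets(
        HostedZoneId=zone_id,
        ChangeBatch={
            "Comment": "Fail over to standby region",
            "Changes": [{
                "Action": "UPSERT",
                "ResourceRecordSet": {
                    "Name": record_name,
                    "Type": "CNAME",
                    "TTL": 60,  # a low TTL makes the cutover propagate quickly
                    "ResourceRecords": [{"Value": standby_endpoint}],
                },
            }],
        },
    )

# Example: point www at a load balancer already running in the standby region.
fail_over("Z0HYPOTHETICALZONE", "www.example.com.",
          "standby-lb.us-west-1.elb.amazonaws.com")
```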
As for the coverage stating that Amazon is down for a third day and horribly crippled, I can tell you that we are operating around the present issues, are still on Amazon infrastructure, and are not significantly impacted at this time. Had we finished implementing our Amazon-only contingency plans by the time this happened, things would barely have skipped a beat.
So, take the hype about the “Great Amazon Crash of 2011” with a grain of salt. The real lesson is that in today’s cloud, contingency planning still counts. Amazon’s alternative regions in California, Ireland, Tokyo and Singapore have hummed along without a hiccup throughout this time.
If Amazon made it easier to move or replicate resources among regions, implementing our contingency plans would be easier. If cloud providers in general made portability among one another a point-and-click affair, that would be better still.
Other services such as Amazon’s RDS (Relational Database Service) and Elastic Beanstalk rely on EC2 as a sub-component, so they were impacted as well. The core issue at Amazon appears to have involved the storage component on which EC2 increasingly relies: EBS (Elastic Block Store). Ultimately, a series of related failures, plus the overload of the systems that remained online, caused instability across many components within the same data center.
Looking ahead, I would like to see a world where Amazon moves resources automagically across data centers and replicates them in multiple regions seamlessly. I also have questions about the storage systems behind the scenes that power things like EBS, but until I have more information it is difficult to comment on their robustness.
Both users and providers of clouds should take steps to move away from reliance on a single data center. Initially, the burden by necessity falls on the cloud’s customers. Over time, providers should develop mechanisms so that global distribution and redundancy happen more seamlessly.
At a higher level, components must be designed to operate as autonomously as possible. If a system goes down in New York City, and a system in London relies on it, London may go down as well. The burden therefore also falls on us to design software and infrastructure that carefully account for every failure and degradation scenario.
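To make that concrete, here is a minimal sketch of the kind of defensive call the London system could make; the service URL, timeout, and fallback cache are invented for illustration. The idea is that an unreachable New York dependency degrades the response instead of taking London down with it.

```python
# Sketch: call a remote dependency with a short timeout and a fallback,
# so a failure in one data center degrades the caller rather than killing it.
# The URL and in-memory cache are placeholders for the real dependency.
import requests  # third-party HTTP client (pip install requests)

NYC_SERVICE = "https://nyc.internal.example.com/api/recommendations"
_last_good_response: list = []  # stale-but-usable data from earlier calls

def get_recommendations(user_id: str) -> list:
    """Prefer live data; fall back to the last good copy if NYC is down."""
    global _last_good_response
    try:
        resp = requests.get(NYC_SERVICE, params={"user": user_id}, timeout=2)
        resp.raise_for_status()
        _last_good_response = resp.json()
    except requests.RequestException:
        # New York is unreachable or slow: serve stale data rather than fail.
        pass
    return _last_good_response
```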