Amazon’s EBS system caused a days-long outage last week that affected almost everyone in the us-east-1 region. I love reading a good postmortem, so I’m collecting here the useful writeups I’ve found (mostly on Hacker News) that explain what happened and how to do better next time.
Postmortems
- Official AWS postmortem: http://aws.amazon.com/message/65648/
- Netflix: http://techblog.netflix.com/2011/04/lessons-netflix-learned-from-aws-outage.html
- Heroku: http://status.heroku.com/incident/151
- Twilio: http://www.twilio.com/engineering/2011/04/22/why-twilio-wasnt-affected-by-todays-aws-issues/
- Eric Hammond (AWS guru): http://alestic.com/2011/04/ec2-outage
- SimpleGeo: http://developers.simplegeo.com/blog/2011/04/26/how-simplegeo-stayed-up/
Analysis
- Greplin: http://tech.blog.greplin.com/aws-best-practices-and-benchmarks
- CloudHarmony: http://cloudharmony.com/b/2011/04/unofficial-ec2-outage-postmortem-sky-is.html
- Joyent: http://joyeur.com/2011/04/24/magical-block-store-when-abstractions-fail-us/
- O'Reilly: http://broadcast.oreilly.com/2011/04/the-aws-outage-the-clouds-shining-moment.html