One of the biggest misconceptions out there is that the Cloud is simply one big data center somewhere, and that everyone’s livelihood depends on it always being up and running. One of my biggest pet peeves is headlines screaming that the “cloud goes down.” Take this instance, for example, about EC2 outages in AWS’ US East Region. The headline reads “Amazon Cloud Goes Down Friday Night, Taking Netflix, Instagram And Pinterest With It.” But do they really mean the Amazon Cloud?
The Amazon Cloud is 30 services across nine distinct geographical regions around the globe. Did that go down? No.
The Amazon Cloud is a minimum of two individual, geographically isolated availability zones in each region around the world. Did all of those go down? No.
In fact, did every Availability Zone in the US East Region go down? Did all of the services within the region fail? No.
So why is it that the “Amazon Cloud” went down? It sounds to me like components of a network experienced a failure.
What actually happened is that a single availability zone in the US East Region experienced roughly 90 minutes of service failure (for Netflix, at least), affecting virtual machine instances and some of the block storage components. No data loss, no viruses, no exposures.
Additionally, saying that the “Cloud” goes down, taking others with it, is also slightly disingenuous. Did Netflix cease to operate for 90 minutes? Apparently not, because the tweet quoted at the article’s opening notes that some users were experiencing problems, not that the entire service was down.
Forbes notes in the article that its Flipboard content wasn’t updating reliably 24 hours after the problem. Here’s my question: What do we know about the architecture of these services? A follow-up article notes Instagram’s response to the outage, which had been caused by unusually violent storms that affected the East Coast:
“As of Friday evening of June 29, 2012, Instagram is experiencing technical difficulties. An electrical storm in Virginia has affected most of our servers, and our team of engineers is working hard to restore service.” -- Instagram, per Forbes.com
Here’s what I make of that statement: Instagram put all of its eggs in one basket. Most of its servers are in US East? Why no balance? Why no servers in US West? Why not in Dublin? Cloud architecture should be designed for failure, just like traditional architecture. Here’s what else is noted in the follow-up article: no problems from The Guardian, HootSuite, UrbanSpoon, or EngineYard (PaaS). Interesting, no? Clearly, the Amazon Cloud didn’t go down, because these guys kept operating.
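The lesson reads the same in code. Here’s a minimal, hypothetical sketch of the balance Instagram apparently lacked: keep a preference-ordered list of regions and fail over when the primary goes unhealthy. (The region names are real AWS regions; the health map stands in for whatever monitoring you actually run — this is an illustration, not anyone’s production logic.)

```python
# Preference-ordered list of regions a service is deployed in.
REGIONS = ["us-east-1", "us-west-2", "eu-west-1"]

def pick_region(health):
    """Return the first healthy region in preference order,
    or None if every region is reporting unhealthy."""
    for region in REGIONS:
        if health.get(region, False):
            return region
    return None

# With us-east-1 down, traffic shifts to us-west-2.
print(pick_region({"us-east-1": False, "us-west-2": True, "eu-west-1": True}))
# → us-west-2
```

That’s the whole idea behind designing for failure: losing one region degrades you to your second choice instead of taking you offline.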
Cloud infrastructure is super cheap compared to traditional racking and stacking. In most instances, you’re not even charged for servers that aren’t running. There are load balancers, DNS routing services, auto-scaling tools, and on and on and on.
The truth of the matter is that the cloud didn’t go down. Companies affected long-term may not have properly architected their cloud infrastructure. We don’t know. To imply that AWS is the cause of Flipboard being slow to recover, or of Instagram going down entirely, may lay entirely too much blame at the feet of a cloud provider and too little at the feet of the company.
Until the Amazon Cloud, or any other, experiences a total and complete loss of service, it would behoove everyone to gain a much better understanding about what happens when there’s a service outage. Let’s scale back on the ominous “Cloud Goes Down,” because that’s simply not the case.
Matt Jordan is the Cloud Services Manager for JHC Technology. He can be reached at mjordan (at) jhctechnology.com, @matt_jhc, or connect with him on LinkedIn.