Amazon Availability Zones are such a fucking lie. ("Shared nothing, and an outage affecting one will not affect other Zones in the Region.")
I've seen more failures which take out multiple AZs than which take out only a single AZ. So, a prudent person would split their application across regions (which are relatively shared nothing, except for admin/account level stuff), but Amazon goes out of its way to not make that easy -- you're using the public Internet, pay higher costs, etc.
The right choice is probably EC2 plus a non-EC2 provider (your own hosted stuff, another cloud (?), etc.), with protection in case either goes down. But that is a lot of work, and if you're on a PaaS like Heroku which is 100% exposed to EC2, you can't do it.
Even funnier are people doing server monitoring of (things in EC2) from within EC2. When the EC2 outage happens, there's obviously no problem because no alerts get sent...
For some people, that might be fine. If you don't have plans for how to rapidly move out of EC2, you might as well just sleep through an all-of-EC2-goes-down outage for all you can do about it.
You should at least know there is an outage to have something to tell your downstream customers. It is really embarrassing to have a customer (or your boss) call to report an outage you don't yet know about, even if there is fuck all you can do to resolve it. Basic principle of ops.
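To make that concrete: the bare minimum is a check that runs from *outside* the provider you're monitoring, so the monitor doesn't die with the thing it watches. A minimal sketch (the URL and alert threshold are my own hypothetical choices, not anything from a real setup):

```python
# Minimal external health check, meant to run from a host OUTSIDE the
# provider being monitored. Everything here is a sketch; the endpoint
# and thresholds are hypothetical.
import urllib.request
import urllib.error

def check(url, timeout=5):
    """Return True if the endpoint answers with a 2xx within the timeout."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return 200 <= resp.status < 300
    except (urllib.error.URLError, OSError):
        return False

def should_alert(history, threshold=3):
    """Page only after `threshold` consecutive failures, so one dropped
    packet doesn't wake anyone up. `history` is a list of check results,
    oldest first."""
    return len(history) >= threshold and not any(history[-threshold:])
```

You'd run `check("https://example.com/health")` on a cron-style loop from a box on someone else's network and page when `should_alert` fires. Even if you can't fix the outage, at least you know before your customers tell you.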
If my case can help you: my company uses a service from one vendor for load-balancing traffic across multiple CDNs/clouds. We are no longer impacted by the failure of any single provider. You can read this http://tinyurl.com/7pwfza7 (I'm a user, not a vendor).
I can't figure out why you people are using URL shorteners on HN, but I believe it is not looked upon well. So, for others, these links are as follows:
Do you work for a DNS provider or CDN or something (so as to see this in near realtime)? Envy.
I haven't seen a lot of people using both EC2 and Terremark for the same app -- kind of different markets. Not technically unreasonable, but Terremark seems to be more enterprise IT outsourcing, while EC2 (followed at a very far remove by the other clouds, including Rackspace) serves Internet-delivered consumer apps, or at least larger-scale public services.
Here's an idea I've thought about but don't have time to do anything with: a peer-to-peer monitoring network, so each new server on each new network makes it more robust. No idea how the details would work out.
That gets done for network/application performance monitoring (alternatives to Keynote, Gomez, etc.; it's how some of their own products work). It's kind of overkill for basic application-level monitoring -- there's a tradeoff between the number of endpoints checking and the frequency of checks. I guess you could round-robin checks across a larger number of end nodes, too, to get both.
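The round-robin idea above can be sketched in a few lines: rotate which monitoring node checks which target each round, so every target gets seen from many vantage points over time without every node checking everything. This is just an illustration of the scheduling trick; the node/target names are made up.

```python
def round_assignments(nodes, targets, round_no):
    """For a given round number, map each target to the monitoring node
    that should check it. Shifting the offset by round_no rotates each
    target across all vantage points over successive rounds."""
    return {t: nodes[(i + round_no) % len(nodes)]
            for i, t in enumerate(targets)}

# Hypothetical example: 3 monitoring nodes, 2 targets.
nodes = ["mon-us", "mon-eu", "mon-ap"]
targets = ["app.example.com", "api.example.com"]
```

Over three rounds, `app.example.com` gets checked from all three nodes, so a single blinded vantage point (say, a monitor inside the same failing provider) can't mask an outage.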
We're set up across multiple AZs in the affected region, and all we had was a few minutes of failed requests to one AZ until our systems automatically shifted all the traffic to another.
Even during the major day-long outage last year, when we had (at the time) not really spread ALL our core systems across multiple AZs, we just re-launched those systems in another AZ and everything was up and running again.
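The automatic traffic shift described above boils down to routing only to AZs that pass health checks, and failing open if everything looks dead (so a broken health checker doesn't blackhole all traffic). A hedged sketch, with made-up AZ names and a dict standing in for whatever health-check system you actually use:

```python
def healthy_targets(targets, az_health):
    """Return the backends in AZs currently reporting healthy.
    If no AZ reports healthy, fail open and return everything --
    better to try possibly-dead backends than to serve nothing
    because the health checker itself broke."""
    alive = [t for t in targets if az_health.get(t["az"], False)]
    return alive if alive else list(targets)

# Hypothetical backends across two AZs.
targets = [
    {"az": "us-east-1a", "host": "10.0.1.5"},
    {"az": "us-east-1b", "host": "10.0.2.5"},
]
```

When `us-east-1a` fails its checks, requests shift to the `us-east-1b` backend with nothing but a few failed requests in between, which matches the behavior described above.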
The most recent one (I think it may have been ELB-specific; I don't have a huge sample set), and the big EBS outage (which affected multiple AZs, though only somewhat).
You need different regions for DR in any case. It's hardly unprecedented for network issues like this to take down multiple data centers in an area even when they're not part of the same provider.
AZs are supposed to be distinct datacenters within a single region. If all of your customers are in (e.g.) APAC, it's not unreasonable to put all your online processing within APAC, with high bandwidth connectivity between them and from each to customers. You might not be able to do master-master over extremely long distances for performance reasons under normal conditions, but you'd keep warm or cold backups totally out of the area. There are a lot of factors which go into the decision, but there are definitely times when 2 datacenters (often run by separate providers) with independent connectivity, but both within a specific distance, makes more sense than extreme separation.
It's sad how people knew how to do this stuff ~2002-2006 and then forgot it all (or just stopped caring) once the delicious cake of cloud appeared.
You missed my point: this is not a cloud problem except to vendors looking to sell non-cloud hosting. Any region is vulnerable - some clown with a backhoe, congestion / DDoS, routing screwups, etc. have taken out data centers in entire areas (Los Angeles, SF, NY, etc.) even when providers thought they had more redundancy. If you really need it, you spend the money on wide geographic separation.
For this reason I'm using a set of different VPS servers running on both Linode (UK datacenter) and Slicehost (US datacenter).
So separate datacenters, admin layer, providers and also important: billing.
Running a highly available cluster in this setup isn't trivial though, mostly due to network splits. It works quite well for specific purposes where availability is more important than data integrity (remote monitoring, in this case).
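When you favor availability over integrity across a split, both sides keep accepting writes and you reconcile afterwards. The simplest reconciliation is last-writer-wins on a per-key timestamp; a toy sketch of the idea (the replica format is my own invention for illustration):

```python
def merge_lww(a, b):
    """Last-writer-wins merge of two replicas after a partition heals.
    Each replica maps key -> (timestamp, value); for conflicting keys
    the higher timestamp wins. This knowingly discards the losing
    write -- the integrity you trade away for staying available."""
    merged = dict(a)
    for key, (ts, value) in b.items():
        if key not in merged or ts > merged[key][0]:
            merged[key] = (ts, value)
    return merged
```

For monitoring data that's usually fine: losing one stale status sample beats having the whole cluster refuse writes during a split. For an order database, obviously, it isn't.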
That's my plan too. By using dual clouds (again, in the UK and US), we're getting the highest failover protection we can afford. I can't afford for our e-commerce platform to be down, and the evidence shows that no single cloud is robust enough. We call it "Cloud Docking" :)
The Cloud isn't infallible and doesn't solve all the problems like everyone says; news at 11.
*Downvote if you have a legitimate technical reason I'm wrong, not just because you're throwing a hissy fit that your technology of the week isn't all that and a bag of chips.
This does make me tempted to just put everything in one zone (less latency) and have a backup in another region entirely. Clearly, backups in different AZs aren't the best plan.
Kind of sucks.