Amazon Availability Zones are such a fucking lie. ("Shared nothing, and an outage affecting one will not affect other Zones in the Region.")
I've seen more failures which take out multiple AZs than which take out only a single AZ. So, a prudent person would split their application across regions (which are relatively shared nothing, except for admin/account level stuff), but Amazon goes out of its way to not make that easy -- you're using the public Internet, pay higher costs, etc.
The right choice is probably EC2 plus a non-EC2 provider (your own hosted stuff, another cloud (?), etc.), with protection in case either goes down. But that is a lot of work, and if you're on a PaaS like Heroku which is 100% exposed to EC2, you can't do it.
Even funnier are people doing server monitoring of (things in EC2) from within EC2. When the EC2 outage happens, there's obviously no problem because no alerts get sent...
For some people, that might be fine. If you don't have plans for how to rapidly move out of EC2, you might as well just sleep through an all-of-EC2-goes-down outage for all you can do about it.
You should at least know there is an outage to have something to tell your downstream customers. It is really embarrassing to have a customer (or your boss) call to report an outage you don't yet know about, even if there is fuck all you can do to resolve it. Basic principle of ops.
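To make that concrete: the bare minimum is a check that runs from *outside* the provider you're monitoring, so the monitor doesn't die with the thing it watches. A minimal sketch (the URL and alert threshold are my own hypothetical choices, not anything from a real setup):

```python
# Minimal external health check, meant to run from a host OUTSIDE the
# provider being monitored. Everything here is a sketch; the endpoint
# and thresholds are hypothetical.
import urllib.request
import urllib.error

def check(url, timeout=5):
    """Return True if the endpoint answers with a 2xx within the timeout."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return 200 <= resp.status < 300
    except (urllib.error.URLError, OSError):
        return False

def should_alert(history, threshold=3):
    """Page only after `threshold` consecutive failures, so one dropped
    packet doesn't wake anyone up. `history` is a list of check results,
    oldest first."""
    return len(history) >= threshold and not any(history[-threshold:])
```

You'd run `check("https://example.com/health")` on a cron-style loop from a box on someone else's network and page when `should_alert` fires. Even if you can't fix the outage, at least you know before your customers tell you.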
If my case can help you: my company uses a service from one vendor for load-balancing traffic across multiple CDNs/clouds. We are no longer impacted by the failure of any single provider. You can read this http://tinyurl.com/7pwfza7 (I'm a user, not a vendor).
I can't figure out why you people are using URL shorteners on HN, but I believe it is not looked upon well. So, for others, these links are as follows:
Do you work for a DNS provider or CDN or something (so as to see this in near realtime)? Envy.
I haven't seen a lot of people using both EC2 and Terremark for the same app -- kind of different markets. Not technically unreasonable, but Terremark seems to be more enterprise IT outsourcing, while EC2 (followed at a very far remove by the other clouds, including Rackspace) serves Internet-delivered consumer apps, or at least larger-scale public services.
Here's an idea I've thought about but don't have time to do anything with: a peer-to-peer monitoring network, so each new server on each new network makes it more robust. No idea how the details would work out.
That gets done for network/application performance monitoring (alternatives to Keynote, Gomez, etc.; it's how some of their own products work). It's kind of overkill for basic application-level monitoring -- there's a tradeoff between the number of endpoints checking and the frequency of checks. I guess you could round-robin checks across a larger number of end nodes, too, to get both.
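The round-robin idea above can be sketched in a few lines: rotate which monitoring node checks which target each round, so every target gets seen from many vantage points over time without every node checking everything. This is just an illustration of the scheduling trick; the node/target names are made up.

```python
def round_assignments(nodes, targets, round_no):
    """For a given round number, map each target to the monitoring node
    that should check it. Shifting the offset by round_no rotates each
    target across all vantage points over successive rounds."""
    return {t: nodes[(i + round_no) % len(nodes)]
            for i, t in enumerate(targets)}

# Hypothetical example: 3 monitoring nodes, 2 targets.
nodes = ["mon-us", "mon-eu", "mon-ap"]
targets = ["app.example.com", "api.example.com"]
```

Over three rounds, `app.example.com` gets checked from all three nodes, so a single blinded vantage point (say, a monitor inside the same failing provider) can't mask an outage.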
We're set up across multiple AZs in the affected region, and all we had was a few minutes of failed requests to one AZ until our systems automatically shifted all the traffic to another.
Even during the major day-long outage last year, when we had (at the time) not really spread ALL our core systems across multiple AZs, we just re-launched those systems in another AZ and everything was up and running again.
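The automatic traffic shift described above boils down to routing only to AZs that pass health checks, and failing open if everything looks dead (so a broken health checker doesn't blackhole all traffic). A hedged sketch, with made-up AZ names and a dict standing in for whatever health-check system you actually use:

```python
def healthy_targets(targets, az_health):
    """Return the backends in AZs currently reporting healthy.
    If no AZ reports healthy, fail open and return everything --
    better to try possibly-dead backends than to serve nothing
    because the health checker itself broke."""
    alive = [t for t in targets if az_health.get(t["az"], False)]
    return alive if alive else list(targets)

# Hypothetical backends across two AZs.
targets = [
    {"az": "us-east-1a", "host": "10.0.1.5"},
    {"az": "us-east-1b", "host": "10.0.2.5"},
]
```

When `us-east-1a` fails its checks, requests shift to the `us-east-1b` backend with nothing but a few failed requests in between, which matches the behavior described above.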
The most recent one (I think it may have been ELB-specific; I don't have a huge sample set), and the big EBS outage (which affected multiple AZs, though only somewhat).
You need different regions for DR in any case. It's hardly unprecedented for network issues like this to take down multiple data centers in an area even when they're not part of the same provider.
AZs are supposed to be distinct datacenters within a single region. If all of your customers are in (e.g.) APAC, it's not unreasonable to put all your online processing within APAC, with high bandwidth connectivity between them and from each to customers. You might not be able to do master-master over extremely long distances for performance reasons under normal conditions, but you'd keep warm or cold backups totally out of the area. There are a lot of factors which go into the decision, but there are definitely times when 2 datacenters (often run by separate providers) with independent connectivity, but both within a specific distance, makes more sense than extreme separation.
It's sad how people knew how to do this stuff ~2002-2006 and then forgot it all (or just stopped caring) once the delicious cake of cloud appeared.
You missed my point: this is not a cloud problem except to vendors looking to sell non-cloud hosting. Any region is vulnerable - some clown with a backhoe, congestion / DDoS, routing screwups, etc. have taken out data centers in entire areas (Los Angeles, SF, NY, etc.) even when providers thought they had more redundancy. If you really need it, you spend the money on wide geographic separation.
For this reason I'm using a set of different VPS servers running on both Linode (UK datacenter) and Slicehost (US datacenter).
So separate datacenters, admin layer, providers and also important: billing.
Running a highly available cluster in this setup isn't trivial though, mostly due to network splits. It works quite well for specific purposes where availability is more important than data integrity (remote monitoring, in this case).
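When you favor availability over integrity across a split, both sides keep accepting writes and you reconcile afterwards. The simplest reconciliation is last-writer-wins on a per-key timestamp; a toy sketch of the idea (the replica format is my own invention for illustration):

```python
def merge_lww(a, b):
    """Last-writer-wins merge of two replicas after a partition heals.
    Each replica maps key -> (timestamp, value); for conflicting keys
    the higher timestamp wins. This knowingly discards the losing
    write -- the integrity you trade away for staying available."""
    merged = dict(a)
    for key, (ts, value) in b.items():
        if key not in merged or ts > merged[key][0]:
            merged[key] = (ts, value)
    return merged
```

For monitoring data that's usually fine: losing one stale status sample beats having the whole cluster refuse writes during a split. For an order database, obviously, it isn't.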
That's my plan too. By using dual clouds (again, in the UK and US), we're getting the highest failover protection we can afford. I can't afford for our e-commerce platform to be down, and the evidence shows that no single cloud is robust enough. We call it "Cloud Docking" :)
The Cloud isn't infallible and doesn't solve all the problems like everyone says; news at 11.
*Downvote if you have a legitimate technical reason I'm wrong, not just because you're throwing a hissy fit that your technology of the week isn't all that and a bag of chips.
This does make me tempted to just put everything in one zone (less latency) and have a backup in another region entirely. Clearly, backups in different AZs aren't the best plan.
Kind of sucks.