
Summary: Google Cloud Storage is now (significantly) cheaper, faster and more available than Amazon S3.

Also, it supports multi-region whereas S3 does not.

Love the direction Google Cloud is taking <3



The best thing about Google Cloud is the Console. I haven't seen a comparable console for managed services. Take the UX nightmare that is the AWS Management Console, separate that complexity out neatly, make everything available in a fast search box, and you have the Google Cloud Console.

It's the primary reason I've switched over to Datastore and App Engine.


Hmm. I actually really like the AWS management console. I find that everything is logically organized and easy to manipulate. That could just be because I've been using it for a long time.


I agree with you. Google's console is nigh unusable.


We're all working hard to be everyone's favorite. Thank you!


What about ~~reliability~~ durability of storage (bitrot, not the uptime)? I didn't see any information comparing this, is it practically the same across both?

edit: durability, not reliability.


At this point, durability is hard to compare because the published numbers are total fantasies. S3's published durability is 11 9s, which I've complained about in the past because the chance that the human race will be wiped out next year by an asteroid impact is probably not too far from one in a hundred million, which puts a hard cap on durability at around eight nines.

Then ask yourself, "is the chance that someone made a mistake in designing the system which causes catastrophic data loss higher or lower than one in a billion?"

Yes, the number is total fantasy. The only actual meaning of the durability SLA is that if the durability SLA is violated, your cloud provider will give you some service credits. If the data is worth more than some service credits, this is small consolation.

In practice, if you absolutely must compare durability, what you are trying to do is compare the frequency of black swan events, but actually estimating the frequency of those events requires access to proprietary data which you don't have.

Both providers are undoubtedly storing the data in multiple locations on multiple types of storage systems with multiple layers of error coding.

After thinking about this issue earlier, the most likely data loss scenario in my opinion was "I get injured, the company hires someone incompetent who doesn't pay the bill for a year while I'm in the hospital."


I've always seen durability from the perspective of Amazon telling me, "Look, we're going to lose your data - here's how quickly you should expect it to rot." Basically, take the number of objects you have, and multiply by 1e-7 each year. So, if I have 400 billion objects (4e11), Amazon is going to lose approx 40,000 of them/year, and I should be prepared for that.

With regards to your asteroid strike (and other disasters such as riot, insurrection, war, hurricane, earthquake, etc.) - these are all disclaimed in the agreement you sign with Amazon under a clause known as Force Majeure, which essentially means "acts of God". The durability clause comes into effect under the normal course of business, not during exceptional events. For those types of scenarios, you'll want to have a business continuity plan in place, not a durability formula on your storage service.


Your numbers are off... according to Amazon's durability notes, if you have 400 billion objects then Amazon would expect to lose less than 40 per year, not 40,000. And for that level of storage, you're paying Amazon something like a quarter million dollars per year to do storage for you.

My point is that there is no point in taking that kind of durability guarantee into account, because there are much bigger threats to data storage. It's kind of like saying that driving by car is very safe if you don't get into an accident--while technically true, it's not a very useful fact.


The exact values may be off, but that is how to think of it. Losing any specific object is less likely than earth ending. But these are providers with many trillion objects. In that case there are a handful of objects that are going missing each year, on average. The durability guarantee allows customers to make an educated decision as to how much they invest in loss prevention.

But let's get back to that 400B-object customer who's losing 40. How much does it cost to verify (resilver) all those bits, even annually? Is it even tenable?


This response got a bit long.

No, that's absolutely not how to think about it. In short, it's a dangerous simplification and it's not at all representative of how failures work in a system like S3 or Google Cloud Storage. It does make good marketing copy, but if you are an engineer or a program lead you have a certain level of responsibility for understanding why cloud storage does not, in practice, give you eleven 9s of durability per object every year. (At a first approximation, you would at least expect the data loss to follow a Poisson distribution, but…)

Drilling down into the technical guts, S3 and Google Storage do not store separate objects in the lower storage layers; it's too inefficient. So below the S3 / GCS object API, you have stripes of data spread across multiple data centers with error encoding, along with a redundant copy on tape or optical media. Randall Munroe estimated Google data storage at 15 EB (https://what-if.xkcd.com/63/), so taking that number, let's suppose a stripe size of 100 MB (just picking a number out of the air that seems reasonable) and you get 1.5e11 stripes. Taking Amazon's 11 nines, that gives a loss of 1 stripe every 8 months.
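The stripe arithmetic above can be reproduced directly (the 15 EB fleet size and 100 MB stripe size are this comment's own assumptions, not published figures):

```python
fleet_bytes = 15e18     # ~15 EB, from Munroe's estimate
stripe_bytes = 100e6    # 100 MB stripe size, picked out of the air
stripes = fleet_bytes / stripe_bytes     # 1.5e11 stripes
losses_per_year = stripes * 1e-11        # applying 11 nines per stripe
months_between_losses = 12 / losses_per_year
print(stripes, losses_per_year, months_between_losses)
# ~1.5e11 stripes, ~1.5 stripe losses per year, i.e. one every ~8 months
```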

So, all we've done so far is look at how a system would be implemented, and we've already completely destroyed the notion that you would expect 40 of 400B objects to disappear due to bit rot in any particular year. Supposing the objects are 10KB in average size, you might expect most years to lose no objects at all, and if you lose any you might lose 10,000 at the same time—and the entire extent of your recourse is to get a service credit from your cloud provider.
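Under the same assumptions (10 KB average objects, 100 MB stripes, ~1.5 fleet-wide stripe losses per year), the burstiness is easy to see: a lost stripe takes ~10,000 objects with it, while any one customer's small share of the fleet makes a loss in a given year unlikely. A hedged sketch:

```python
import math

object_bytes = 10e3     # assumed 10 KB average object
stripe_bytes = 100e6    # assumed 100 MB stripe
objects_per_stripe = stripe_bytes / object_bytes  # 10,000 objects vanish per lost stripe

# The 400B-object customer's share of a ~15 EB fleet:
customer_bytes = 4e11 * object_bytes              # 4 PB
share = customer_bytes / 15e18                    # ~2.7e-4 of the fleet

lam = 1.5 * share             # expected stripes of this customer's data lost per year
p_no_loss = math.exp(-lam)    # Poisson probability of a loss-free year
print(objects_per_stripe, round(p_no_loss, 4))    # 10000.0 0.9996
```

So most years this customer loses nothing at all, and in the rare year a stripe of theirs dies, ~10,000 objects disappear at once.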

The gotcha is that the system simply isn't that reliable. First of all, engineers at Amazon and Google are constantly pushing new configuration and software updates to their stack. Some of these software updates can result in catastrophic data loss, and some of these errors will not get caught by canaries. "Catastrophic" might mean metadata corruption, it might mean the loss of many stripes all at the same time, but from the cloud provider's perspective they're still meeting SLA for most of their customers so most of their customers are happy. On top of that, you also have to take into account the possibility that a design flaw in the storage media would cause massive data loss across multiple data centers simultaneously, or other nightmare scenarios like that. Given that I've personally experienced data loss due to design flaws in storage media and I only have ever owned twenty hard drives or so in my life, you can imagine that a fleet with millions of hard drives presents some unique reliability and durability problems.

You can pretend that these configuration and programming errors are "unusual events" but the fact is that stripe loss for any reason is already an unusual event, and you might as well include the most probable cause of data loss in your model if you are going to model it at all.

So, what is the SLA? It's part of a contract. It defines when the contract is performed and when it is broken. It's also a piece of marketing and sales leverage. That's all. It's not a realistic or particularly useful description of how a system actually works—so the responsible engineers and program managers at companies which use cloud services are always asking themselves, "What happens if Amazon or Google violates their SLA? Will I lose my job?"

(A footnote: You don't need to verify the bits yourself, cloud providers will send you messages when your data is lost. If you want more durability then you go multi-cloud or buy a tape library.)

(Disclosure: I work at a company that provides cloud storage.)


I found this response to be fantastic, educational, and I've already bookmarked it for future discussions with people regarding cloud storage - so thanks very much for your response. I wish I could highlight it as an "Answer of the Month" - but at the very least, I'll share it with others when discussing the topic.

So, perhaps another way of looking at this is that the 11 9s of durability means Amazon is providing a guarantee that they will lose at least that many objects; but, for all the reasons you highlighted, their underlying redundancy mechanisms mean there are all sorts of ways a catastrophic failure could result in a loss of data many orders of magnitude larger, not only during exceptional events, but in the normal course of business.

I would note that your anecdote about losing "data on a hard drive" isn't as relevant at cloud-storage scale, because one of the things that has been drilled into my mind by a colleague who works on a cloud system at scale is that they not only assume they will lose a single device, but they also scale so that they can lose (in order) a complete rack, a complete PDU, and a complete data center - and still continue to provide availability to storage. That is, in the normal course of business, they plan on losing data centers and continuing to provide full availability (albeit at reduced durability until that data center is restored). Google takes this up a notch and provides availability even in the event that an entire region is lost. Cautious companies can roll their own business continuity plans on top of this as well.

What would be interesting, but unlikely, is for Amazon/Microsoft/Google to share with their customers what their actual data loss was in prior periods (say, per year), and then provide a rolling graph of actual loss. Also useful (and almost guaranteed not to be available): what percentage of customers lost more than their SLA allows each month.


Disclosure: I currently work at AWS and have a bit more than passing familiarity with large scale storage systems. I know what erasure encoding is. I know from experience just how wrong things can go.

The passing comment about the distribution of errors is important. However, the "1 stripe is 100 MB, and objects are 10KB, ergo you'll lose 0 or 10,000 objects" bit is bizarro. I suspect you're letting your personal experience lead to assumptions that may not be true in other implementations.


Yes, that's definitely not true on all the storage systems I've worked on. Some storage systems will pack a single customer's data into a single stripe and others won't.


From https://cloud.google.com/storage/docs/storage-classes

[paraphrased] All storage classes are designed for 99.999999999% durability

As the previous replies to your comment mention, at that order of magnitude there are many other causes of data loss to worry about.

Disclaimer: work for an Alphabet company but not doing anything related to this.


I think that you mean durability


Could you please provide a more detailed comparison? :)


Here's an old article with lots of numbers and performance graphs:

http://blog.zachbjornson.com/2015/12/29/cloud-storage-perfor...



