I'm the author of that post. I'm happy to respond to comments on the post on this thread for the next several hours. You can also leave comments on the post itself, and I will respond there at any time.
The problem isn't a lack of very accurate clocks per se, right? What you need is an accurate bound on clock error, whatever it might be. It sounds to me that non-Googles are specifying bound parameters that, unlike Google, they cannot guarantee. Doesn't that mean the blame goes to whomever put inaccurate bound parameters into the software? Why blame the algorithm for garbage input?
> It sounds to me that non-Googles are specifying bound parameters that, unlike Google, they cannot guarantee.
I've been complaining about this for years and it's so nice to see others echo the sentiment. Everyone chases timestamps but in reality they're harder to get right than most people are willing to acknowledge.
Very few places get NTP right at large scale, or at least to the accuracy required for this class of consistency. I've never seen anyone seriously measure an SLO for clock drift, often because their observability stack can't achieve the necessary resolution. Most places hand-wave the issue entirely and just assume their clocks will be fine.
The paper linked within TFA suggests a hybrid clock which is better but still carries some complications. I'll continue to recommend vector clocks despite their shortcomings.
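For readers who haven't used them: a vector clock is just a per-node counter map, merged on receive, which lets you detect causality (and, unlike timestamps, detect true concurrency). A minimal sketch, purely illustrative:

```python
class VectorClock:
    """Toy vector clock: one counter per node, merged on message receive."""

    def __init__(self, node_id):
        self.node_id = node_id
        self.clock = {}

    def tick(self):
        # Local event: bump our own counter and return a snapshot.
        self.clock[self.node_id] = self.clock.get(self.node_id, 0) + 1
        return dict(self.clock)

    def merge(self, other):
        # On receive: take the element-wise max, then tick our own entry.
        for node, count in other.items():
            self.clock[node] = max(self.clock.get(node, 0), count)
        return self.tick()


def happens_before(a, b):
    """a -> b iff every entry of a is <= b's and the clocks differ."""
    return all(a.get(k, 0) <= b.get(k, 0) for k in a) and a != b


# Node A does an event, sends to B; B's state is causally after A's.
a = VectorClock("a")
ta = a.tick()
b = VectorClock("b")
tb = b.merge(ta)
print(happens_before(ta, tb), happens_before(tb, ta))  # True False
```

The shortcoming alluded to above is visible here too: the map grows with the number of nodes, and two clocks with no ordering in either direction only tell you the events were concurrent, not how to resolve them.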
The problem is that the bound on clock error directly affects your performance. So if you're willing to accept, say, a second of clock error, then every transaction will take a minimum of one second. That level of performance is going to be unacceptable in many situations.
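A minimal sketch of why (Spanner-style "commit wait", illustrative only): the commit isn't acknowledged until the entire uncertainty window around its timestamp has provably passed, so the error bound becomes a hard floor on write latency.

```python
import time


def commit_wait(commit_ts: float, max_clock_error_s: float) -> None:
    """Don't acknowledge a commit until the clock-uncertainty window
    around its timestamp has definitely elapsed on every node."""
    deadline = commit_ts + max_clock_error_s
    while True:
        remaining = deadline - time.monotonic()
        if remaining <= 0:
            break
        time.sleep(remaining)


# With a 1-second error bound, every write eats ~1 s of commit wait;
# with a few milliseconds of uncertainty, it's barely noticeable.
```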
The potential clock error on VMs without dedicated time-keeping hardware is so large that performance turns into absolute garbage.
I would understand if the complaint was that Spanner is too slow without expensively accurate clocks and synchronization. But the complaint is that Spanner fails to guarantee consistency, which doesn't make sense to me. The requirements clearly include giving a valid clock bound, so if you give an invalid clock bound, it's clearly your fault for getting incorrect results, not Spanner's!
Spanner does guarantee consistency, thanks to its use of hardware atomic clocks and GPS. It's alternatives like CockroachDB, which lack this dedicated hardware, that can fail to guarantee consistency if clocks get out of sync (a problem that can't happen in Spanner).
Spanner is really fast and massively parallelizable.
Neat. The obvious question would be, how does kronos compare with ntpd? Do they work together, is kronos a replacement, do they solve different problems, etc.? Is it expected that all the servers in a cluster synced by kronos are already running ntpd, and that kronos provides an additional level of reduced skew on top of that?
It'd be great if you could address this somewhere in the top level README on GitHub.
I've added a section in the README for this.
It works in conjunction with ntpd, and the time provided by this library comes with some extra guarantees, like monotonicity and immunity to large clock jumps (more info in the README).
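For reference, the monotonicity guarantee alone can be sketched in a few lines (a simplified illustration, not kronos's actual algorithm): never hand out a timestamp smaller than one you've already handed out, even if the wall clock jumps backwards underneath you.

```python
import time


class MonotonicWallClock:
    """Toy wrapper: wall-clock-ish timestamps that never go backwards,
    even if the underlying system clock is stepped back by NTP."""

    def __init__(self):
        self._last = time.time()

    def now(self) -> float:
        t = time.time()
        # Clamp: if the system clock jumped backwards, keep returning
        # the highest value we've already given out.
        self._last = max(self._last, t)
        return self._last
```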
Hi, you used DynamoDB as an example of a weakly consistent system in the opening paragraph, but it actually supports both modes [1]. The point of confusion might come from the fact that the service described in the 2007 Dynamo paper was an inspiration for DynamoDB, rather than being DynamoDB itself.
Disclaimer, I work for AWS, but not on DynamoDB team.
It's not really the default settings, per se. You don't have to change any bit of configuration about your database to get consistency. The DynamoDB API gives you the GetItem API call and a boolean property to choose to make it a consistent read.
It's left as a very simple task for developers leveraging DynamoDB to make the appropriate trade offs on consistent or inconsistent read.
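Concretely, it's a per-call flag (sketch using the low-level GetItem API; the table and key here are made up):

```python
def get_item_params(table_name: str, key: dict, consistent: bool) -> dict:
    """Build arguments for DynamoDB's GetItem call. ConsistentRead=True
    requests a strongly consistent read (at roughly double the
    read-capacity cost); the default, False, allows an eventually
    consistent read."""
    return {"TableName": table_name, "Key": key, "ConsistentRead": consistent}


# With boto3 (not run here), the call would look like:
#   import boto3
#   client = boto3.client("dynamodb")
#   resp = client.get_item(**get_item_params("orders", {"id": {"S": "42"}}, True))
```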
source: Used to work for AWS on a service that heavily leveraged DynamoDB. Not _once_ did we experience any problems with consistency or reliability, despite them and us going through numerous network partitions in that time. The only major issue came towards the end of my time there when DynamoDB had that complete service collapse for several hours.
On the sheer scale that DynamoDB operates at, it's more likely to be a question of "How many did we automatically handle this week?" than "How often do we have to deal with network partitions?"
It's enough of a gray area to make DynamoDB a poor example in this context, since if I were to claim that it was eventually consistent without additional configuration, then an informed person might reasonably assume I didn't know what I was talking about.
It would be better to state that both eventually consistent and fully consistent reads are available, and consistency can be enforced up front via configuration.
I recently came across CockroachDB and found its capabilities interesting, almost too good to be true. I've also been looking at Citus Data, which shards and distributes transactions; are you aware of any consistency shortcomings with it?
Thanks for pointing that out - I have yet to RTFM and dive deep. I wonder how frequently time-sync problems occur in virtual environments after NTP syncing - I've seen pretty erratic behavior on virtual Active Directory domain controllers even after syncing with Hyper-V and VMware.
It's been a long time since I messed with domain controllers, but I believe Microsoft used to have explicit guides for handling time on virtual DCs. At the time we kept a couple of hardware DCs around to be safe, but I do remember that having the VMware agent correct the time could produce bad results. I think it was because it immediately stepped the clock rather than slowly correcting the drift, but it's been a long time, so my memory could be off.
Here's their blog post on how they manage to live without atomic clocks. I've found at least one assumption of theirs that I don't agree with, but notwithstanding that, it's a good read.
My experience from test workloads on Cockroach is that single transaction performance is very bad compared to something like Postgres -- with 2.0 I was seeing easily 10x worse performance than Postgres with a three-node test cluster on Google Cloud. My impression is that it's worse for apps that have lots of small CRUD transactions on low numbers of rows, as is typical with web/mobile UIs.
Aggregate cluster performance seemed very good, though; i.e., adding a bunch more concurrent transactions did not slow down the other transactions noticeably.
That post was actually quite different than this one. This one focuses on distributed vs. global consensus.
But I added a note at the end that clearly documents my connection to FaunaDB. As far as Calvin, the post itself clearly says that it came out of my research group.
Nice writeup, but given the title, I was hoping you would also have a constructive test case that shows that these systems fail to meet their guarantees (like Jepsen did). :)
I wonder if any of the aforementioned systems (Calvin/Spanner/YugaByte) can opportunistically commit transactions, detect issues, and roll back and retry, all within the scope of the RPC, so that they still conform to the linearizability requirement?
I've only ever worked on small projects, so I'm not at all familiar with these very high-scale distributed databases, but the post seems to indicate that Spanner is in a league of its own because it integrates hardware into the mix, where the others are software-only. What are the differences in scale between the two categories mentioned?
Yes, Spanner is quite unusual in the distributed database world in how hardware is a pretty important part of their solution. Other systems may claim important integrations with hardware, but for Spanner, the architecture really relies on particular hardware assumptions.
To answer your question about scale: there is no real practical difference in scalability between the two categories discussed in the post. Partitioned consensus has better theoretical scalability, but I am not aware of any real-world workload that cannot be handled by global consensus with batching.
I think this "global consensus with batching does everything partitioned does" is a very much theory vs practice type of statement. As in, in theory there is no difference between theory and practice :-)
I've seen those batched consensus systems, and honestly, you're kidding yourself if you think they can handle a million qps. The transmit time of the data on Ethernet alone becomes an issue! Even with 40 gig, transmit time never becomes free. So now you're stuffing 1/10th of a million qps worth of data through a single, relatively small set of machines (3, 5, 7, 9?).
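A quick back-of-envelope with assumed numbers (1 KB average payload, leader fanning out each entry to a few followers) shows how fast a single consensus group's leader NIC fills up:

```python
# All numbers below are assumptions for illustration only.
qps = 1_000_000          # target transaction rate
bytes_per_txn = 1_000    # assumed average payload
fanout = 4               # leader re-sends each entry to this many followers

leader_egress_bits = qps * bytes_per_txn * fanout * 8
link_bits = 40e9         # 40 GbE

print(leader_egress_bits / link_bits)  # 0.8 -> 80% of the link, before any protocol overhead
```

And that's just egress on one box, ignoring replication acks, batching headers, and retransmits.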
"Systems that guarantee consistency only experience a necessary reduction in availability in the event of a network partition."
Many of the distributed clusters I've maintained had crap infrastructure and no change control, and parts of the clusters were constantly going down from lack of storage, CPU and RAM, or bad changes. The only reason the applications kept working were either (1) the not-broken vnodes continued operating as normal and only broken vnodes were temporarily unavailable, or (2) we shifted traffic to a working region and replication automatically caught up the bad cluster once it was fixed. Clients experienced increased error rates due primarily to these infrastructure problems, and very rarely from network partition.
Does your consistent model take this into account, or do you really assume that network partition will be the only problem?
It seems you have other problems (crap infrastructure and no change control) to deal with before the issues in this article become your biggest concern, but are not the cases you list themselves partition problems?
They cause partition, but their origin isn't the network. Nobody who runs a large system has perfectly behaving infrastructure. Infrastructure always works better in a lab than in the real world. Even if you imagine your infrastructure is rock-solid, people often make assumptions, like their quota is infinite, or their application will scale past the theoretical limits of individual network segments, i/o bounding, etc.
The point is, resources cause problems, and the network is just one of many resources needed by the system. Other resources actually have more constraints on them than the network does. If a resource is constrained, it will impact availability in a highly-consistent model.
The author states that simply adding network redundancy would reduce partitions, and infrastructure problems are proof that this is very short-sighted. "You have bigger problems" - no kidding! Hence the weak-consistency model!
Even if you maintain your infrastructure properly you run on x86 servers with disks and CPUs that need cooling, using network devices that have fascinating failure scenarios. I guess assuming that your infra is not reliable is a must for any database nowadays.
Do these concerns also apply in an HTAP or OLAP context e.g. systems like Cloudera's Kudu, which uses Hybrid Time? Or maybe Volt which you also worked on?
I've worked on a system that loaded data into Kudu in near real time and simultaneously ran queries on the data. Kudu has no transactions; consistency is eventual, which was sufficient given our near-real-time constraint. However, you do need a stable NTP source: we lost data when the cluster couldn't get a reliable NTP connection, decided to shut down, and the tablet servers' data files became corrupted.
Vertica seems to include some of these observations, such as global consistency and group commit. I believe these are easier to achieve (lower overhead) in OLAP due to fewer, larger transactions.
I question your statement that building apps on weakly-consistent systems is so difficult. I’ve worked on very large scale systems that you’ve definitely heard of and probably used that are built atop storage systems with very weak semantics and asynchronous replication. Aren’t such systems existence proofs, or do you think there’s just a huge difference between the abilities of engineers in various organizations?
I think it's a fairly noncontroversial statement. Dealing with eventual consistency is always going to be more difficult, and to require more careful thought and preparation, than working with immediately consistent systems.
How many programmers do you think are out there that have only ever worked on systems that use a single RDBMS instance, and what would happen if they tried to apply their techniques to a distributed, eventually consistent environment?
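To make the trap concrete, here's a toy sketch of the kind of anomaly I mean (naive last-writer-wins convergence, something a single-node RDBMS never exposes you to):

```python
# Two replicas of the same record, updated "concurrently" by different clients.
replica_a = {"counter": 0}
replica_b = {"counter": 0}

replica_a["counter"] += 1   # client 1, routed to replica A
replica_b["counter"] += 1   # client 2, routed to replica B

# Naive convergence: pick one write instead of merging both.
merged = replica_b["counter"]
print(merged)  # 1 -- one increment silently lost; a single RDBMS would say 2
```

On a single RDBMS, row locking makes both increments land; here you need an explicit merge strategy (CRDT counter, read-modify-write with versioning, etc.) or you lose writes.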
Exactly. My dad was coding when DBMSes rose to prominence, and it was basically a way to take a bunch of things that were hard to think about and sweep them under the rug. People wrote plenty of good software before they existed, but if you wanted to write a piece of data, you had to think about which disk and where on the disk and exactly the record format. Most programmers just wanted a genie that they could give data to and later ask for it back.
It's the same today, but worse. Most programmers still want a simple abstraction that lets them just build things. But now it's not just which sector on which disk, but also which server in which data center on which continent, while withstanding the larger number of failure modes at that scale.
When necessary, people can explicitly address that complexity. But it has a big cost, a high cognitive load.
The biggest challenge, in my experience, is explaining the weak consistency guarantees to stakeholders, especially in a QA setting (e.g. product owner demos the product to colleagues, numbers are not immediately up-to-date).
Great read, thank you for sharing. Do you have any opinion on the design of Eris[0]? Consistency is achieved with extra hardware, but that hardware is a network-level sequencer.
AFAIK Eris (and previous similar work from the same group) assume network ordering guarantees that can be provided in a single datacenter but probably not in a WAN setting. This discussion is about distributed and geo-replicated databases.
I think it would be helpful to move the disclaimer to the top of the post, just for clarity's sake.
Is there a forecasted date on the release of the independent Jepsen study? Who is performing it? Thanks!!
Is there a "best-of-both-worlds" approach that could work, or are these two approaches mutually exclusive? I have to imagine that time drift can eventually be reconciled with some kind of time delta.