I'm the author of that post. I'm happy to respond to comments on the post on this thread for the next several hours. You can also leave comments on the post itself, and I will respond there at any time.
The problem isn't a lack of very accurate clocks per se, right? What you need is an accurate bound on clock error, whatever it might be. It sounds to me that non-Googles are specifying bound parameters that, unlike Google, they cannot guarantee. Doesn't that mean the blame goes to whomever put inaccurate bound parameters into the software? Why blame the algorithm for garbage input?
> It sounds to me that non-Googles are specifying bound parameters that, unlike Google, they cannot guarantee.
I've been complaining about this for years and it's so nice to see others echo the sentiment. Everyone chases timestamps but in reality they're harder to get right than most people are willing to acknowledge.
Very few places get NTP right at large scale, or at least to the accuracy required for this class of consistency. I've never seen anyone seriously measure an SLO for clock drift, often because their observability stack can't achieve the necessary resolution. Most places hand-wave the issue entirely and just assume their clocks will be fine.
The paper linked within TFA suggests a hybrid clock which is better but still carries some complications. I'll continue to recommend vector clocks despite their shortcomings.
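For readers who haven't used them: a vector clock is just a per-node counter map, merged on receive, which lets you detect causality (and, unlike timestamps, detect true concurrency). A minimal sketch, purely illustrative:

```python
class VectorClock:
    """Toy vector clock: one counter per node, merged on message receive."""

    def __init__(self, node_id):
        self.node_id = node_id
        self.clock = {}

    def tick(self):
        # Local event: bump our own counter and return a snapshot.
        self.clock[self.node_id] = self.clock.get(self.node_id, 0) + 1
        return dict(self.clock)

    def merge(self, other):
        # On receive: take the element-wise max, then tick our own entry.
        for node, count in other.items():
            self.clock[node] = max(self.clock.get(node, 0), count)
        return self.tick()


def happens_before(a, b):
    """a -> b iff every entry of a is <= b's and the clocks differ."""
    return all(a.get(k, 0) <= b.get(k, 0) for k in a) and a != b


# Node A does an event, sends to B; B's state is causally after A's.
a = VectorClock("a")
ta = a.tick()
b = VectorClock("b")
tb = b.merge(ta)
print(happens_before(ta, tb), happens_before(tb, ta))  # True False
```

The shortcoming alluded to above is visible here too: the map grows with the number of nodes, and two clocks with no ordering in either direction only tell you the events were concurrent, not how to resolve them.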
The problem is that the bound on clock error directly affects your performance. So if you're willing to accept, say, a second of clock error, then every transaction will take a minimum of one second. That level of performance is going to be unacceptable in many situations.
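A minimal sketch of why (Spanner-style "commit wait", illustrative only): the commit isn't acknowledged until the entire uncertainty window around its timestamp has provably passed, so the error bound becomes a hard floor on write latency.

```python
import time


def commit_wait(commit_ts: float, max_clock_error_s: float) -> None:
    """Don't acknowledge a commit until the clock-uncertainty window
    around its timestamp has definitely elapsed on every node."""
    deadline = commit_ts + max_clock_error_s
    while True:
        remaining = deadline - time.monotonic()
        if remaining <= 0:
            break
        time.sleep(remaining)


# With a 1-second error bound, every write eats ~1 s of commit wait;
# with a few milliseconds of uncertainty, it's barely noticeable.
```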
The potential clock error on VMs without dedicated time-keeping hardware is so large that performance turns into absolute garbage.
I would understand if the complaint was that Spanner is too slow without expensively accurate clocks and synchronization. But the complaint is that Spanner fails to guarantee consistency, which doesn't make sense to me. The requirements clearly include giving a valid clock bound, so if you give an invalid clock bound, it's clearly your fault for getting incorrect results, not Spanner's!
Spanner does guarantee consistency, thanks to its use of hardware atomic clocks and GPS. It's alternatives like CockroachDB, which lack this dedicated hardware, that can fail to guarantee consistency if clocks get out of sync (a problem that can't happen in Spanner).
Spanner is really fast and massively parallelizable.
Neat. The obvious question would be, how does kronos compare with ntpd? Do they work together, is kronos a replacement, do they solve different problems, etc.? Is it expected that all the servers in a cluster synced by kronos are already running ntpd, and that kronos provides an additional level of reduced skew on top of that?
It'd be great if you could address this somewhere in the top level README on GitHub.
I've added a section in the README for this.
It works in conjunction with ntpd, and the time provided by this library comes with some extra guarantees, like monotonicity and immunity to large clock jumps (more info in the README).
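For reference, the monotonicity guarantee alone can be sketched in a few lines (a simplified illustration, not kronos's actual algorithm): never hand out a timestamp smaller than one you've already handed out, even if the wall clock jumps backwards underneath you.

```python
import time


class MonotonicWallClock:
    """Toy wrapper: wall-clock-ish timestamps that never go backwards,
    even if the underlying system clock is stepped back by NTP."""

    def __init__(self):
        self._last = time.time()

    def now(self) -> float:
        t = time.time()
        # Clamp: if the system clock jumped backwards, keep returning
        # the highest value we've already given out.
        self._last = max(self._last, t)
        return self._last
```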
Hi, you used DynamoDB as an example of a weakly consistent system in the opening paragraph, but it actually supports both modes [1]. The point of confusion might come from the fact that the service described in the 2007 Dynamo paper was an inspiration for DynamoDB, rather than being DynamoDB itself.
Disclaimer, I work for AWS, but not on DynamoDB team.
It's not really the default settings, per se. You don't have to change any bit of configuration about your database to get consistency. The DynamoDB API gives you the GetItem API call and a boolean property to choose to make it a consistent read.
It's left as a very simple task for developers leveraging DynamoDB to make the appropriate trade offs on consistent or inconsistent read.
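Concretely, it's a per-call flag (sketch using the low-level GetItem API; the table and key here are made up):

```python
def get_item_params(table_name: str, key: dict, consistent: bool) -> dict:
    """Build arguments for DynamoDB's GetItem call. ConsistentRead=True
    requests a strongly consistent read (at roughly double the
    read-capacity cost); the default, False, allows an eventually
    consistent read."""
    return {"TableName": table_name, "Key": key, "ConsistentRead": consistent}


# With boto3 (not run here), the call would look like:
#   import boto3
#   client = boto3.client("dynamodb")
#   resp = client.get_item(**get_item_params("orders", {"id": {"S": "42"}}, True))
```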
source: Used to work for AWS on a service that heavily leveraged DynamoDB. Not _once_ did we experience any problems with consistency or reliability, despite them and us going through numerous network partitions in that time. The only major issue came towards the end of my time there when DynamoDB had that complete service collapse for several hours.
On the sheer scale that DynamoDB operates at, it's more likely to be a question of "How many did we automatically handle this week?" than "How often do we have to deal with network partitions?"
It's enough of a gray area to make DynamoDB a poor example in this context, since if I were to claim that it was eventually consistent without additional configuration, then an informed person might reasonably assume I didn't know what I was talking about.
It would be better to state that both eventually consistent and fully consistent reads are available, and consistency can be enforced up front via configuration.
I recently came across CockroachDB and found its capabilities interesting, almost too good to be true. I've also been looking at Citus Data, which shards and distributes transactions; are you aware of any consistency shortcomings with it?
Thanks for pointing that out - I have yet to RTFM and dive deep. I wonder how frequently time-sync problems occur in virtual environments after NTP syncing - I've seen pretty erratic behavior on virtual Active Directory domain controllers even after syncing with Hyper-V and VMware.
It's been a long time since I messed with domain controllers, but I believe Microsoft used to have explicit guides for handling time on virtual DCs. At the time we kept a couple of hardware DCs around to be safe, but I do remember that having the VMware agent correct the time could produce bad results. I think it was because it immediately stepped the clock rather than slowly correcting the drift, but it's been a long time, so my memory could be off.
Here's their blog post on how they manage to live without atomic clocks. I've found at least one assumption of theirs that I don't agree with, but notwithstanding that, it's a good read.
My experience from test workloads on Cockroach is that single transaction performance is very bad compared to something like Postgres -- with 2.0 I was seeing easily 10x worse performance than Postgres with a three-node test cluster on Google Cloud. My impression is that it's worse for apps that have lots of small CRUD transactions on low numbers of rows, as is typical with web/mobile UIs.
Aggregate cluster performance seemed very good, though; i.e., adding a bunch more concurrent transactions did not slow down the other transactions noticeably.
That post was actually quite different than this one. This one focuses on distributed vs. global consensus.
But I added a note at the end that clearly documents my connection to FaunaDB. As far as Calvin, the post itself clearly says that it came out of my research group.
Nice writeup, but given the title, I was hoping you would also have a constructive test case that shows that these systems fail to meet their guarantees (like Jepsen did). :)
I wonder if any of the aforementioned systems (Calvin/Spanner/YugaByte) can opportunistically commit transactions, detect issues, and roll back and retry, all within the scope of the RPC, so that they still conform to the linearizability requirement?
I've only ever worked on small projects, so I'm not at all familiar with these very high-scale distributed databases, but the post seems to indicate that Spanner is in a league of its own because it integrates hardware into the mix, where the others are software-only. What are the differences in scale between the two categories mentioned?
Yes, Spanner is quite unusual in the distributed database world in how hardware is a pretty important part of their solution. Other systems may claim important integrations with hardware, but for Spanner, the architecture really relies on particular hardware assumptions.
To answer your question about scale: there is no real practical difference in scalability between the two categories discussed in the post. Partitioned consensus has better theoretical scalability, but I am not aware of any real-world workload that cannot be handled by global consensus with batching.
I think this "global consensus with batching does everything partitioned does" is a very much theory vs practice type of statement. As in, in theory there is no difference between theory and practice :-)
I've seen those batched consensus systems, and honestly, you're kidding yourself if you think they can handle a million qps. The transmit time of the data on Ethernet alone becomes an issue! Even with 40 gig, transmit time never becomes free. So now you're stuffing 1/10th of a million qps worth of data through a single, relatively small set of machines (3, 5, 7, 9?).
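A quick back-of-envelope with assumed numbers (1 KB average payload, leader fanning out each entry to a few followers) shows how fast a single consensus group's leader NIC fills up:

```python
# All numbers below are assumptions for illustration only.
qps = 1_000_000          # target transaction rate
bytes_per_txn = 1_000    # assumed average payload
fanout = 4               # leader re-sends each entry to this many followers

leader_egress_bits = qps * bytes_per_txn * fanout * 8
link_bits = 40e9         # 40 GbE

print(leader_egress_bits / link_bits)  # 0.8 -> 80% of the link, before any protocol overhead
```

And that's just egress on one box, ignoring replication acks, batching headers, and retransmits.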
"Systems that guarantee consistency only experience a necessary reduction in availability in the event of a network partition."
Many of the distributed clusters I've maintained had crap infrastructure and no change control, and parts of the clusters were constantly going down from lack of storage, CPU and RAM, or bad changes. The only reason the applications kept working were either (1) the not-broken vnodes continued operating as normal and only broken vnodes were temporarily unavailable, or (2) we shifted traffic to a working region and replication automatically caught up the bad cluster once it was fixed. Clients experienced increased error rates due primarily to these infrastructure problems, and very rarely from network partition.
Does your consistent model take this into account, or do you really assume that network partition will be the only problem?
It seems you have other problems (crap infrastructure and no change control) to deal with before the issues in this article become your biggest concern, but are not the cases you list themselves partition problems?
They cause partition, but their origin isn't the network. Nobody who runs a large system has perfectly behaving infrastructure. Infrastructure always works better in a lab than in the real world. Even if you imagine your infrastructure is rock-solid, people often make assumptions, like their quota is infinite, or their application will scale past the theoretical limits of individual network segments, i/o bounding, etc.
The point is, resources cause problems, and the network is just one of many resources needed by the system. Other resources actually have more constraints on them than the network does. If a resource is constrained, it will impact availability in a highly-consistent model.
The author states that simply adding network redundancy would reduce partitions, and infrastructure problems are proof that this is very short-sighted. "You have bigger problems" - no kidding! Hence the weak-consistency model!
Even if you maintain your infrastructure properly you run on x86 servers with disks and CPUs that need cooling, using network devices that have fascinating failure scenarios. I guess assuming that your infra is not reliable is a must for any database nowadays.
Do these concerns also apply in an HTAP or OLAP context e.g. systems like Cloudera's Kudu, which uses Hybrid Time? Or maybe Volt which you also worked on?
I've worked on a system that loaded data into Kudu in near real time and simultaneously ran queries on the data. Kudu has no transactions; consistency is eventual, which was sufficient given our near-real-time constraint. However, you do need a stable NTP source: we lost data when the cluster couldn't get a reliable NTP connection, decided to shut down, and the tablet servers' data files became corrupted.
Vertica seems to include some of these observations, such as global consistency and group commit. I believe these are easier to achieve (lower overhead) in OLAP due to fewer, larger transactions.
I question your statement that building apps on weakly-consistent systems is so difficult. I’ve worked on very large scale systems that you’ve definitely heard of and probably used that are built atop storage systems with very weak semantics and asynchronous replication. Aren’t such systems existence proofs, or do you think there’s just a huge difference between the abilities of engineers in various organizations?
I think it's a fairly noncontroversial statement. Dealing with eventual consistency is always going to be more difficult, and to require more careful thought and preparation, than working with immediately consistent systems.
How many programmers do you think are out there that have only ever worked on systems that use a single RDBMS instance, and what would happen if they tried to apply their techniques to a distributed, eventually consistent environment?
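To make the trap concrete, here's a toy sketch of the kind of anomaly I mean (naive last-writer-wins convergence, something a single-node RDBMS never exposes you to):

```python
# Two replicas of the same record, updated "concurrently" by different clients.
replica_a = {"counter": 0}
replica_b = {"counter": 0}

replica_a["counter"] += 1   # client 1, routed to replica A
replica_b["counter"] += 1   # client 2, routed to replica B

# Naive convergence: pick one write instead of merging both.
merged = replica_b["counter"]
print(merged)  # 1 -- one increment silently lost; a single RDBMS would say 2
```

On a single RDBMS, row locking makes both increments land; here you need an explicit merge strategy (CRDT counter, read-modify-write with versioning, etc.) or you lose writes.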
Exactly. My dad was coding when DBMSes rose to prominence, and it was basically a way to take a bunch of things that were hard to think about and sweep them under the rug. People wrote plenty of good software before they existed, but if you wanted to write a piece of data, you had to think about which disk and where on the disk and exactly the record format. Most programmers just wanted a genie that they could give data to and later ask for it back.
It's the same today, but worse. Most programmers still want a simple abstraction that lets them just build things. But now it's not just which sector on which disk, but also which server in which data center on which continent, while withstanding the larger number of failure modes at that scale.
When necessary, people can explicitly address that complexity. But it has a big cost, a high cognitive load.
The biggest challenge, in my experience, is explaining the weak consistency guarantees to stakeholders, especially in a QA setting (e.g. product owner demos the product to colleagues, numbers are not immediately up-to-date).
Great read, thank you for sharing. Do you have any opinion on the design of Eris[0]? Consistency is achieved with extra hardware, but that hardware is a network-level sequencer.
AFAIK Eris (and previous similar work from the same group) assume network ordering guarantees that can be provided in a single datacenter but probably not in a WAN setting. This discussion is about distributed and geo-replicated databases.
I think it would be helpful to move the disclaimer to the top of the post, just for clarity's sake.
Is there a forecasted date on the release of the independent Jepsen study? Who is performing it? Thanks!!
Is there a "best-of-both-worlds" approach that could work, or are these two approaches mutually exclusive? I have to imagine that time drift can eventually be reconciled with some kind of time delta.