
The algorithm does seem nicer than a 2PC protocol in that there is no need for a 2PC coordinator. By distributing the metadata as it does the clients and servers can figure out what has been committed and what hasn't. However it doesn't appear to directly include semantics for aborting transactions which is a pretty important part of a distributed transaction protocol.

The paper admits the algorithm does not guarantee termination but I would have liked to see more details on the failure scenarios regardless (minor details in footnote 3). It's not clear to me what writers see (if anything) when a write fails.

The paper does talk about how non-overlapping transactions won't block each other (which is nice but not a solution) and how one could add the ability to abort and trigger a cleanup via a good (user-supplied) failure-detection module. But having a reliable node-failure monitor that can react fast enough to ensure availability is really the hard part.

Would love to see more on aborting transactions next.



> However it doesn't appear to directly include semantics for aborting transactions which is a pretty important part of a distributed transaction protocol.

Yep, I left this out to avoid confusion at first. There are some details in the "What just happened?" section, but the basic idea is that any aborted write will be stuck in "pending." Same for failed writes; readers won't see these. The algorithm presented actually guarantees "Read Committed" ACID isolation.
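To make that concrete, here's a toy sketch (my own illustration, not the post's actual code) of the "pending"/"good" split: writes sit in a per-transaction pending set until the client confirms the whole transaction is stable, and readers only ever see "good" values, so aborted or failed writes are simply never visible.

```python
# Illustrative sketch of pending/good metadata. Writes staged in
# "pending" are invisible to readers; only promoted ("good") writes
# can be read -- effectively Read Committed isolation.

class Server:
    def __init__(self):
        self.good = {}      # committed key -> value
        self.pending = {}   # txn_id -> {key: value}

    def prepare(self, txn_id, writes):
        """Stage a transaction's writes; invisible to readers."""
        self.pending[txn_id] = dict(writes)

    def commit(self, txn_id):
        """Client saw all servers ack; promote pending writes to good."""
        self.good.update(self.pending.pop(txn_id))

    def read(self, key):
        """Readers never observe pending (uncommitted/aborted) writes."""
        return self.good.get(key)

s = Server()
s.prepare("t1", {"x": 1})
assert s.read("x") is None   # t1 not committed: its write is invisible
s.commit("t1")
assert s.read("x") == 1      # now visible
s.prepare("t2", {"x": 99})   # t2 aborts/fails: stuck in pending forever
assert s.read("x") == 1      # readers still see the last committed value
```

An aborted transaction like `t2` simply never leaves `pending`, which is why the garbage-collection question below matters.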

> But having a reliable node failure monitor that can react fast enough to ensure availability is really the hard part.

Well, you'll remain available for reads and writes, but the size of "pending" might grow. You essentially need asynchronous distributed garbage collection, which will stall in the presence of partitions and may require the failure detectors I mentioned.

> The paper does talk about how non-overlapping transactions won't block each other (which is nice but not a solution)

I don't see how this isn't a solution for transactions that desire last-writer-wins semantics. If, as in the examples I listed, writes commute, then a blocked write shouldn't stall others. If you want to prevent Lost Update or Write Skew anomalies (i.e., concurrent update), then you'll have to give up availability and/or block.


> I don't see how this isn't a solution for transactions that desire last-writer-wins semantics. If, as in the examples I listed, writes commute, then a blocked write shouldn't stall others.

Don't writes that commute mean that there was no contention to begin with? Ideally, if I have a balance of $100 in my account and try to spend $60 in two different transactions, one should come back as failed before the purchases are complete.
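A toy version of this example (my own sketch): blind decrements commute as operations, but the implicit "balance can't go negative" constraint makes two $60 debits against $100 conflict, so one must be rejected.

```python
# Blind debits commute (100 - 60 - 60 is the same in either order),
# but nothing stops the balance going negative. A checked debit is
# effectively non-commutative: its outcome depends on what committed
# before it, so one of the two $60 purchases must fail.

def checked_debit(bal, amount):
    if bal - amount < 0:
        raise ValueError("insufficient funds")
    return bal - amount

balance = 100
balance = checked_debit(balance, 60)      # first purchase: ok, balance = 40
try:
    balance = checked_debit(balance, 60)  # second purchase must fail
except ValueError:
    pass
assert balance == 40
```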

> If you want to prevent Lost Update or Write Skew anomalies (i.e., concurrent update), then you'll have to give up availability and/or block.

There is a difference between giving up read and write availability. My ideal database should be read-available at all times, but guarantee that writes are atomic and durable (and give up availability for this guarantee).

On the whole, this looks pretty neat. I like the idea of the client being responsible for the writes being committed on the servers. The client is then free to choose how to implement the IO, but ultimately, if a single client experiences a failure and a single write doesn't go through, it is usually a better outcome than a write going through and then replication between two servers breaking.

What are your thoughts on quorum-based voting in distributed systems? E.g.: your protocol but with the requirement that a write is considered stable if only most (vs all) of the servers involved have it marked as "good".


Good questions!

> Don't writes that commute mean that there was no contention to begin with? Ideally, if I have a balance of $100 in my account and try to spend $60 in two different transactions, one should come back as failed before the purchases are complete.

Whether conflicting writes are a problem depends on the application semantics. For example, if I'm just adding items to a set, then my updates commute. But if I have a constraint that says that elements in the set need to be unique, then my updates don't (logically) commute any more. Ultimately, this is application-specific. The "CALM Principle" (http://www.bloom-lang.net/calm/, http://vimeo.com/album/2258285/video/53904989) captures this notion of "logical monotonicity" resulting in safe operation despite concurrency. Most of the applications I mentioned, like 2i updates and the social graph example, commute.
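A quick sketch of the distinction (illustrative, not from the post): set insertion commutes, so replicas converge no matter what order concurrent adds arrive in, while a non-monotone uniqueness constraint can be violated in a way no merge order can repair.

```python
# Set adds commute: applying concurrent updates in either order
# yields the same final set, so replicas converge without coordination
# (the "logical monotonicity" idea behind CALM).

r1 = {"a"}          # replica 1's state
r2 = {"a"}          # replica 2's state

for op in ({"b"}, {"c"}):   # replica 1 sees b then c
    r1 |= op
for op in ({"c"}, {"b"}):   # replica 2 sees c then b
    r2 |= op
assert r1 == r2 == {"a", "b", "c"}   # order didn't matter

# But a constraint like "usernames must be unique" is not monotone:
# two replicas can each accept "bob" for different accounts, and the
# merged state violates the constraint regardless of merge order.
claims_r1 = [("bob", "user1")]
claims_r2 = [("bob", "user2")]
merged = claims_r1 + claims_r2
assert len({name for name, _ in merged}) < len(merged)  # duplicate "bob"
```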

For non-commutative operations (and there are plenty), you'll need a stronger model like serializability or sequential consistency that necessarily blocks to prevent concurrent update (or otherwise aborts concurrent updates).

> My ideal database should be read-available at all times, but guarantee that writes are atomic and durable (and give up availability for this guarantee).

This is definitely one point in the spectrum; the question is whether you want to give up availability and write performance. But if you look at workloads like those in the Spanner paper, this is reasonable for many applications.

> What are your thoughts on quorum-based voting in distributed systems? E.g.: your protocol but with the requirement that a write is considered stable if only most (vs all) of the servers involved have it marked as "good".

There's a difference between quorums over replicas of the same data item and quorums over different data items. Using quorums over replicas would help ensure that the replicas provide properties like linearizability or register semantics, as in Dynamo and other systems that offer a "strong consistency" option. But it's not clear to me that quorums spanning different data items would provide the same atomicity property--does that make sense?

I intentionally left many of the issues of replication to the footnotes in the post--mostly for readability and clarity--but I believe the technique is applicable to both linearizable/"CP" and "eventually consistent"/"HA"/"AP" systems.


> For non-commutative operations (and there are plenty), you'll need a stronger model like serializability or sequential consistency that necessarily blocks to prevent concurrent update (or otherwise aborts concurrent updates).

The problem, to me, is that the interesting cases always arise when updates are not commutative. For example, real-time sensor readings, where frequent updates completely override previous state, are very easy to get right if you don't need the latest data at all times: you simply get recent or slightly stale data, and the service stops responding if all sensor data is old. There are many solutions to this case, and some are less complex than yours.

However, a read-update-write transaction, where the values you write depend on the values you read, will indeed require stronger guarantees. This is where a lot of systems get into trouble. They seem to either implement the fuck-it mode or attempt some kind of distributed locking, which usually takes a huge performance hit even when the network is fine.

> This is definitely one point in the spectrum; the question is whether you want to give up availability and write performance. But if you look at workloads like those in the Spanner paper, this is reasonable for many applications.

Yes, the idea is that the system becomes read-only but the data remains consistent and online. In most workloads that I've seen this is the desired behavior. For some reason I haven't seen this behavior implemented yet, though it's possible I just haven't looked at the right data store.

> There's a difference between quorums over replicas over the same data item and quorums over different data items. Using quorums over replicas would help ensure that the replicas provided properties like linearizability or register semantics like Dynamo or other systems that provide an option for "strong consistency." But it's not clear to me that quorums over replicas for different data items would provide the same atomicity property--does that make sense?

Unfortunately, you lost me here. I am thinking of the quorum over each register as the pass-fail test for whether a transaction can proceed. Basically, client A connects to servers X and Y and says "set t = 1; set s = 2; ... commit;". During the ..., client B connects to servers Y and Z and says "set u = 9; set t = 3;". Here, client B should fail, since client A has not committed. This is determined by the fact that the majority of the servers (Y and Z) cannot agree that t is available for writing. In this case, client B will receive a "success" from Z and a "fail" from Y, which will prompt it to roll back the transaction and start over.

In other words, move the responsibility to coordinate a successful write from the datastore nodes to the single point: the client.
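Here's one way that client-driven majority vote could look (my own sketch of the idea in this comment, not the post's protocol): the client proposes the write to all replicas, commits only if a majority vote "success", and otherwise rolls back and retries.

```python
# Client-coordinated quorum write: a write is stable only if a strict
# majority of replicas accept it; otherwise the client rolls back.

class Replica:
    def __init__(self):
        self.locks = {}    # key -> txn currently preparing it
        self.staged = {}   # txn_id -> (key, value)
        self.data = {}

    def propose(self, txn_id, key, value):
        if self.locks.get(key, txn_id) != txn_id:
            return False   # key is in flight under another transaction
        self.locks[key] = txn_id
        self.staged[txn_id] = (key, value)
        return True

    def commit(self, txn_id):
        if txn_id in self.staged:
            key, value = self.staged.pop(txn_id)
            self.data[key] = value
            self.locks.pop(key, None)

    def rollback(self, txn_id):
        if txn_id in self.staged:
            key, _ = self.staged.pop(txn_id)
            self.locks.pop(key, None)

def try_write(replicas, key, value, txn_id):
    votes = [r.propose(txn_id, key, value) for r in replicas]
    if sum(votes) > len(replicas) // 2:   # strict majority agreed
        for r in replicas:
            r.commit(txn_id)
        return True
    for r in replicas:                    # no majority: abort and retry
        r.rollback(txn_id)
    return False

X, Y, Z = Replica(), Replica(), Replica()
assert try_write([X, Y, Z], "t", 1, "A")    # client A: all replicas agree
Y.propose("A2", "t", 5)                     # another txn holds t on Y
assert not try_write([Y, Z], "t", 3, "B")   # B gets fail from Y, success
                                            # from Z: 1 of 2 is no majority
```

The tradeoff, as discussed above, is that a conflicting in-flight transaction now blocks (or aborts) others, which is exactly the availability cost of preventing concurrent update.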


Thanks for the feedback--these are great points. (post author, btw)

> The problem to me seems to be that the interesting problems always come across the case where updates are not commutative.

I agree that many application-level integrity constraints can't be satisfied with commutative updates. However, I've been surprised how frequently they can work for many web applications and how frequently "fuck-it" integrity constraint maintenance is employed, given (1) faster database operation and (2) asynchronous compensation mechanisms (e.g., bank overdraft fees) in the event of constraint violation. But your point is well-taken!

> Here, client B should fail since client A has not committed.

I think I understand your example, though I'm not clear as to whether s and t are stored on both X and Y or separately as s on X and t on Y. But, in general, writes in 'pending' should not block the insertion of other writes into 'pending' (in 2PC parlance, prepared but non-committed transactions should not block other transactions from preparing). This is fundamental to the non-blocking property of the algorithm here (i.e., http://www.bailis.org/blog/non-blocking-transactional-atomic...). So if client A hasn't committed, it shouldn't stop client B from committing. If client A and client B's writes don't commute, then the clients should use a stronger protocol/isolation level than the effective Read Committed that the NBTA algorithm here provides (doable but requires blocking). Does that make sense?


X, Y, and Z in this example would all have complete copies of the data, so s and t are stored on all three nodes. Client B shouldn't be blocked, but rather told that an error occurred and that it is trying to modify data that is in flight. In other words, writes may fail, but they fail atomically, and in such a way that the client may retry without fear of leaving data in an inconsistent state.

I suppose this method is somewhat similar in philosophy to Software Transactional Memory, but distributed. The hard part here seems to be things like replication and garbage collection. What if client A fails before it commits? How long should the cluster lock the values of s and t before client B is able to modify them? I suppose some type of heartbeat from client to server would be good. Since there would be a strong guarantee that a failed transaction can simply be retried without anything bad happening, client A can come back online and try again once connectivity is restored.
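One way to sketch that heartbeat idea (my own illustration, with made-up names): each pending transaction carries a lease the client must renew; if the client dies, the lease expires and the server garbage-collects the stuck writes, freeing s and t. (Timeouts can misfire under partitions, which is exactly the failure-detector caveat discussed above.)

```python
# Lease-based cleanup of stuck pending transactions: a client that
# stops heartbeating is presumed dead, and its staged writes are
# garbage-collected so other clients can proceed.

import time

LEASE_SECONDS = 0.05

class LeasedServer:
    def __init__(self):
        self.pending = {}   # txn_id -> (lease expiry, staged writes)

    def prepare(self, txn_id, writes):
        self.pending[txn_id] = (time.monotonic() + LEASE_SECONDS, writes)

    def heartbeat(self, txn_id):
        """Client is alive: extend the lease on its pending writes."""
        if txn_id in self.pending:
            _, writes = self.pending[txn_id]
            self.pending[txn_id] = (time.monotonic() + LEASE_SECONDS, writes)

    def gc(self):
        """Drop pending transactions whose lease has expired."""
        now = time.monotonic()
        for txn_id in [t for t, (exp, _) in self.pending.items() if exp < now]:
            del self.pending[txn_id]   # client presumed dead; safe to retry

srv = LeasedServer()
srv.prepare("A", {"s": 1, "t": 2})
time.sleep(0.1)                 # client A crashes and stops heartbeating
srv.gc()
assert "A" not in srv.pending   # lease expired: s and t are free again
```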



