Conflict-free replicated data types (CRDT) is one data structure I found truly fascinating. Basically, it enables distributed machines to eventually sync their data over time, i.e., eventual consistency.

To illustrate what I mean, consider the following scenario:

  • Two person, A and B, are doing collaborative work in a shared document, e.g., Google Docs
  • There is no centralized server managing the state b/w A and B.
  • A sends state updates to B, then B changes it's state to reflect A's updates, and vice versa.    

Now, for some unknown reason, they got disconnected from each other for some time intervals. During this time, both person made changes to the document. Once they are reconnected, how should the program resolve their updates such that both have equivalent states?

One simple approach is to append timestamps to the updates and order events based on the timestamps. This approach, however, is predicated on the assumption that both machines have synchronous clocks that are immune to drifting and modification. This may result in data corruption depending on your use case.

CRDTs are intended to address this issue. Essentially, the goal is to structure the state updates in such a way that merge conflicts cannot occur*. This is accomplished by creating a data structure that has the following properties:

  • Merging updates are commutative - which means that if we have a MERGE function that applies state updates, the order in which each update is applied should be irrelevant to the resulting state.  

   C = MERGE( A, B) = MERGE(B, A)

  • Merging updates are idempotent - which means that merging the same updates more than once results in the same state.  

   C = MERGE( A, B ) = MERGE(C , A) = MERGE(C, B)    

This is essentially combining updates in any order and frequency while producing the same result. This makes it ideal for distributed systems where communication packets may arrive duplicated, late, or in an erroneous order.

This approach is quite powerful and is widely used in distributed systems today. It's the ability to resolve conflicts in distributed databases as well as synchronize multiple nodes of a distributed system makes it an essential tool in today's systems.

Playing with CRDTs

There are a lot of implementations of CRDTs out there. In JavaScript, for instance, we have Y.js (https://github.com/yjs/yjs) and automerge (https://github.com/automerge/automerge). There’s also a Y.js demo (https://demos.yjs.dev/prosemirror/prosemirror.html) that allows you to play around with them and have your collaborative app running in just a few seconds. All messages are exchanged via webRTC while the state is managed via CRDTs. This can be a great sandbox to understand how CRDTs work.

I also attempted to build an offline-first application with Y.js-powered device syncing. The goal is to create a habit-tracking app that does not rely on a central server (https://habit-board.netlify.app).

I had a lot of fun messing around with it. Overall, CRDTs are an incredibly useful structure for distributed systems and are gaining a lot of momentum. It's definitely worth investing time in them to better understand how they work and how you can apply them.

Conclusion

In conclusion, CRDTs provide an efficient way to keep distributed systems in sync. By structuring their data in such a way that merging is commutative and idempotent, they are able to maintain the same consistency regardless of delivery order and duplicate packets. This technique is widely used in distributed systems today and is gaining more and more traction. It’s definitely worth looking into this structure and understanding how it works and how you can apply it in your own systems.

* I may be relaxing some jargon here, what I mean is that there’s a deterministic algorithm or scheme that is guaranteed to combine any state updates such that no merge conflict, that demands manual resolution, is produced.

References