How I learned about Merklix trees (without having to become a cryptocurrency enthusiast)

composition.al 2019-06-28

In my distributed systems course a couple weeks ago, we were discussing how the anti-entropy mechanism in the Dynamo key-value store uses Merkle trees to efficiently compare the data stored on different replicas. The idea is this: suppose that two replicas (on different physical machines, perhaps on opposite sides of the planet) are each storing some large number of key-value pairs. Suppose you want to make sure that the values of the keys on replica A are the same as those of the corresponding keys on replica B. (For instance, maybe replica A got updated while replica B was unreachable.)

So, you need to compare the values of keys somehow, and this comparison must be computed somewhere — either on replica A or replica B. If you’re going to do it on replica A, then the value to be compared with has to be sent over the network from replica B, and vice versa.¹ Since these values could potentially be quite large, you can instead send a hash of the value to be compared and then compare the values by their hashes instead of by their actual contents. But if you want to do this comparison for every key-value pair, even sending the hashes could take more bandwidth than you want to use.

Merkle trees address this issue by storing the data blocks (in this case, key-value pairs) at the leaves of a tree. Each leaf node is labeled with the hash of its data block, and each non-leaf node is labeled with the hash of its children’s labels. The root of the tree, then, is labeled with a single hash value, and this single value is what you send over the network when you need to do a comparison. If the hashes agree, great! Your replicas are in the same state. If they disagree, then you need to make more comparisons, one for each child of the root (two comparisons if your tree is a binary tree; k comparisons if it’s a k-ary tree). If any of those comparisons agree, then you can ignore the children of those nodes because you know that they agree on the data, and so on.

In this way, Merkle trees enable efficient comparison of large data sets. If you have N data blocks, and there’s only one place where your replicas disagree on the data — say, the value associated with a single key — then you’ll only need to make O(log N) comparisons of hashes to find the point of disagreement, rather than N comparisons.

When I discussed this in class, the example I drew on the board had two binary trees, each with four leaves, and a key-value pair at each leaf, and with each tree having the same four keys and differing in only one of the values. I made a big show of pointing out how we could ignore the side of the tree that was the same. But then a student in my class asked a great question that I didn’t know the answer to: how does this Merkle tree thing work if the replicas don’t know about the same set of keys?

Put on the spot in class, I didn’t have an answer. Going back and looking at the Dynamo paper, now, it seems like the answer is that since all the replicas know which key ranges they’re responsible for, they can just have an entry in the Merkle tree for each key in the range, whether or not there’s actualy a value associated with that key on a given replica. Because these ranges are huge, you’d want to use an efficient “sparse Merkle tree” representation for the tree to be of tractable size.

A couple of days later, though, I happened to be hanging out with Graydon and posed the question to him, and he had a more interesting answer. He said that what you could do is store the data in a bitwise prefix tree (or radix tree, or trie, or whatever it is in your idiolect), keyed on hashes of the keys. But you also hash the values themselves and label leaves with hashes of values, and internal nodes with hashes of children’s label, in Merkle tree fashion. This took a while for me to get my head around, because there are two different things to hash. It wouldn’t make any sense to only hash the keys for the Merkle tree, because it’s the values themselves that you actually want to compare. But for the prefix tree, you do want to only hash the keys, because you need a given key-value pair to be stored at the same place in the tree, regardless of the value. The two hashes are unrelated; you wouldn’t need to use the same hash function.

The resulting data structure is called a “merkleized” (or “merkelized”; spellings vary) prefix tree, a prefix tree with a “Merkle overlay”, or my favorite, a Merklix tree. But it isn’t only for blockchain and blockchain-related program activities, despite what those search results might have you believe. Graydon told me that this was in fact the data structure he implemented for synchronizing copies of objects in the Monotone version control system, circa 2004 or so! I asked what name he used for it, and he said that at the time he just called it a Merkle tree because it didn’t occur to him that there was a version that wasn’t a prefix tree. That amuses me because these days, if I do, say, a Google image search for “merkle trees”, all I see are boring old perfect binary trees like the one I drew on the board in class.

In any case, if replicas had different sets of keys and you didn’t have a priori knowledge of key ranges: you could use Merklix trees, determine that the replicas have different sets of keys by looking only at the prefix tree structure, and then use the Merkle overlay to sync up the keys that disagree on value. This doesn’t seem to be what Dynamo actually does, but I was happy to learn about it anyway.

Presumably you would be also storing some kind of timestamp for each key, whether logical or physical, that would tell you which copy of it had been updated more recently, but that’s a whole other big ol’ can of worms.↩