Rhino DHT and failover and replication, oh my!
My initial design when building Rhino DHT was that it would work in a similar manner to Memcached, with the addition of multi-versioned values and persistence. That is, each node is completely isolated from all the rest, and it is the client that actually creates the illusion of distributed cohesion.
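As a minimal sketch of that "isolated nodes, smart client" model: each client hashes the key onto a fixed node list and talks to that node directly, with no node-to-node communication at all. The class and hashing scheme below are illustrative only, not the actual Rhino DHT client code.

    import hashlib

    class IsolatedNodesClient:
        """Each node is independent; the client alone decides where a key lives."""

        def __init__(self, node_urls):
            self.node_urls = node_urls

        def node_for(self, key):
            # Hash the key onto the fixed node list, so every client that shares
            # the same node list agrees on where a given key belongs.
            digest = hashlib.sha1(key.encode("utf-8")).digest()
            index = int.from_bytes(digest[:4], "big") % len(self.node_urls)
            return self.node_urls[index]

    client = IsolatedNodesClient(["http://node1:2201", "http://node2:2201", "http://node3:2201"])
    print(client.node_for("saga/1234"))   # always resolves to the same node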
The only problem with this approach is reliability. That is, if a node goes down, all the values that are stored in it are gone. This is not a problem for Memcached. If the node is down, all you have to do is hit the actual data source. Memcached is not a data store, it is a cache, and it is allowed to drop values whenever it wants to.
For Rhino DHT, that is not the case. I am using it to store the saga details for Rhino Service Bus, as well as storing persistent state.
The first plan was to use it as is. If a node is down, it would cause an error during the load-saga-state stage (try to say that three times fast!), which would eventually move the message to the error queue. When the node came back up, we could move the messages from the error queue back to the main queue and be done with it.
My current client had some objections to that. From his perspective, if any node in the DHT was down, the other nodes should take over automatically, without any interruption of service. That is… somewhat more complex to handle.
Well, actually, it isn’t more complex to handle. I was able to continue with my current path for everything (including full transparent failover for reads and writes).
What I was not able to solve, however, was how to handle a node coming back up. Or, to be rather more exact, I ran into a problem there because the only way to solve this cleanly was to use messaging. But, of course, Rhino Service Bus is dependent on Rhino DHT, and creating a circular reference would just make things more complex, even if it were broken up with interfaces in the middle.
Therefore, I intend to merge the two projects.
Also, two points if you can tell me why I have used this image for this post.
The design for the new version of Rhino DHT is simple. We continue to support only three operations on the wire: Put, Get, and Remove. But we also introduce a new notion: failover servers. Every node in the DHT has secondary and tertiary nodes defined for it. Those nodes are also full-fledged nodes in the DHT, capable of handling their own stuff.
During normal operation, any successful Put or Remove operation will be sent via async messages to the secondary and tertiary nodes. If a node goes down, the client library is responsible for detecting that and moving to the secondary node, and to the tertiary one if that is down as well. Get is pretty simple in this regard; as you can imagine, the node simply needs to serve the request from local storage. Put and Remove operations are more complex. The logic for handling them is the same as always, including all the conflict resolution, etc., but in addition to that, the Put and Remove requests will generate async messages to the primary and tertiary nodes (if using the secondary as fallback, and to the primary and secondary if using the tertiary as fallback).
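To make the flow concrete, here is a minimal sketch of both halves of that, assuming a fallback chain on the client and a fire-and-forget replication channel on the node; all of the class and method names here are invented for illustration, not the actual Rhino DHT API.

    class NodeDownError(Exception):
        pass

    class FailoverClient:
        """Client side: try the primary, then the secondary, then the tertiary."""

        def __init__(self, primary, secondary, tertiary):
            self.fallback_chain = [primary, secondary, tertiary]

        def get(self, key):
            for node in self.fallback_chain:
                try:
                    return node.get(key)   # served purely from that node's local storage
                except NodeDownError:
                    continue               # node is down, fall back to the next one
            raise NodeDownError("primary, secondary and tertiary are all unreachable")

        def put(self, key, value, parent_versions):
            for node in self.fallback_chain:
                try:
                    return node.put(key, value, parent_versions)
                except NodeDownError:
                    continue
            raise NodeDownError("no node available to accept the write")

    class NodeServer:
        """Node side: apply the write locally, then replicate it asynchronously."""

        def __init__(self, local_store, replication_bus, backup_nodes):
            self.local_store = local_store
            self.replication_bus = replication_bus   # async, fire-and-forget channel
            self.backup_nodes = backup_nodes         # the two other nodes in this key's failover set

        def put(self, key, value, parent_versions):
            # Same logic as before, including the multi-version conflict resolution.
            version = self.local_store.put(key, value, parent_versions)
            # Queue async messages so the other two nodes eventually see the write.
            for node in self.backup_nodes:
                self.replication_bus.send(node, ("put", key, value, version))
            return version

Remove would follow the same pattern as Put on both sides: apply locally on whichever node accepted it, then queue async messages to the other two.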
That way, when the primary comes back up, it can catch up with the work that was done while it was down.
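Assuming those async messages accumulate in a durable queue per node while it is offline (this is exactly the part that wants messaging, and hence the merge with Rhino Service Bus), the catch-up step could look roughly like this; the message shape and the store methods are invented for the example.

    def catch_up(local_store, incoming_replication_queue):
        """Drain the replication messages that piled up while this node was down."""
        while True:
            message = incoming_replication_queue.try_receive()
            if message is None:
                break                                  # queue drained, the node is current again
            operation, key, value, version = message
            if operation == "put":
                # Applied with the usual multi-version conflict resolution, exactly
                # as if a client had issued the Put against this node directly.
                local_store.apply_replicated_put(key, value, version)
            elif operation == "remove":
                local_store.apply_replicated_remove(key, version)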
That leaves us with one issue: where do we store the data about the actual nodes? That is, the node listing, which node is the secondary / tertiary for which, etc.
There are a few constraints here. One thing that I really don't want is duplicate configuration. Even worse than that is the case of conflicting configurations; that can really cause issues. We deal with that by defining a meta-primary and a meta-secondary for the DHT as well. Those will keep track of the nodes in the DHT, and that is where we configure who goes where. Replication of this value between the two meta nodes is automatic, based on the information in the primary; the secondary node is a read-only copy, in case the primary goes down.
The only configuration that we need for the DHT, then, is the URLs of the meta-primary and meta-secondary.
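As a sketch of what that bootstrap lookup might look like: assume a NodeInfo record per node and some fetch function that asks a meta node for the listing (both invented for this example; the post only defines what the meta nodes hold).

    from dataclasses import dataclass

    @dataclass
    class NodeInfo:
        url: str          # the node itself
        secondary: str    # URL of the node that backs it up first
        tertiary: str     # URL of the node that backs it up second

    def load_topology(meta_primary_url, meta_secondary_url, fetch):
        """fetch(url) returns the list of NodeInfo held by a meta node, or raises IOError."""
        try:
            return fetch(meta_primary_url)        # normal case: ask the meta-primary
        except IOError:
            return fetch(meta_secondary_url)      # meta-primary is down: use the read-only copy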
Another important assumption that I am making for now is that the DHT is mostly static. That is, we may have nodes coming up and down, but we don't have to support nodes joining and leaving the DHT dynamically. This may seem like a limitation, but in practice this isn't something that happens very often, and it significantly simplifies the implementation. If we need to add more nodes, we can do it at a deployment boundary, rather than on the fly.
Comments
"Lions, tigers and bears, oh my!" - from Wizard of Oz. Related to your title, but not so sure how it's related to DHT? :)
If you need a distributed cache solution for a small customer, I might suggest looking at GigaSpaces which has a free license for companies with less than $5M annual revenue.
Here's how google does something similar. I know it doesn't exactly map to what you are doing but may spark an idea for you.
code.google.com/.../mapreduce-tutorial.html
"To detect failure, the master pings every worker periodically. If no response is received from a worker in a certain amount of time, the master marks the worker as failed. Any map tasks completed by the worker are reset back to their initial idle state, and therefore become eligible for scheduling on other workers. Similarly, any map task or reduce task in progress on a failed worker is also reset to idle and becomes eligible for rescheduling.
Completed map tasks are re-executed when failure occurs because their output is stored on the local disk(s) of the failed machine and is therefore inaccessible. Completed reduce tasks do not need to be re-executed since their output is stored in a global file system."
To do this correctly you would need a Paxos implementation built into the DHT, or a separate Google Chubby-like service running an implementation.
Anon,
No, I would not.
Conflict resolution is handled by the client, not by the server.
Then the clients would need it implemented :) It's really only for cases where you have multiple clients accessing the same DHT nodes and needing to perform 'interlocked' operations on them.
Anon,
a) please select a different handle. It is annoying to have anonymous comments
b) check to see how Amazon Dynamo works.
When do we get an NHibernate second level cache provider!!!!
I'd love some failover....memcached is hard to sell to a strict .NET client.
Christopher,
a) feel free to write it :-)
b) NMemcached ?
What about something like Velocity? I don't know if they support versioning right now, but it definitely supports HA and replication.
HA?
HA = High Availability???