Rhino DHT: Concurrency handling example – the phone billing system
I got into a discussion today about how we are dealing with concurrency, and I have had a few good examples that I think worth putting in writing. The first of them is the phone billing system. This is, by nature, a distributed and concurrent system, and it is pretty easy to understand, I think.
We store the billing information for each customer (keyed by the phone number) in the DHT. The initial state looks like this:
The balance is what the account has, the call & SMS are the actions on the account. For the purpose of discussion, sending SMS costs 2$ and 1 minute call cost 5$.
And then the following happens. A phone call is made at the same time that a couple of SMSes is sent and a bill is paid. You can see that in the following picture:
Each of those actions are handled by a different node. We will deal with them in sequence, because writing parallel hard be is.
A phone call is made, so we need to record that it happened. We get the current billing information from the DHT and add a new action:
At the same time, we also send a couple of SMS messages. Again, we get the current billing information (and we get version 42), add the action and saving it back. However, we don’t have the most current version, so the DHT accepts the update and now we have two versions for key 555-5421. This is expected and normal behavior.
You should also note that we have an overdraft charge, for going over our account balance.This is something that was added to the account as part of the business logic of processing those the call. Being a responsible adult, the bill is paid at the exact time to avoid an overdraft charge. That one is handled according to the same approach, get the billing information from the DHT (and again we get version 42), modify it and save.
Now we have the following situation:
All three are valid, I have to say. When we ask the DHT to get a value by key, we will get all three versions back into a coherent vision of what actually happened.
First, I should mention that this is not a generic solution for all problems. There are likely to be problems that you’ll not be able to resolve using this approach.
One thing that you might have noticed is that each of the items is tagged with a number. In real life, it would be a guid, but no one can remember a guid by looking at it, so I made it a number that is easy to remember. This id can uniquely identify an item across multi machines and concurrent versions.
The algorithm for merging those three versions together is actually quite simple. It goes something like this:
1: public BillingStatementState Merge(BillingStatementState[] states)
2: {
3: var mergedState = new BillingStatementState();
4:
5: foreach (var balanceItem in states.SelectMany(b=>b.Balances)
6: {
7: if(mergedState.HasBalanceItem(balanceItem.Id) == false)
8: mergedState.AddBalanceItem(balanceItem);
9: }
10:
11: foreach (var item in states.SelectMany(s=>s.ActionItems))
12: {
13: if(mergedState.HasActionItem(item.Id))
14: continue;
15: mergedState.AddActionItem(item);
16: }
17:
18: mergedState.RecalculcateCharges();
19:
20: return mergedState;
21: }
22:
RecalcuateCharges is responsible to add / remove overdraft charges based on the new information.
What we are basically doing is quite simple, we copy all the new information to the new state, and we know that it is new because we have a unique id that can identify each item. The only remaining bit of complexity is that we now need to recalculate the charges.
As you’ll see in a future post, “recalculating” isn’t really it, you usually have to perform some compensating actions as well, but that is beside the point for now.
Given the above code, we can safely merge the three versions, and make them into a single big version.
The DHT will notice that the new value is the child of all current valid versions, accept the update and remove all other versions.
As I said, it is not something that can fit any scenario, but it can fit a surprisingly wide area of them.
Comments
Why 44th version (the green one) also contains that 2min call? Shouldn't that call be 43rd version's exclusive item?
The purple 44th version (the one with +20 balance) should probably be numbered 45. And here again this strange 2min call from v43. Right?
Does the merge ability planned to be embedded into DHT or to be implemented in a separate scenatio/saga?
meowth,
This is something that the client library will handle.
Check and see Rhino Service Bus' way of doing this.
That is why you should not blog about complex topics after midnight.
There were numerous errors in the diagrams. All should be fixed now
In the merge you say ' if the merged state has a balance item with the same id, then add it ', is this correct? I'm not sure how the first loop would do anything because a new BillingStatementState wouldn't contain anything right?
That is a bug in the code, sorry.
Fixed it
I don't know if you're making a joke or a typo but "writing parallel hard be is" is very funny in pirate English and basically sums up my feelings about the subject after reading your DHT posts. Arrghh.
That was intentional :-)
Version 46 is missing the 2 min call that costs $10 (332-843...btw, what does that number represent?), and caused the initial overdraft charge, isn't it?
jdn,
Thanks, fixed.
It is a number that is a numeric representation of a GUID. I didn't want to put a real guid because non one can just read them
I think that your solution works under the following assumptions:
The concurrent events have no context - they can happen regardless of state.
The concurrent events have no interactions and there is no importance to their order in time (this is kind of a special case of 1).
Why is it important? Lets say number 1 doesn't hold. Lets say we do not allow a user to call once he has a dept of 100$. If he is at 94$, and he does an sms and a 1 minute call concurrently, he should be 94+2+5=101$ in dept.
In this situation we shouldn't have let the user make both the call and the sms.
We will discover this too late - on Merge() and at that point we will be in an illegal system state.
I liked your general direction for solving concurrency issues.
I wonder if you have any other way to solve this... except locking mechanisms! :)
Ofer,
Not quite, this is valid even if you have context and interactions.
The merge tend to happen shortly after a conflict occurs, usually, it is resolved within a few hundreds milliseconds of its occuring.
You can usually tolerate an inconsistent state for that time frame. Even if you rejected a call that should have been made, the user can try calling back again immediately and get a connection.
Ofer, how do you handle the customer being at say $95 overdrawn, and the limit is $100 before she aparently gets blocked, say she is charged $1 a minute, does she get cut off after the 5 minutes? how does policy of complex systems like this work? seems to me you cant really ensure that kinda policy at this level..
I would of thought you could avoid a lot of the billing trouble by not involving cost information at this stage.. just recording all the billable things then working out at a specific moment in time (end of month say) what the costs would be when you have the complete picture.
Ayende,
If you can take those assumptions then I agree - your method of dealing with concurrency must be super-fast in performance.
I was talking about the general case.
Stephan,
I think you can decide on whatever policy that suits you, as long as you can enforce it. Cutting the call is certainly an option.
Ofer,
I recommend that you would read about the CAP theorem.
There is no general case in this situation. What you want to be able to do is to select which two you want to use at any given point in time, consistency, availability or partition.
Ayende,
I just want to add some comments which could help me to better understand the whole picture.
1 All unresolved co-existing states should be treated as events occurred at the same time. If developer still needs to know the real order of the unresolved events (who knows the real tasks where Rhino DHT could be used?) he could add special properties to the message type. The property could be used to resolve the order of the unresolved messages on the client side.
2 If developer still needs to have locking mechanism, could it be implemented using current approach?
How do we download your DHT? Thanks
rhino-tools.svn.sourceforge.net/.../rhino-dht
Comment preview