Oren Eini

CEO of RavenDB

a NoSQL Open Source Document Database

time to read 2 min | 345 words

I’m pretty much done with my Rust protocol impl. The last thing that I wanted to try was to see how it would look when I allow messages to be handled out of band.

Right now, my code consuming the protocol library looks like this:
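The snippet itself was an image; a minimal sketch of its shape, with hypothetical names:

```rust
// Sketch (hypothetical names): the handler signature forces us to
// compute the reply and return it immediately, so there is no room
// for asynchronous work.
fn handle_echo(args: &[&str]) -> Result<String, String> {
    Ok(args.join(" "))
}

fn main() {
    match handle_echo(&["hello", "world"]) {
        Ok(reply) => println!("{}", reply),
        Err(e) => eprintln!("error: {}", e),
    }
}
```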

This is pretty simple, but note that the function definition forces us to return a value immediately, and that we don’t have a way to handle a command asynchronously.

What I wanted to do is to change things around so I could do that. I decided to implement the command:

remind 15 Nap

Which should help me remember to nap. In order to handle this scenario, I need to provide a way to do async work and to keep sending messages to the client. Here was the first change I made:

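The screenshot is lost; the changed definition presumably looked something like this (hypothetical):

```rust
use std::sync::mpsc::Sender;

// Sketch of the changed definition: instead of returning the reply,
// a handler receives the sender that renders values to the client,
// and returns an error only if the command itself is invalid.
type CommandHandler = fn(Vec<String>, Sender<String>) -> Result<(), String>;
```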

Instead of returning a value from the function, we are going to give it the sender (which will render the value to the client) and can return an error if the command is invalid in some form.

That said, it means that the echo implementation is a bit more complex.
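The echo implementation was also shown as an image; a sketch of what it turns into:

```rust
use std::sync::mpsc::Sender;

// Sketch (hypothetical names): even the trivial case now has to go
// through the channel and deal with the send possibly failing.
fn handle_echo(args: Vec<String>, replies: Sender<String>) -> Result<(), String> {
    replies
        .send(args.join(" "))
        .map_err(|e| format!("failed to send reply: {}", e))
}
```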

There is… a lot of ceremony here, even for something so small. Let’s see what happens when we do something bigger, shall we? Here is the implementation of the reminder handler:
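The original handler was a screenshot of tokio-based code; as a rough reconstruction of the logic only (hypothetical, using a plain thread and channel rather than whatever the original did):

```rust
use std::sync::mpsc::Sender;
use std::thread;
use std::time::Duration;

// remind <minutes> <text>: parse the arguments, fail fast on bad input,
// then push the reminder to the client when the timer fires.
fn handle_remind(args: Vec<String>, replies: Sender<String>) -> Result<(), String> {
    if args.len() < 2 {
        return Err("usage: remind <minutes> <text>".to_string());
    }
    let minutes: u64 = args[0]
        .parse()
        .map_err(|_| format!("invalid number of minutes: {}", args[0]))?;
    let text = args[1..].join(" ");
    thread::spawn(move || {
        thread::sleep(Duration::from_secs(minutes * 60));
        // the client may have gone away by now; nothing to do about it here
        let _ = replies.send(format!("Reminder: {}", text));
    });
    Ok(())
}
```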

Admittedly, a lot of that is error handling, but there is a lot of code here to do something that simple.  Compare that to something like C#, where the same thing could be written as:
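The C# snippet was lost as well; it presumably looked something like this (hypothetical reconstruction, just to show the contrast):

```csharp
using System;
using System.IO;
using System.Threading.Tasks;

static class Reminders
{
    // async/await does all the waiting with no visible ceremony
    public static async Task RemindAsync(int minutes, string text, TextWriter output)
    {
        await Task.Delay(TimeSpan.FromMinutes(minutes));
        await output.WriteLineAsync($"Reminder: {text}");
    }
}
```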

I’m not sure that the amount of complexity that is brought about by the tokio model, even with the async/await macros is worth it at this point. I believe that it needs at least a few more iterations before it is going to be usable for the general public.

There is way too much ceremony and work to be done, and a single miss and you are faced with a very pissed off compiler.

time to read 5 min | 875 words

In this post, I want to take the notion of doing computation inside RavenDB’s indexes to the next stage. So far, we talked only about indexes that work on a single document at a time, but that is just the tip of the iceberg of what you can do with indexes inside RavenDB. What I want to talk about today is the ability to do computations over multiple documents and aggregate them. The obvious example is in the following RQL query:

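The query was shown as a screenshot; a representative RQL aggregation (not necessarily the original) would be:

```
from Orders
group by Company
select Company, count()
```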

That is easy to understand, it is simple aggregation of data. But it can get a lot more interesting. To start with, you can add your own aggregation logic in here, which opens up some interesting ideas. Event Sourcing, for example, is basically a set of events on a subject that are aggregated into the final model. Probably the most classic example of event sourcing is the shopping cart. In such a model, we have the following events:

  • AddItemToCart
  • RemoveItemFromCart
  • PayForCart

Here is what these look like, in document form:

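The documents themselves were screenshots; hypothetical shapes along these lines:

```js
// Hypothetical event documents for a single cart:
{ "Type": "AddItemToCart",      "Product": "products/1", "Quantity": 2, "Price": 10 }
{ "Type": "AddItemToCart",      "Product": "products/2", "Quantity": 1, "Price": 25 }
{ "Type": "RemoveItemFromCart", "Product": "products/1", "Quantity": 1 }
{ "Type": "PayForCart",         "Amount": 35 }
```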

We add a couple of items to the cart, remove excess quantity and pay for the whole thing. Pretty simple model, right? But how does this relate to indexing in RavenDB?

Well, the problem here is that we don’t have a complete view of the shopping cart. We know what the actions were, but not what its current state is. This is where our index comes into play, let’s see how it works.

The final result of the cart should be something like this:

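A guess at what that looks like (the real screenshot is gone); the quantities are netted out and each product carries the cheapest price seen:

```js
{
    "Products": {
        "products/1": { "Quantity": 1, "Price": 10 },
        "products/2": { "Quantity": 1, "Price": 25 }
    },
    "Paid": {
        "payments/1": { "Amount": 35 }
    }
}
```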

Let’s see how we get there, shall we?

We’ll start by processing the add to cart events, like so:
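The map was an image; here is a sketch of its logic in plain JavaScript (not the exact RavenDB index syntax):

```js
// Sketch: each AddItemToCart event becomes a partial cart with a
// single product entry, keyed by product id.
function mapAddItemToCart(e) {
    return {
        Products: {
            [e.Product]: { Quantity: e.Quantity, Price: e.Price }
        }
    };
}
```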

As you can see, the map phase here builds the relevant parts of the end model directly. But we still need to complete the work by doing the aggregation. This is done in the reduce phase, like so:
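The reduce was also an image; a sketch of the logic:

```js
// Sketch of the first reduce: fold the partial carts together,
// summing quantities and keeping the minimum price seen for each
// product (the customer pays the cheapest price they encountered).
function reduceCart(entries) {
    const products = {};
    for (const entry of entries) {
        for (const [id, item] of Object.entries(entry.Products)) {
            const cur = products[id];
            if (!cur) {
                products[id] = { Quantity: item.Quantity, Price: item.Price };
            } else {
                cur.Quantity += item.Quantity;
                cur.Price = Math.min(cur.Price, item.Price);
            }
        }
    }
    return { Products: products };
}
```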

Most of the code here is to deal with merging of products from multiple add actions, but even that should be pretty simple. You can see that there is a business rule here. The customer will be paying the minimum price they encountered throughout the process of building their shopping cart.

Next, let’s handle the removal of items from the cart, which is done in two steps. First, we map the remove events:
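A sketch of the remove map’s logic:

```js
// Sketch: removals produce a negative quantity and a zeroed price,
// so the reduce can net them out against the additions.
function mapRemoveItemFromCart(e) {
    return {
        Products: {
            [e.Product]: { Quantity: -e.Quantity, Price: 0 }
        }
    };
}
```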

There are a few things to note here: the quantity is negative and the price is zeroed, which necessitates changes in the reduce as well. Here they are:
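A sketch of the changed merging logic, as helpers that the later versions below reuse:

```js
// Sketch: quantities still sum, but a zeroed price can no longer
// drag the minimum down.
function mergeProduct(current, item) {
    current.Quantity += item.Quantity;
    if (item.Price > 0 && (current.Price === 0 || item.Price < current.Price)) {
        current.Price = item.Price;
    }
}

// Sketch: products whose quantity was netted out to nothing get
// dropped from the cart after merging.
function removeEmpty(products) {
    for (const id of Object.keys(products)) {
        if (products[id].Quantity <= 0) delete products[id];
    }
}
```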

As you can see, we now only take the cheapest price above zero, and we’ll remove empty items from the cart. The final step we have to take is to handle the payment events. We’ll start with the map first, obviously.
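A sketch of the payment map’s logic (the original apparently used an empty array for Products; an empty object keeps this sketch consistent with the rest):

```js
// Sketch: payment events contribute no products, only a Paid section
// keyed by a payment id (e.Id is hypothetical).
function mapPayForCart(e) {
    return {
        Products: {},
        Paid: { [e.Id]: { Amount: e.Amount } }
    };
}
```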

Note that we added a new field to the output. Just like we set the Products field in the pay for cart map to an empty array, we need to update the rest of the maps to include a Paid: {} to match the structure. This is because all the maps (and the reduce) in an index must output the same shape.

And now we can update the reduce accordingly. Here is the third version:
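A sketch of that third version, building on the helpers above; the reduce now emits the same shape as the maps, Products plus Paid:

```js
// Sketch: same product merging as before, plus folding all the Paid
// entries from the maps into a single object.
function reduceCart(entries) {
    const result = { Products: {}, Paid: {} };
    for (const entry of entries) {
        for (const [id, item] of Object.entries(entry.Products)) {
            const cur = result.Products[id] ||
                (result.Products[id] = { Quantity: 0, Price: 0 });
            mergeProduct(cur, item); // as defined above
        }
        Object.assign(result.Paid, entry.Paid);
    }
    return result;
}
```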

This is almost there, but we still need to do a bit more work to get the final output right. To make things interesting, I changed things up a bit and here is how we are paying for this cart:

[screenshot: the payment events for this cart]

And here is the final version of the reduce:
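The final reduce was a screenshot too; here is a consolidated guess that puts all the pieces together in one self-contained function:

```js
// A guess at the final version: merge products (cheapest non-zero
// price wins), fold all the payments together, and drop any product
// whose quantity has been netted out to nothing.
function reduceCart(entries) {
    const result = { Products: {}, Paid: {} };
    for (const entry of entries) {
        for (const [id, item] of Object.entries(entry.Products)) {
            const cur = result.Products[id] ||
                (result.Products[id] = { Quantity: 0, Price: 0 });
            cur.Quantity += item.Quantity;
            if (item.Price > 0 && (cur.Price === 0 || item.Price < cur.Price)) {
                cur.Price = item.Price;
            }
        }
        Object.assign(result.Paid, entry.Paid || {});
    }
    for (const id of Object.keys(result.Products)) {
        if (result.Products[id].Quantity <= 0) delete result.Products[id];
    }
    return result;
}
```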

And the output of this is:

[screenshot: the output of the index for this cart]

You can see that this is a bit different from what I originally envisioned. This is mostly because I’m bad at JavaScript and likely took many shortcuts along the way to make things easy for myself. Basically, it was easier to do the internal grouping using an object than using arrays.

Some final thoughts:

  • A shopping cart is usually going to be fairly small, with a few dozen events in the common case. This method works great for that, but it will also scale nicely if you need to aggregate over tens of thousands of events.
  • A key concept here is that the reduce portion is called recursively on all the items, incrementally building the data until we can’t reduce it any further. That means that the output we get must also be valid as input to the reduce. This takes some getting used to, but it is a very powerful technique.
  • The output of the index is a complete model, which you can use inside your system. In the next post, I’ll discuss how we can more fully flesh this out.

If you want to play with this, you can get the dump of the database that you can import into your own copy of RavenDB (or our live demo instance).

time to read 3 min | 410 words

After a lot of trouble, I’m really happy that I was able to build an async I/O implementation of my protocol. However, for real code, I think that I would probably recommend using the sync API instead, since at least that is straightforward and doesn’t incur so much overhead at development time. The async stuff is still very much a “use at your own risk” kind of deal from my perspective. And I can’t imagine trying to use it in a large project without suffering from the complexity.

As a good example, take a look at the following bit of code:

[screenshot: the code in question]

It doesn’t seem to be doing much, right? And it is clear what the intent of the code is.

However, if you try to compile this code you’ll get:

[screenshot: the compiler error]

Now, it took me a long while to figure out what was going on. The issue is that the code I’m seeing isn’t the actual code, because of macro expansions.

So let’s resolve this and see what the expanded code looks like:

This is after formatting, of course, but it certainly looks scary. Glancing at this code doesn’t tell me what the problem was, so I tried replacing the method with the expanded result, and I got the same error, but this time I got it on a line that helped me figure it out. Here is the issue:

[screenshot: the expanded code, with the offending line highlighted]

We use the ? to return early from the poll method, and the Receiver I’m using in this case is defined to have a Result<String, ()>, so this is the cause of the problem.

I returned my own error type as a result, giving me the ability to convert from (), but that was a really hard thing to resolve.
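In code, the kind of fix described here is a custom error type that the ? operator can convert into from both sources. A minimal sketch:

```rust
// Sketch: a custom error type for the protocol, convertible from the
// Receiver's () error and from io::Error, so ? works in both cases.
#[derive(Debug)]
enum ProtocolError {
    ChannelClosed,
    Io(std::io::Error),
}

impl From<()> for ProtocolError {
    fn from(_: ()) -> Self {
        ProtocolError::ChannelClosed
    }
}

impl From<std::io::Error> for ProtocolError {
    fn from(e: std::io::Error) -> Self {
        ProtocolError::Io(e)
    }
}
```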

It might be better if Rust also offered to show the error on the expanded code by default, because it was somewhat of a chore to actually get to this point.

What made this oh so confusing is that I had the exact same code using a Stream<String, io::Error>, and that worked, obviously. But it was decidedly non-obvious to see the difference between two otherwise identical pieces of code.

time to read 5 min | 962 words

I got into an interesting discussion about Event Sourcing in the comments for a post and that was interesting enough to make a post all of its own.

Basically, Harry is suggesting (I’m paraphrasing, and maybe not too accurately) a potential solution to the problem of having the computed model from all the events stored directly in memory. The idea is that you can pretty easily get machines with enough RAM to store a stupendous amount of data in memory. That will give you all the benefits of being able to hold a rich domain model without any persistence constraints. It is also likely to be faster than any other solution.

And to a point, I agree. It is likely to be faster, but that isn’t enough to make this into a good solution for most problems. Let me point out a few cases where this fails to be a good answer.

If the only way you have to build your model is to replay your events, then that is going to be a problem when the server restarts. Assuming a reasonably sized data model of 128GB or so, and assuming that we have enough events to build something like that, let’s say about 0.5 TB of raw events, we are going to be in a world of hurt. Even assuming no I/O bottlenecks, I believe that it would be fair to state that you can process the events at a rate of 50 MB/sec. That gives us just under 3 hours to replay all the events from scratch (500,000 MB at 50 MB/sec is 10,000 seconds, or about 2.8 hours). You can try to play games here: read in parallel, replay events on different streams independently, etc. But it is still going to take time.

And enough time that this isn’t a good technology to have without a good backup strategy, which means that you need to have at least a few of these machines and ensure that you have some failover between them. But even ignoring that, and assuming that you can indeed replay all your state from the events store, you are going to run into other problems with this kind of model.

Put simply, if you have a model that is tens or hundreds of GB in size, there are two options for its internal structure. On the one hand, you may have a model where each item stands on its own, with no relations to other items. Or if there are any relations to other items, they are well scoped to a particular root. Call it the Root Aggregate model, with no references between aggregates. You can make something like that work, because you have a good isolation between the different items in memory, so you can access one of them without impacting another. If you need to modify it, you can lock it for the duration, etc.

However, if your model is interconnected, so you may traverse between one Root Aggregate to another, you are going to be faced with a much harder problem.

In particular, because there are no hard breaks between the items in memory, you cannot safely / easily mutate a single item without worrying about access from another item to it. You could make everything single threaded, but that is a waste of a lot of horsepower, obviously.

Another problem with in memory models is that they don’t do such a good job of allowing you to rollback operations. If you run your code mutating objects and hit an exception, what is the current state of your data?

You can resolve that. For example, you can decide that you have only immutable data in memory and replace it atomically. That… works, but it requires a lot of discipline and makes it complex to program against.
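A minimal sketch of that idea in Rust, with hypothetical types:

```rust
use std::sync::{Arc, RwLock};

struct Model {
    total_orders: u64, // stand-in for the real, rich state
}

// Readers grab an Arc to a consistent, immutable snapshot; a writer
// builds a whole new model off to the side and swaps it in atomically.
struct Store {
    current: RwLock<Arc<Model>>,
}

impl Store {
    fn snapshot(&self) -> Arc<Model> {
        // cheap: clones the Arc, not the model itself
        self.current.read().unwrap().clone()
    }

    fn replace(&self, next: Model) {
        // a mutation that failed halfway never gets here, so readers
        // never observe a half-applied change
        *self.current.write().unwrap() = Arc::new(next);
    }
}
```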

Off the top of my head, you are going to be facing problems around atomicity, consistency and isolation of operations. We aren’t worried about durability because this is a purely in-memory solution, but if we were to add that, we would have ACID, and that does ring a bell.

The in-memory solution sounds good, and it is usually very easy to start with, but it suffers from major issues when used in practice. To start with, how do you look at the data in production? That is something that you do surprisingly often, to figure out what is going on “behind the scenes”. So you need some way to peek into what is going on. If your data is in memory only, and you haven’t thought about how to expose it to the outside, your only option is to attach a debugger, which is… unfortunate. Given the reluctance to restart the server (startup time is high), you’ll usually find that you have to provide some scripting that you can run in-process to make changes, inspect things, etc.

Versioning is also a major player here. Sooner or later you’ll probably put the data inside a memory mapped file to allow for (much) faster restarts, but then you have to worry about the structure of the data and how it is modified over time.

None of the issues I have raised is super hard to figure out or fix, but in conjunction? They turn out to be a pretty big set of additional tasks that you have to do just to be in the same place you were before you put everything in memory to make things easier.

In some cases, this is perfectly acceptable. For high frequency trading, for example, you would have an in memory model to make decisions on as fast as possible as well as a persistent model to query on the side. But for most cases, that is usually out of scope. It is interesting to write such a system, though.

time to read 3 min | 588 words

In my last post, I got really frustrated with tokio’s complexity and wanted to move to using mio directly. The advantages are that the programming model is pretty simple, even if actually working with it is hard. Event loops can cause your logic to spread over many different locations and make it hard to follow. I started to go down that path until I figured out just how much work it would take. I decided to give tokio a second chance, and at this point, I looked into attempts to provide async/await functionality to Rust.

It seems that at least some work is already available for this, using futures plus some Rust macros. That let me write code that looks much more natural, and I actually managed to make it work.

Before I get to the code, I want to point out some concerns that I have right now. The futures-await crate (and indeed, all of tokio) seems to be in a state of flux. There is an await implementation in tokio as well, and I think that there is some merging of all of those libraries into a single whole. What I don’t know, and can’t find any information about, is what I should actually be using and how all the pieces come together. I have to note that even with async/await, the programming model is still somewhat awkward, but it is at a level that I can live with. Here is how I built it.

First, we need to accept connections, which is done like so:
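The code was a screenshot; a sketch of its shape in the futures-await style (hypothetical names, and the crates were in flux at the time, so treat this as shape rather than gospel; the nightly feature flags are omitted):

```rust
use std::io;
use std::sync::Arc;
use futures::Future;
use tokio::net::TcpListener;

// One #[async] on the function as a whole, and one on the for loop
// over the incoming connections. Server and handle_connection are
// hypothetical stand-ins for the rest of the library.
#[async]
fn accept_connections(listener: TcpListener, server: Arc<Server>) -> io::Result<()> {
    #[async]
    for connection in listener.incoming() {
        // each connection gets its own spawned task
        tokio::spawn(handle_connection(server.clone(), connection).map_err(|_| ()));
    }
    Ok(())
}
```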

Note that I have two #[async] annotations. One for the method as a whole and one for the for loop. This just accepts the connection and spawns a task to handle it; the most interesting tidbits are in the actual processing of the connection:

You can see that this is fairly straightforward code. We first do the TLS handshake, then we validate the certificate. If there is an auth error, we send it to the user and back off. If we are successful, however, things get interesting.

I create a channel, which allows me to split off the read and write portions of the task. This means that I can send results out of order, if I wanted to, which is great for the actual protocol handling. The first thing to do is to send the OK string to the client, so they know that we successfully connected, then we spawn the read/write tasks. The write task is pretty simple, overall:
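The write task itself was also a screenshot; a sketch of its likely shape (hypothetical types):

```rust
use futures::sync::mpsc::Receiver;
use tokio::io::{write_all, AsyncWrite};

// Sketch: drain the channel and write each reply, followed by the
// protocol's postfix. write_all() consumes the writer and hands it
// back as the .0 of a tuple, so we thread it through the loop by hand.
#[async]
fn write_task<W: AsyncWrite + 'static>(mut writer: W, replies: Receiver<String>) -> Result<(), ()> {
    #[async]
    for msg in replies {
        writer = await!(write_all(writer, msg.into_bytes())).map_err(|_| ())?.0;
        writer = await!(write_all(writer, b"\r\n")).map_err(|_| ())?.0;
    }
    Ok(())
}
```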

You can see the funny .0 references, which are an artifact of the fact that the write_all() function consumes the writer we pass to it and returns (a potentially different) writer in the result. This is pretty common for functional languages.

I’m pretty sure that I can avoid the two calls to write_all for the postfix, but that is easier for now.

Processing the commands is simple as well:
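Again a sketch of the shape, with a hypothetical Server layout:

```rust
use std::collections::HashMap;
use futures::sync::mpsc::UnboundedSender;

// Hypothetical layout: commands registered by name; each handler
// pushes its replies into the channel that feeds the write task.
type Handler = fn(Vec<String>, UnboundedSender<String>) -> Result<(), String>;

struct Server {
    commands: HashMap<String, Handler>,
}

fn process_command(server: &Server, cmd: Vec<String>, replies: UnboundedSender<String>) -> Result<(), String> {
    match server.commands.get(cmd[0].as_str()) {
        Some(handler) => handler(cmd, replies),
        None => Err(format!("Unknown command: {}", cmd[0])),
    }
}
```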

For each command we support, we have an entry in the server configuration and we fetch and invoke it. The result of the command will be written to the client by the write task. Right now we have a 1:1 association between them, but that can now easily be broken.

And finally, an actual command implementation and running the server itself:
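A sketch of the wiring, with a hypothetical Server API:

```rust
// Sketch: register a command handler and start the server loop.
fn main() {
    let mut server = Server::new("0.0.0.0:4888");
    server.handle("echo", |args, replies| {
        replies
            .unbounded_send(args.join(" "))
            .map_err(|_| "client went away".to_string())
    });
    server.run(); // blocks, driving the tokio runtime
}
```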

This is pretty simple now, and it gives us a nice model to program commands and responses.

I pushed the whole code to this branch, if you care to look at it.

I have some more comments about this code, but I’ll reserve them for another post.

time to read 5 min | 866 words

A large portion of my day to day tasks is to review code. I’m writing this post barely two weeks into the new year, and we already had over 150 PRs going into RavenDB alone.

As a result, I’ve gotten sensitive to certain issues. For example, the following is a suggestion made for fixing an issue in this method declaration:

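The screenshot is gone; the flagged declaration presumably looked something like this (hypothetical signature):

```c
#include <stdint.h>

/* A sketch of the kind of violation flagged. The C convention here is
 * snake_case, so the camelCase parameter draws a review comment, even
 * though both spellings compile just the same: */
int32_t rvn_write_header(const char *fileName, void *header);  /* flagged */
int32_t rvn_write_header(const char *file_name, void *header); /* fixed   */
```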

This is a piece of code (in C) that is meant to handle some low level details for RavenDB. We use the CLR coding conventions for C#, but for C, we have chosen to use a different convention, using snake_case for methods, arguments and variables and SHOUTING_CASE for constants / defines. When reading through the code, I marked this violation of the naming convention for a fix.

This may seem minor, but it is probably annoying for the author of the code. They are interested in comments about the code functionality and utility. Why spend any time on something that doesn’t really have any impact? Both forms of the parameter name are just as readable to me, after all.

Before I get to this part, I want to show another piece of code. Or, to be rather more exact, two pieces of code. One of the reasons that we are using C code is that we can abstract large parts of the underlying platform inside the native code. That means that we have certain parts of the code that are written twice. Once for Windows and once for Linux.

Here is some code for Windows:

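The Windows screenshot is lost; a sketch of its shape (hypothetical name, simplified error handling):

```c
#include <stdint.h>
#include <windows.h>

/* Ensure the file is at least as big as the requested size, growing
 * it by moving the file pointer and setting the end of file there. */
int32_t rvn_ensure_file_size(HANDLE file, int64_t requested_size)
{
    LARGE_INTEGER size, pos;
    if (!GetFileSizeEx(file, &size))
        return -1;
    if (size.QuadPart >= requested_size)
        return 0; /* already big enough */
    pos.QuadPart = requested_size;
    if (!SetFilePointerEx(file, pos, NULL, FILE_BEGIN))
        return -1;
    if (!SetEndOfFile(file))
        return -1;
    return 0;
}
```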

And here is the same code for Linux:

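And a sketch of the Linux counterpart, using the rvn_allocate_file_space() name that the post mentions:

```c
#define _GNU_SOURCE
#include <fcntl.h>
#include <stdint.h>
#include <sys/stat.h>

/* Reserve space with fallocate() when the file is smaller than the
 * requested size. Error handling is simplified for the sketch. */
int32_t rvn_allocate_file_space(int fd, int64_t requested_size)
{
    struct stat st;
    if (fstat(fd, &st) == -1)
        return -1;
    if (st.st_size >= requested_size)
        return 0; /* already big enough */
    /* note: some file systems don't support fallocate(), and need the
     * workaround mentioned in the footnote below */
    if (fallocate(fd, 0, 0, requested_size) == -1)
        return -1;
    return 0;
}
```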

You can see that this is pretty much the same thing, just calling the different APIs for each platform. One thing to notice here is that part of this method’s task is to ensure that the file that we open is at least as big as the initially requested size.

In Windows, to increase or decrease the file size you call SetFilePointer() followed by SetEndOfFile(). On Linux, you have fallocate() and ftruncate()*. This is reflected in the code. The Windows code has a single method to do this work and the Linux code has two: rvn_allocate_file_space() and rvn_truncate_file(), which isn’t shown here.

* Actually, you might not have fallocate(). Some file systems do not support it, and you need to use another workaround.

One of my code review comments was that this needs to be fixed: we should have a _resize_file() method for Linux that would simply call the appropriate method based on the file size. But the question is, why?

These are two separate implementations for two separate operating systems. We are already creating a higher level abstraction with operations that hide many system details. Why do I want to have a passthrough method just because the Windows code has this method?

The answer here, as in the case above with the parameter name, is the same. Consistency.

This is most obvious in the naming convention, but it is the same reasoning I had when I wanted to have the same method structure for both Linux and Windows.

Consistency is key to being able to slog through a lot of code. It is how I (and the rest of the team) can go through thousands of lines of code per week and understand what is going on. Because when we look at a piece of code, it follows certain conventions and structure. Reading the code is easy because we can ignore a lot of the cruft around it and focus on what is going on.

In the case of the Windows / Linux methods, I first read one method and then the next, making sure that we are doing the same thing on all platforms. The different structure (resize vs. allocate) was very obvious to me, which meant that I had to stop, go and look at each method’s implementation to figure out whether there is any meaningful difference between them. That was a chore, and it would only become worse over time as we add additional functionality, so anything that isn’t different because it has to be different should match.

In general, I like code reviews where I can scan through the changes and see not the code, but its purpose. That happens when there isn’t anything there that I really have to think about and the intent is clear.

When writing in C#, we have decades (literally) of experience and organizational inertia that push us in the right direction. When pushing C code into the repository, I started to actually pay attention to those details explicitly, because I suddenly needed to.

This is apparent in code reviews, but it isn’t just the case of me trying to make my own tasks easier. Code is read a lot more often than it is written, and focusing on making the code itself boring will pay off, because what the code is doing is what should be interesting in the long run.

time to read 2 min | 346 words

I kept going with tokio for a while, and I even got something that I think would eventually work. The whole concept is built around streams, so I created a way to generate them. This is basically taking this code and making it async.

I gave up well into the second hour. Here is where I stopped:

[screenshot: the half-finished stream code where I stopped]

I gave up when I realized that the reader I’m using (which is SslStream) didn’t seem to have poll_read. The way I’m reading the code, it is supposed to, but I just threw up my hands in disgust at this point. If it is this hard, it ain’t going to happen.

I wrote a significant amount of async code in C# back when events and callbacks were the only option, and then when the TPL and ContinueWith were the way to go. That was hard, and async/await is a welcome relief, but the level of frustration and “is this wrong, or am I really this stupid?” that I got midway through is far too much.

Note that this isn’t even about Rust. Some of the issues that I ran into were because of Rust, but the major issue that I have here is that I’m trying to write a function that can be expressed in a sync manner in less than 15 lines of code, and that took me about 10 minutes to write the first time. And after spending more hours than I’m comfortable admitting, I couldn’t get it to work. The programming model you have here, even if everything did work, means that you have to either decompose your behavior into streams and interact with them in this manner, or you put everything in nested lambdas.

Neither option makes for a nice model to work with. I believe that there is another async I/O model for Rust, the mio crate, which is based on the event loop model. I’ve already implemented that in C, so that might be a more natural way to do things.

time to read 3 min | 563 words

A nasty cousin of It Works On My Machine is It Fails On That Machine. It is nasty because you know that there is something wrong, but you can’t reproduce it.

The machine in question was our Linux build agent, and the failure in question was a set of tests that failed to perform certain operations when TLS was enabled. The problem? They were failing with I/O errors, but only with TLS, and the connection was using localhost. Further investigation showed that the most likely reason for the failure was a timeout. But how could that be? For fun, sometimes the tests passed. So it wasn’t an issue of a firewall or some such. Testing using openssl s_server and connecting to it manually didn’t show any issues.

The failures reproduced only on that particular build machine. Trying to reproduce the failure on other Linux machines failed. The problem was that these other machines had different kernel versions (shouldn’t matter, probably) or different openssl versions (which likely mattered). We started to investigate what the issue was and tried to set up a secured server on the box and connect to it.

Which worked, so that sucked. Running the tests again showed them failing again… which was confusing.

Eventually we figured out that we couldn’t make it fail when running a server outside the tests, but we could observe that it was slow. Slow as in, took ~2.5 seconds to reply to the client’s hello message.

We tried it in the debugger, and whenever we paused it, it was always some variant of the following:

[screenshot: the paused stack trace, deep inside certificate verification]

That led us to believe that there is something with openssl that could cause this slowdown. We tried to use strace to understand what it was doing, and we got stuff like this:

[screenshot: strace output, showing repeated openat() calls]

To be rather exact, we got over 300,000 calls to openat with some file in that directory.

That… didn’t seem right to us. So we looked a bit deeper. As you can see, during the processing of an SSL connection, we get to X509Chain.Build, which ends up calling X509Store.Certificates, which ends up reading all the certificates that you have in the “~/.dotnet/corefx/cryptography/x509stores/ca” directory.

That directory had… 2,103 certificate files in it. And every single time that we had to do anything with a certificate, we would go and scan through all of them, including paying all the cryptographic costs of verifying all those certs. If we had multiple threads doing that at the same time, we would run into starvation issues. Just a single pass through that was 2.5 seconds for a single core. The test scenario had 20 connections at a minimum going at once, and that was the reason for the timeouts. We basically ran out of CPU because we were spending all of those cycles on the certs.

And why did we have so many certs? The tests create a certificate every time that they run, and register it in the X509Store to make sure that certain features work. But we never deleted it from the store… And over time, we had more and more certs piling up in there, until the load was big enough to start breaking things.
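The fix, conceptually, is just the missing cleanup. Something along these lines (a sketch; testCertificate is hypothetical, standing in for whatever certificate the test registered):

```csharp
using System.Security.Cryptography.X509Certificates;

// Sketch: remove the test's certificate from the CurrentUser CA store
// once the test is done, so the store doesn't grow without bound.
using (var store = new X509Store(StoreName.CertificateAuthority, StoreLocation.CurrentUser))
{
    store.Open(OpenFlags.ReadWrite);
    store.Remove(testCertificate);
}
```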

Oops.

time to read 5 min | 941 words

Now that we have a secured and authenticated connection, the next stage in making a proper library is to make it run more than a single connection at a time. I could have used a thread per connection, of course, or even used a thread pool, but neither of those options is valid for the kind of work that I want to see, so I’m going to jump directly into async I/O in Rust and see how that goes.

The sad thing about this is that I expect that this will make me lose some / all of the nice API that I get for OpenSSL in the sync mode.

Async in Rust is handled by a crate called tokio, and there seems to be active work to bring async/await to the language itself. In the meantime, we have to make do with the usual facilities, which ought to make this interesting.

It actually looks like there is a crate that gives pretty nice handling of tokio async I/O and OpenSSL, so that is encouraging. However, as part of trying to re-write everything in tokio style, I got the compiler very upset with me. Here is the (partial) error message:

[screenshot: the (partial) compiler error]

Last time I had to parse such errors, I was working in C++ templated code and the year was 1999.

And here is the piece of code it so dislikes:

[screenshot: the offending code]

I googled around and there is a detailed answer on a similar topic that, frankly, frightened me. I shouldn’t have to dig this deeply and start drawing diagrams of so many disparate pieces of the code just to figure out a compiler error.

Let’s try to break it into its component parts and see if that makes sense. I reduced the code in question to just:

[screenshot: the reduced code]

Got another big scary error message. Okay, let’s try it without the OpenSSL stuff?

[screenshot: the reduced code, minus the OpenSSL pieces]

This produced the same error, but in a much less scary tone:

[screenshot: the smaller compiler error]

Okay, now this looks about as simple as it can be. And now the fix is pretty obvious:

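The fix was shown as a screenshot; a sketch of its shape (hypothetical context, tokio 0.1 style):

```rust
extern crate futures;
extern crate tokio;

use futures::Future;
use tokio::io::write_all;
use tokio::net::TcpStream;

// Anything that must happen after the write goes in a nested
// and_then(), because write_all() only schedules the work.
fn send_ok(stream: TcpStream) -> impl Future<Item = (), Error = ()> {
    write_all(stream, b"OK\r\n")
        .and_then(|(stream, _buf)| {
            // runs only once the first write has actually completed
            write_all(stream, b"READY\r\n")
        })
        .map(|_| ())
        .map_err(|e| eprintln!("write failed: {:?}", e))
}
```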

The key to understanding this, I believe (I haven’t tested it yet), is that the write_all call will either perform its work or schedule it, so any future work based on it should go in a nested and_then call. So the result of the single for_each invocation is not the direct continuation of the previous call.

That is fine, I’ll deal with that, I guess.

Cue here about six hours of programming montage.

I have been programming for over 20 years, and I like to think that I have been around the block a few times. And the simple task of reading a message from TCP using async I/O took me far too long. Here is what I eventually ended up with:

[screenshot: the eventual message-reading code]

This is after fighting with the borrow checker (a lot, it ended up winning), trying to grok my head around the model that tokio has. It is like they took the worst parts of async programming, married it to stream programming’s ugly second cousin and then decided to see if any of the wedding guests is open for adoption.

And if the last sentence doesn’t make sense to you, you are welcome, that is how I felt at certain points. Here is one of the errors that I ran into:

[screenshot: the confusing error message]

What is this string, where did it come from and why do we have a unit “()” there? Let me see if I can explain what is going on here. Here is a very simple bit of code that would explain things.

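The screenshot is gone; the simple bit of code was presumably along these lines:

```rust
extern crate futures;
extern crate tokio;

use futures::future;

// tokio::spawn() wants a Future with Item = () and Error = (), and
// this future yields a String, hence the confusing mismatch between
// "()" and a string type.
fn main() {
    let greeting = future::ok::<String, ()>("hello".to_string());
    tokio::spawn(greeting); // <- fails to compile: Item is String, not ()
}
```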

And here is the error it generates:

[screenshot: the error it generates]

The problem is that spawn is expecting a future that produces a result that has no meaning, something like: Future<Result<(), ()>>. This makes sense, since there isn’t really anything that it can do with whatever the result is. But the error can be really confusing. I spent a lot of time trying to actually parse it, then I had to go and check the signatures of the methods involved, and then I had to reconstruct the generic parameters that are required, etc.

The fix, btw, is this:

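And a sketch of the fix (the original was a screenshot):

```rust
extern crate futures;
extern crate tokio;

use futures::{future, Future};

// Map the value away so the future's Item becomes (), which is what
// spawn() accepts. (Shown standalone for brevity; in the real code
// this runs inside the server's executor.)
fn main() {
    let greeting = future::ok::<String, ()>("hello".to_string());
    tokio::spawn(greeting.map(|msg| {
        println!("{}", msg); // consume the value here instead of returning it
    }));
}
```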

Ask yourself how long it would take you to figure out what the changes between these versions of the code are without the marker.

Anyway, although I’m happy that I got something done, this approach is really not sustainable. I’m pretty sure that I’m either doing something wrong or missing something. It shouldn’t be this hard. I have some ideas that I want to try, which I’ll talk about in the next post.

time to read 6 min | 1043 words

I had some really interesting discussions while I was at CodeMash, and a few of them touched on modeling concerns in non-trivial architectures. In particular, I was asked about my opinion on the role of OR/M in systems that mostly do CQRS, event processing, etc.

This is a deep question, because at first glance, your requirements from the database are pretty much just:

INSERT INTO Events(EventId, AggregateId, Time, EventJson) VALUES (…)

There isn’t really the need to do anything more interesting than that. The other side of that is a set of processes that operate on top of these event streams and produce read models that are very simple to consume as well. There isn’t any complexity in the data architecture at all, and joy to the world, etc, etc.

This is true, to an extent. But that is only because you have moved a critical component of your system out of the picture: the beating heart of your business. The logic, the rules, the thing that makes a system more than just a dumb repository of strings and numbers.

But first, let me make sure that we are on roughly the same page. In such a system, we have:

  • Commands – that cannot return a value (but will synchronously fail if invalid). These mutate the state of the system in some manner.
  • Events – represent something that has (already) happened. Cannot be rejected by the system, even if they represent invalid state. The state of the system can be completely rebuilt from replaying these events.
  • Queries – that cannot mutate the state

I’m mixing here two separate architectures, Command Query Responsibility Segregation and Event Sourcing. They aren’t the same, but they often go hand in hand, and it makes sense to talk about them together.

And because it is always easier for me to talk in concrete, rather than abstract, terms, I want to discuss a system I worked on over a decade ago. That system was basically a clinic management system, and the part that I want to talk about today was the staff scheduling option.

Scheduling shifts is a huge deal, even before we get to the part where it directly impacts how much money you get at the end of the month. There are a lot of rules, regulations, union contracts, agreements and a bunch of other stuff that relates to it. So this is a pretty complex area, and when you approach it, you need to do so with the due consideration that it deserves. When we want to apply CQRS/ES to it, we can consider the following factors:

The aggregates that we have are:

  • The open schedule for two months from now. This is mutable, being worked on by the head nurse and constantly changing.
  • The proposed schedule for next month. This one is closed, and changes only rarely, usually because of big stuff (someone being fired, etc).
  • The planned schedule for the current month, frozen, cannot be changed.
  • The actual schedule for the current month. This is changed if someone doesn’t show up for their shift, is sick, etc.

You can think of the first three as various stages of a PlannedSchedule, but the ActualSchedule is something different entirely. There are rules around how much divergence you can have between the planned and actual schedules, which impacts compensation for the people involved, for example.

Speaking of which, we haven’t yet talked about:

  • Nurses / doctors / staff – which are being assigned to shifts.
  • Clinics – a nurse may work in several different locations at different times.

There is a lot of other stuff that I’m ignoring here, because it would complicate the picture even further, but that is enough for now. For example, regardless of the shifts that a person was assigned to and showed up for, they may have worked more hours (had to come to a meeting, drove to a client) and that complicates payroll, but that doesn’t matter for the scheduling.

I want to focus on two actions in this domain. First, the act of the head nurse scheduling a staff member to a particular shift. And second, the ClockedOut event which happens when a staff member completes a shift.

The ScheduleAt command places a nurse at a given shift in the schedule, which seems fairly simple on its face. However, the act of processing the command is actually really complex. Here are some of the things that you have to do:

  • Ensure that this nurse isn’t scheduled for another shift, either concurrently or too close to another shift at a different address.
  • Ensure that the nurse doesn’t work with X (because issues).
  • Ensure that the role the nurse has matches the required parameters for the schedule.
  • Ensure that the number of double shifts in a time period is limited.

The last one, in particular, is a sinkhole of time. Because at the same time, another business rule says that we must give each nurse N number of shifts in a time period, and yet another dictates how to deal with competing preferences, etc.

So at this point, we have ScheduleAtCommand.Execute(), and we need to apply logic: complex, changing, business-critical logic.

And at this point, for that particular part of the system, I want to have a full domain, abstracted persistence and be able to just put my head down and focus on solving the business problem.
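To make that concrete, here is a sketch of the shape such a command handler might take. All of the types here are hypothetical stand-ins, not the original system’s code; the point is that the rich, rule-heavy logic lives in a real domain model:

```csharp
using System;

// Hypothetical domain types throughout; each check is a stand-in for
// a real, evolving business rule.
public class ScheduleAtCommand
{
    public string NurseId;
    public string ShiftId;

    public void Execute(ISchedulingSession session)
    {
        var nurse = session.Load<Nurse>(NurseId);
        var shift = session.Load<Shift>(ShiftId);

        if (nurse.HasShiftTooCloseTo(shift))
            throw new InvalidOperationException("Nurse is scheduled too close to this shift");
        if (!nurse.Role.Satisfies(shift.RequiredRole))
            throw new InvalidOperationException("Role does not match the shift requirements");
        if (nurse.DoubleShiftsIn(shift.Period) >= shift.Clinic.MaxDoubleShifts)
            throw new InvalidOperationException("Too many double shifts in this period");

        shift.Assign(nurse); // the mutation itself, persisted by the session
    }
}
```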

The same applies to the ClockedOut event. Part of processing it means that we have to look at the nurse’s employment contract, count the amount of overtime worked, compute the total number of hours worked in a pay period, etc. Apply rules from the clinic to the time worked, apply clauses from the employment contract to the work, etc. Again, this gets very complex very fast. For example, if you have a shift from 10 PM to 6 AM, how do you compute overtime? For that matter, if this is on the last day of the month, when do you compute overtime? And what pay period do you apply it to?

Here, too, I want to have a fully fleshed out model, which can operate in the problem space freely.

In other words, a CQRS/ES architecture is going to have the domain model (and some sort of OR/M) in the middle, doing the most interesting things and tackling the heart of complexity.
