time to read 3 min | 465 words

We are testing RavenDB on a wide variety of software and hardware, and a few weeks ago one of our guys came to me with grave concern. We had a major regression in performance on Linux. And major as in 75% slower than it was a few weeks earlier.

Testing at that point showed that indeed, there was a big performance gap between the same benchmark on that Linux machine and on a comparable machine running Windows. That was worrying, and it took us a while to figure out what was going on. The problem was that we had run into that exact same scenario before. The I/O patterns that are most suitable for Linux are pretty bad for Windows, and vice versa, so optimizing for each requires a delicate hand. The expectation was that we had done something that overloaded the system somehow and caused a major regression.

A major discovery was that it wasn’t Linux per se that was slow. Testing the same thing on a significantly smaller machine showed much better performance. We still had to rule out a bunch of other things, such as a specific setting or behavior that we would trigger only on that particular machine, but it seemed promising. And that was the point when we looked at the hardware. That particular Linux machine is an old development machine that has gone through several developer upgrade cycles, and when it was rebuilt, we used the most easily available disk that we had on hand.

That turned out to be a Crucial SSD 128GB M22 disk. To those of you who don’t keep a catalog of all hard disks and their model numbers, there is Google, which will tell you that this disk has been out for nearly a decade, and that this particular one has been shuffling bits in our offices for about 7 years or so. In its life, it has been subjected to literally thousands of database benchmarks, reading and writing very large amounts of data.

I’m frankly shocked that it is still working, and it is likely that there is a lot of internal error correction going on. But the end result is that it predictably generates very unpredictable I/O patterns, and it is a great machine for testing what happens when things start to fail in a very ungraceful manner (a write to the local disk that takes 5 seconds but also blocks all other I/O operations in the system, for example).

I’m aware of things like nbd & trickle, but it was a lot more fun to discover that we can just run stuff on that particular machine and find out what happens when a lot of our assumptions are broken.

time to read 12 min | 2208 words

You might have seen me talking about how close we are to a RavenDB beta release. Today marked a very important step along the route to an actual release. I’ve shifted my focus. Instead of going head down in the code and pushing things forward and doing all the sort of crazy stuff that you have seen me talking about for the past year and a half, I got started on the Inside RavenDB 4.0 book.

I say started because just the rough table of contents took me almost the entire day to complete. I’m expecting that this will take the majority of my time for the next few months, which means that you’ll get all the dribs and drabs from the raw drafts as they are composed. I’m also using this as a pretty nice way to go over the entire product and see how it all comes together as a cohesive whole, instead of looking at just a single piece every time.

Given that the period of putting bugs in the code is almost over, I feel that I can safely let the rest of the team fish out all the oopsies that I managed to get in and focus on the product, rather than the code. This is the second time that I have made such a shift (and the third time I’m writing a book), and it still feels awkward. On the other hand, there is a great sense of accomplishment when you see how things just click together and all that hard work is finally real in a way that no code review or artificial scenario can replicate.

Here is what I have planned so far for the book. Your comments are welcome as always.

One of the major challenges in writing this book came in considering how to structure it. There are so many concepts that relate to one another that it can be difficult to try to understand them in isolation. We can't talk about modeling documents before we understand the kind of features that we have available for us to work with, for example. Considering this, I'm going to introduce concepts in stages.

Part I - The basics of RavenDB

Focus: Developers

This part contains a practical discussion of how to build an application using RavenDB; we'll skip theory and concepts in favor of getting things done. This is what you'll want new hires to read before starting to work on an application using RavenDB. We'll keep the theory and the background information for the next part.

  • Chapter 2 - Zero to RavenDB - focuses on setting you up with a RavenDB instance, introduces the studio and some key concepts, and walks you through the Hello World equivalent of using RavenDB by building a very simple To Do app.
  • Chapter 3 - CRUD Operations - discusses the basics of connecting from your application to RavenDB and what you need to know to do CRUD properly: entities, documents, attachments, collections and queries.
  • Chapter 4 - The Client API - explores more advanced client scenarios, such as lazy requests, patching, bulk inserts, streaming queries and persistent subscriptions. We'll talk a bit about data modeling and working with embedded and related documents.

Part II - Ravens everywhere

Focus: Architects

This part focuses on the theory of building robust and high performance systems using RavenDB. We'll go directly to working with a cluster of RavenDB nodes on commodity hardware, discuss distribution of data and work across the cluster and how to best structure your systems to take advantage of what RavenDB brings to the table.

  • Chapter 5 - Clustering Setup - walks through the steps to bring up a cluster of RavenDB nodes and work with a clustered database. This will also discuss the high availability and load balancing features in RavenDB.
  • Chapter 6 - Clustering Deep Dive - takes you through RavenDB's clustering behavior, how it works and how both servers & clients work together to give you a seamless distributed experience. We'll also discuss error handling and recovery in a clustered environment.
  • Chapter 7 - Integrating with the Outside World - explores using RavenDB alongside additional systems: integrating with legacy systems, working with dedicated reporting databases, ETL processes, long running background tasks and, in general, how to make RavenDB fit better inside your environment.
  • Chapter 8 - Cluster Topologies - guides you through setting up several different clustering topologies and their pros and cons. This is intended to serve as a set of blueprints for architects to start from when they begin building a system.

Part III - Indexing

Focus: Developers, Architects

This part discusses how RavenDB indexes data to allow for quick retrieval of information, whether it is a single document or aggregated data spanning years. We'll cover all the different indexing methods in RavenDB and how you should use each of them in your systems to implement the features you want.

  • Chapter 9 - Simple Indexes - introduces indexes and their usage in RavenDB. Even though we have performed queries and worked with the data, we haven't actually dealt with indexes directly so far. Now is the time to lift the curtain and see how RavenDB is searching for information and what it means for your applications.
  • Chapter 11 - Full Text Search - takes a step beyond just querying the raw data and shows you how you can search your entities using powerful full text queries. We'll talk about the full text search options RavenDB provides, using analyzers to customize this for different usages and languages.
  • Chapter 12 - Complex Indexes - goes beyond simple indexes and shows how we can query over multiple collections at the same time. We will also see how we can piece together data at indexing time from related documents and have RavenDB keep the index consistent for us.
  • Chapter 13 - Map/Reduce - gets into data aggregation and how using Map/Reduce indexes allows you to get instant results over very large data sets at very little cost, making reporting-style queries cheap and accessible at any scale. Beyond simple aggregation, Map/Reduce in RavenDB also allows you to reshape the data coming from multiple sources into a single whole, regardless of complexity.
  • Chapter 14 - Facet and Dynamic Aggregation - steps beyond the static aggregation provided by Map/Reduce and gives you the ability to run dynamic aggregation queries on your data, or just facet your search results to make it easier for the user to find what they are looking for.
  • Chapter 15 - Artificial Documents and Recursive Map/Reduce - guides you through using indexes to generate documents, instead of the other way around, and then using that both for normal operations and to support recursive Map/Reduce and even more complex reporting scenarios.
  • Chapter 16 - The Query Optimizer - takes you into the RavenDB query optimizer and index management, and shows how RavenDB treats indexes from the inside out. We'll see the kind of machinery that is running behind the scenes to get everything going so that when you make a query, the results are there at once.
  • Chapter 17 - RavenDB Lucene Usage - goes into (quite gory) detail about how RavenDB makes use of Lucene to implement its indexes. This is intended mostly for people who need to know exactly what is going on and how everything fits together. This is how the sausage is made.
  • Chapter 18 - Advanced Indexing Techniques - digs into some specific usages of indexes that are a bit... outside the norm. Using spatial queries to find geographical information, generating dynamic suggestions on the fly, returning highlighted results for full text search queries. All the things that you would use once in a blue moon, but when you need them, you really need them.

Part IV - Operations

Focus: Operations

This part deals with running and supporting a RavenDB cluster or clusters in production, from how you spin up a new cluster to decommissioning a downed node to tracking down performance problems. We'll learn all that you need (and then a bit more) to understand what is going on with RavenDB and how to customize its behavior to fit your own environment.

  • Chapter 19 - Deploying to Production - guides you through several production deployment options, including all the gory details about how to spin up a cluster and keep it healthy and happy. We'll discuss deploying to anything from a container swarm to bare metal, the networking requirements and configuration you need, security implications and anything else that the operations teams will need to comfortably support a RavenDB cluster in that hard land called production.
  • Chapter 20 - Security - focuses solely on security. How you can control who can access which database, running an encrypted database for highly sensitive information and securing a RavenDB instance that is exposed to the wild world.
  • Chapter 21 - High Availability - brings failure to the table, repeatedly. We'll discuss how RavenDB handles failures in production, how to understand, predict and support RavenDB in keeping all of your databases accessible and high performance in the face of various errors and disasters.
  • Chapter 22 - Recovering from Disasters - covers what happens after disaster strikes. When machines melt down and go poof, or someone issues the wrong command and the data goes straight into the incinerator. Yep, this is where we talk about backups and restores and all the various reasons why operations consider them sacred.
  • Chapter 23 - Monitoring - covers how to monitor and support a RavenDB cluster in production. We'll see how RavenDB externalizes its own internal state and behavior for the admins to look at, and how to make sense of all of this information.
  • Chapter 24 - Tracking Performance - gets into why a particular query or node isn't performing up to spec. We'll discuss how one would track down such an issue and find the root cause responsible for such behavior, a few common reasons why such things happen and how to avoid or resolve them.

Part V - Implementation Details

Focus: RavenDB Core Team, RavenDB Support Engineers, Developers who want to read the code

This part is the orientation guide that we throw at new hires when we sit them in front of the code. It is full of implementation details and background information that you probably don't need if all you want to know is how to build an awesome system on top of RavenDB.

On the other hand, if you want to go through the code and understand why RavenDB is doing something in a particular way, this part will likely answer all those questions.

  • Chapter 25 - The Blittable Format - gets into the details of how RavenDB represents JSON documents internally, how we got to this particular format and how to work with it.
  • Chapter 26 - The Voron Storage Engine - breaks down the low-level storage engine we use to put bits on the disk. We'll walk through how it works, the various features it offers and, most importantly, why it ended up this way. A lot of the discussion is going to be driven by performance considerations, and it is extremely low-level and full of operating system and hardware minutiae.
  • Chapter 27 - The Transaction Merger - builds on top of Voron and comprises one of the major ways in which RavenDB is able to provide high performance. We'll discuss how it came about, how it is actually used and what it means in terms of the actual code using it.
  • Chapter 28 - The Rachis Consensus - talks about how RavenDB uses the Raft consensus protocol to connect the different nodes in the cluster together, how they interact with each other and the internal details of how it all comes together (and falls apart and recovers again).
  • Chapter 29 - Cluster State Machine - brings the discussion one step higher by talking about how RavenDB uses the result of the distributed consensus to actually manage all the nodes in the cluster, and how each node can independently and reliably arrive at the same decision.
  • Chapter 30 - Lording over Databases - peeks inside a single node and explores how a database is managed inside that node. More importantly, how we are dealing with multiple databases on the same node and what kind of impact each database can have on its neighbors.
  • Chapter 31 - Replication - dives into the details of how RavenDB manages a multi-master distributed database. We'll go over change vectors, which ensure conflict detection (and aid in its resolution), and how the data is actually replicated between the different nodes in a database group.
  • Chapter 32 - Internal Architecture - gives you the overall view of the internal architecture of RavenDB. How it is built from the inside, and the reasoning why the pieces came together in the way they did. We'll cover both high-level architecture concepts and micro architecture of the common building blocks in the project.

Part VI - Parting

This part summarizes the entire book and provides some insight into what our future vision for RavenDB is.

  • Chapter 33 - What comes next - discusses our (rough) plans for the next major version and our basic roadmap for RavenDB.
  • Chapter 34 - Are we there yet? Yes! - summarizes the book and lets you go and start actually using all of this awesome information.
time to read 3 min | 597 words

In my previous post, I talked about attachments, how they look in the studio and how to work with them from code. In this post, I want to dig a little deeper into how they actually work.

Attachments are basically blobs that can be attached to a document. A document can have any number of attachments, and the actual contents of an attachment are stored separately from the document. One of the advantages of this separate storage is that it also allows us to handle de-duplication.

The trivial example is that attaching the same file to a lot of documents will result in just a single instance of that file being kept around. There are actually quite a lot of use cases that call for this (for example, imagine the default profile picture), but this really shines when you start working with revisions. Every time the document changes (which includes modifications to attachments, of course), a new revision is created. Instead of having each revision clone all of the attachments, or not having attachments tracked by revisions at all, each revision will simply reference the same attachment data. That way, you can get a whole view of the document at a point in time, implement auditing and tracking, etc.

Another cool aspect of attachments being attached to documents is that they flow in the same manner over replication. So if you modified a document and added an attachment, that modification will be replicated at the same time (and in the same transaction) as that attachment.

In terms of actually working with the attachments, we keep track of the references between documents and attachments internally, and expose them via the document metadata.

This is done so you don’t need to make any additional server calls to get the attachments on a specific document. You just need to load it, and you have it all there.

It looks like this; note that you can get all the relevant information about the attachment directly from the document, without having to go elsewhere. This is also how you can compare attachment changes across revisions, and it allows you to write conflict resolution scripts that operate on documents and attachments seamlessly.

[screenshot: the document metadata in the studio, showing the attachment details]
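To make that concrete, here is a rough sketch of reading the same information from code. It assumes the 4.x client's session.Advanced.Attachments.GetNames call and a hypothetical User class; the exact names are illustrative, not authoritative.

using System;
using Raven.Client.Documents;

public class User
{
    public string Id { get; set; }
    public string Name { get; set; }
}

public static class AttachmentInfo
{
    public static void PrintAttachments(IDocumentStore store, string userId)
    {
        using (var session = store.OpenSession())
        {
            var user = session.Load<User>(userId);

            // The attachment details travel with the document metadata,
            // so listing them does not require another server call.
            foreach (var attachment in session.Advanced.Attachments.GetNames(user))
                Console.WriteLine($"{attachment.Name} ({attachment.ContentType}, {attachment.Size} bytes, hash {attachment.Hash})");
        }
    }
}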

Putting the attachment information inside the document metadata is a design decision that was made because for the vast majority of the cases, the number of attachments per document is pretty small, and even in larger cases (dozens or hundreds of attachments) it works very well. If you have a case where a single document has many thousands or tens of thousands of attachments, that will likely be a very high load on the metadata, and you should consider splitting the attachments into multiple documents (sub documents with some domain knowledge will work).

Let us consider a big customer, to whom we keep issuing invoices. A good problem to have is that eventually we’ll issue enough invoices to that customer that we start suffering from very big metadata load just because of the attachment tracking. We can handle that by using (customer/1234/invoices/2017-04, customer/1234/invoices/2017-05 ) as the documents we’ll use to hang the attachments on.
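A tiny sketch of that convention (the helper name and the exact id format are just an illustration):

using System;

public static class AttachmentHolders
{
    // One "attachment holder" document per customer per month, so no single
    // document accumulates an unbounded number of attachments.
    public static string InvoiceAttachmentsDocId(string customerId, DateTime issuedAt)
        => $"{customerId}/invoices/{issuedAt:yyyy-MM}";

    // InvoiceAttachmentsDocId("customer/1234", new DateTime(2017, 4, 12))
    //     == "customer/1234/invoices/2017-04"
}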

This was done intentionally, because it mimics the same way you’ll split a file that has unbounded growth (keeping all invoicing data for a big customer in a single document is also not a good idea, and has the same solution).

time to read 2 min | 251 words

[screenshot: a document with its attachments, as shown in the studio]

I previously wrote about the new attachments feature in RavenDB 4.0. Now it is ready to be seen by outside eyes.

As you can see in the image above, documents can now have attached attachments (I’m sorry, couldn’t think of a better way to phrase this). This gives you the ability to store binary data inside RavenDB, but not as some free floating value that has only a very loose connection to the rest of the system. Those attachments are strongly tied to their parent document, and allow you to store related information easily and right next to the document.

That also means that you can take advantage of RavenDB’s ACID nature and actually make modifications to both attachments and documents at the same time. For example, consider the following code:
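Roughly, that kind of code looks like the following sketch, which assumes the 4.x client's session.Advanced.Attachments.Store API, a hypothetical User class and a stand-in thumbnail helper:

using System;
using System.IO;
using Raven.Client.Documents;

public class User
{
    public string Id { get; set; }
    public bool HasProfilePicture { get; set; }
}

public static class ProfilePictures
{
    public static void SetProfilePicture(IDocumentStore store, string userId, Stream picture)
    {
        using (var session = store.OpenSession())
        {
            var user = session.Load<User>(userId);

            Stream thumbnail = GenerateThumbnail(picture); // hypothetical image-resizing helper

            // Both attachments and the document change go out in the same
            // SaveChanges call, i.e. in a single transaction.
            session.Advanced.Attachments.Store(user, "profile.png", picture, "image/png");
            session.Advanced.Attachments.Store(user, "profile-thumbnail.png", thumbnail, "image/png");

            user.HasProfilePicture = true;

            session.SaveChanges();
        }
    }

    // Stand-in for whatever image library you use to produce a thumbnail.
    private static Stream GenerateThumbnail(Stream picture) =>
        throw new NotImplementedException("use your favorite imaging library here");
}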

Here we get the user’s profile picture, generate a thumbnail from it and then associate both the picture and the thumbnail with the document. We also update the status of the user to indicate that they have a profile picture, and then we submit it all as one single transaction. That means that you don’t have to sync between different sources.

Attachments in RavenDB are also kept consistent through replication, so you won’t see partial results between nodes, and the attachments themselves are using de-duplication techniques to reduce the amount of storage that we take.

I’m really happy with this feature.

time to read 1 min | 88 words


I got the schedule for the upcoming conferences, and realized that I haven’t actually been talking about the conferences we go to, which is a shame, because there are a lot of them.

time to read 8 min | 1403 words

Late last year I talked about our first thoughts about implementing encryption in RavenDB in the 4.0 version. I got some really good feedback, and that led to this post, detailing the initial design for RavenDB encryption in the 4.0 version. I’m happy to announce that we are now pretty much done with regard to implementing, testing and banging on this, and we have working full database encryption in 4.0. This post will discuss how we implemented it and additional considerations regarding managing and working with encrypted databases and clusters.

RavenDB uses the ChaCha20-Poly1305 authenticated encryption scheme, with 256-bit keys. Each database has a master key, which on Windows is kept encrypted via DPAPI (on Linux we are still figuring out the best thing to do there), and we encrypt each page (or range of pages, if a value takes more than a single page) using a key derived from the master key and the page number. We also use a random 64-bit nonce for each page the first time it is encrypted, and then increment the nonce every time we need to encrypt the page again.

We are using libsodium as the encryption library, and in general it makes it a pleasure to work with encryption, since it is very focused on getting things done and getting them done right. The number of decisions that we had to make by using it is minimal (which is good). The pattern of initial random generation of the nonce and then incrementing on each use is the recommended method for using ChaCha20-Poly1305. The WAL is also encrypted on a per-transaction basis, with a random nonce and a key derived from the master key and the transaction id.
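As an illustration only, here is a rough C# sketch of the bookkeeping described above. It substitutes .NET's HKDF for libsodium's key derivation and the built-in ChaCha20Poly1305 class (the IETF variant, with a 12-byte nonce) for the 64-bit-nonce variant we actually use; the class and member names are mine.

using System;
using System.Collections.Generic;
using System.Security.Cryptography;

public class PageEncryptor
{
    private readonly byte[] _masterKey;                                      // 256-bit database master key
    private readonly Dictionary<long, byte[]> _nonces = new Dictionary<long, byte[]>();

    public PageEncryptor(byte[] masterKey) => _masterKey = masterKey;

    // Derive a per-page key from the master key and the page number.
    private byte[] DerivePageKey(long pageNumber) =>
        HKDF.DeriveKey(HashAlgorithmName.SHA256, _masterKey, outputLength: 32,
                       info: BitConverter.GetBytes(pageNumber));

    public (byte[] Ciphertext, byte[] Tag, byte[] Nonce) EncryptPage(long pageNumber, byte[] plaintext)
    {
        // Random nonce the first time a page is encrypted, incremented on every re-encryption.
        if (!_nonces.TryGetValue(pageNumber, out var nonce))
            _nonces[pageNumber] = nonce = RandomNumberGenerator.GetBytes(12);
        else
            Increment(nonce);

        var ciphertext = new byte[plaintext.Length];
        var tag = new byte[16];
        using (var aead = new ChaCha20Poly1305(DerivePageKey(pageNumber)))
            aead.Encrypt(nonce, plaintext, ciphertext, tag);

        return (ciphertext, tag, (byte[])nonce.Clone());
    }

    private static void Increment(byte[] nonce)
    {
        // Little-endian increment with carry.
        for (var i = 0; i < nonce.Length; i++)
        {
            if (++nonce[i] != 0)
                break;
        }
    }
}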

This encryption is done at the lowest possible level, so it is actually below pretty much everything in RavenDB itself. When a transaction needs to access some data, it is decrypted on the fly for it, and it is then available for the duration of the transaction. When the transaction is closed, we’ll encrypt all modified data again, then wipe all the buffers we used to ensure that there is no leakage. The only time we have decrypted data in memory is during the lifetime of a transaction.

A really nice benefit of working at such a low level is that all the features of the database are automatically included. That means that our Lucene on Voron implementation, for example, is also covered, and all indexes are encrypted as well, without having to take any special steps.

Encrypted databases do pose some challenges, though. Not so much for the database author (that’s me) as for the database users (that’s you). If you take a backup of an encrypted database, restoring it pretty much requires that you have the master key in order to actually access the data. When you are running in a cluster, that represents another challenge, since you need to make sure that all the nodes in the cluster running the database are encrypting it. Communication about the database should also be encrypted. Tasks such as periodic backup / export / etc. also need special treatment. Key generation is also important; you want to make sure that the key isn’t “123456” or some such.

Here is what we came up with. When you create a new encrypted database, you’ll be presented with the following UI:
[mockup: the new encrypted database creation dialog in the studio]

The idea is that you can have the server generate the key for you (a secure, cryptographically random 256 bits), and then we’ll show it to you and allow you to save it / print it / record it for later. This is the only chance you’ll have to get the key from the server.

After you have the key, you then select which nodes (and how many) this database is going to run on, which will cause the master key to be created on all of them for that particular database. In this manner, all the nodes for this database will use the same key, which simplifies some operational tasks (recovery / backup / restore / etc.). Alternatively, you can create the database on a single node and then add additional nodes to it, which will give you the option to generate / provide a key. I’m not sure whether ops will be happy with either option, but we have the ability to either have all of them using the same key or have a separate key for each node.

Encrypted databases can only reside on nodes that communicate with each other via HTTPS / SSL. The idea is that if you are running an encrypted database, requests between the different nodes should also be private, so you’ll have protection for data at rest and as the data is being moved around. Note that we verify that at the sender, not the origin side. Primarily, this is about typical deployment patterns. We expect to see users running RavenDB behind firewalls / proxies / load balancers, and we expect people to use nginx as the SSL endpoint and then talk HTTP to RavenDB (which is reasonable if they are on the same machine or using unix sockets). Regardless, when sending data over the wire, we are always talking RavenDB instance to RavenDB instance via a TCP connection, and for encrypted databases, this will require an SSL connection to work.

Other things, such as ETL tasks, will generate a suggestion to the user to ensure that they are also secured, but we’ll assume here that this is an explicit operation to move some of the data to another location, possibly filtering it, so we won’t block it if it isn’t using HTTPS (leaving aside the fact that just figuring out whether the connection to a relational database is using SSL or not is too complex to try); we’ll warn about it and trust the user.

Finally, we have the issue of backups and exports. Those can be done on a regular basis, and frequently you’ll want them to be done to a remote location (cloud, S3, Dropbox, etc). In that case, all automated backup processes will require you to generate a public / private key pair (and only retain the public key, obviously). From then on, all the backups will be encrypted using the public key and uploaded to the cloud. That means that even if your cloud account is hacked, the hackers can’t do anything with the data, since they are missing the private key.

Again, to encourage users to actually do the right thing and save the private key, we’ll offer it in a form that is easily kept safe. So you’ll be asked to print it and store it in a file folder somewhere (in addition to whatever digital backups you have), so you can restore it at a later point when / if you need it.

For manual operations, such as exporting an encrypted database, we’ll warn if you are trying to export without a key, but allow it (since forcing a key for a manual operation does nothing for security). Conversely, exporting a non-encrypted database will also allow you to provide a key pair and encrypt it, both for manual operations and for automatic backup configurations.

Aside from those considerations, you can pretty much treat an encrypted database as a regular one (except, of course, that the data is encrypted at all times unless actively accessed). That means that all features would just work. The cost of actually encrypting and decrypting all the time is another concern, and we have seen about 60% additional cost in write speed and about 15% extra cost for reading.

Just to give you some idea about the performance we are talking about… Ingesting the entire Stack Overflow dataset, some 52GB in size and over 18 million documents, can be done in 22 minutes with encryption. Without encryption we are faster, roughly 13 and a half minutes on that same machine, but that still gives us a rate of close to 2.5 GB per minute with encryption.

As you can tell from the mockup, while we have completed most of the encryption work around the engine, the actual UI and behavioral semantics are still in a bit of a flux. Your comments about those are welcome.

time to read 3 min | 545 words

We need to store an encryption key on Linux & Windows. On Windows, the decision is pretty much trivial: you throw that into DPAPI, and can rely on the operating system to handle that for us. In particular, it is very easy to analyze key scenarios such as “someone stole the hard disk” and say that either the thief wouldn’t be able to get the plain text key, or we can blame Microsoft for that.
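For reference, the entire Windows-side surface we rely on is roughly this (a minimal sketch using the ProtectedData class from System.Security.Cryptography; the key material is obviously a placeholder):

using System.Security.Cryptography;
using System.Text;

class DpapiExample
{
    static void Main()
    {
        byte[] masterKey = Encoding.UTF8.GetBytes("...256 bits of key material...");

        // Protect: only the current Windows user, on this machine, can get the plain text back.
        byte[] sealedKey = ProtectedData.Protect(masterKey, optionalEntropy: null,
                                                 scope: DataProtectionScope.CurrentUser);

        // Unprotect: fails for another user, or if the blob is copied to a different machine.
        byte[] plainKey = ProtectedData.Unprotect(sealedKey, optionalEntropy: null,
                                                  scope: DataProtectionScope.CurrentUser);
    }
}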

On Linux, the situation seems to be much more chaotic. There is libsecret, which seems to be much wider in scope than DPAPI. Whereas DPAPI has 2 methods (protect & unprotect), libsecret has a lot of moving pieces, which is quite scary. That is leaving aside the issue of having no managed implementation and having to dance around Gnome-specific data types in the API (you need to pass GCancellable & GError into it), which increases the complexity.

Other options include using some sort of hardware / software security modules (such as HashiCorp Vault), which is great in theory, but requires us to either take a dependency on something that might not be there, or try to support a wide variety of options (Keywhiz, Chef, Puppet, CloudHSM, etc). That isn’t a really good alternative from our point of view.

Looking into how Mono implemented DPAPI on Linux, it turns out they did it by writing a master key to an XML file and relying on file system ACLs to prevent anyone else from seeing that information. This ends up being:

chmod(path, (S_IRUSR | S_IWUSR | S_IXUSR)); /* owner-only read/write/execute, i.e. 0700 */

Which has the benefit of only allowing that user to access it, but given that I’ve got the physical disk, I’m able to easily mount it on a machine that I control as root and access anything that I like. On Windows, by the way, the way this is prevented is that the user must have logged in, and a key that is derived from their password is used to decrypt all protected data as needed, so without the user logging in, you cannot decrypt anything. For that matter, even the administrator on the machine can’t recover the data if they want to, because resetting the user’s password will cause all such information to be lost.

There is the Gnome.Keyring project as well, which hasn’t been updated in 7 years, and obviously doesn’t support kwallet (which libsecret does). OWASP seems to be throwing in the towel there and just recommends relying on the file system ACL.

The Linux Kernel has a Key Retention API, but it seems to be targeted primarily toward giving file systems access to the secrets they need, and it doesn’t look like it is going to survive reboots (it is primarily a caching mechanism, it seems).

So after all this research, I can say that I don’t like libsecret; it seems too cumbersome and will need users to jump through some hoops in some cases (install additional packages, set up access, etc.).

Setting up the permissions via the ACL seems to be the common way to handle this scenario, but it doesn’t sit well with me.

Any recommendations?

time to read 2 min | 364 words

I thought that I had found a bug in the TPL, but it looks like it’s working (more or less) by design. Basically, when a task is completed, all awaiting tasks will be notified of that, which is pretty much what you would expect. What isn’t usually expected is that those tasks can interfere with one another. Consider the following code:
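A minimal sketch of the kind of code in question (the method names and timings are mine):

using System;
using System.Threading;
using System.Threading.Tasks;

public static class ParentChildRepro
{
    // Child #1: once the parent completes, do a long, blocking piece of work.
    public static async Task BlockingChild(Task parent)
    {
        await parent;
        Thread.Sleep(TimeSpan.FromSeconds(10)); // hogs the thread that completed the parent
    }

    // Child #2: wait for the parent, but give up after a 5 second timeout.
    public static async Task WaitingChild(Task parent)
    {
        var winner = await Task.WhenAny(parent, Task.Delay(TimeSpan.FromSeconds(5)));
        Console.WriteLine(winner == parent ? "parent completed" : "timeout");
    }
}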

We have two tasks, which accept a parent task and do something with it. What do you think will happen when we run the following code?
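And a sketch of running them, under the same assumptions:

using System.Threading.Tasks;

public static class Program
{
    public static async Task Main()
    {
        var tcs = new TaskCompletionSource<object>();

        // Hook both children on the same parent task.
        var blocking = ParentChildRepro.BlockingChild(tcs.Task);
        var waiting = ParentChildRepro.WaitingChild(tcs.Task);

        // The "work" takes far less than the 5 second timeout...
        await Task.Delay(100);
        tcs.SetResult(null); // ...and yet WaitingChild prints "timeout"

        await Task.WhenAll(blocking, waiting);
    }
}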

Unless you are very quick on the draw, running this code will result in a timeout message. But how? We know that the task’s duration is much shorter than the timeout, so what is going on?

Well, effectively what is going on is that the parent task has a list of children that it will notify, and by default, it will do so synchronously and sequentially. If a child task blocks for whatever reason (for example, it might be processing a lot of work), the other children of the parent task will not be notified.

If there is a timeout set up, it will be triggered, even though the parent task was already completed. It took us a lot of time to figure out the repro for this issue, and we were certain that this was some sort of race condition in the TPL. I had a blog post talking all about it, but the Microsoft team is fast enough that they were able to literally answer my issue before I had the time to complete my blog post. That is really impressive.

I should note that the suggestion, using RunContinuationsAsynchronously, works quite well when creating a new Task or using TaskCompletionSource, but there is no way to specify it when you are using Task.Run. What is worse for us is that since this is not the default (for perfectly good performance reasons, by the way), any code that we call into might trigger this. I would have much rather been able to specify this when waiting on the task, rather than when creating it.
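For completeness, the producer-side shape of that suggestion looks roughly like this:

using System.Threading.Tasks;

// Continuations hooked on tcs.Task will be queued to the thread pool instead of
// running inline (and sequentially) on the thread that calls SetResult.
var tcs = new TaskCompletionSource<object>(
    TaskCreationOptions.RunContinuationsAsynchronously);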

time to read 1 min | 156 words

RavenDB 4.0 now has official Docker support. That means that if you have Docker installed, you can get RavenDB 4.0 running on your system with a single command:
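A plausible minimal form of that command, assuming the public ravendb/ravendb image and RavenDB's default 8080 port, is something along these lines:

docker run -d -p 8080:8080 ravendb/ravendb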

This single command is enough to spin up a new RavenDB instance that you can immediately start using.

If you are using the default Docker setup on Windows, you can immediately go to:

http://10.0.75.2:8080/

And you are pretty much done. That is a pretty awesome experience, I think.

We have support for Ubuntu and Windows Nano Server images, and you can run a more full-fledged script that gives you more exact control over what is going on.

time to read 5 min | 818 words

When building a distributed system, one of the more interesting aspects is how you are going to distribute task assignment. In other words, given that you have multiple nodes, how do you decide which node will do what? In some cases, that is relatively easy; you can say “all nodes will process read requests”, but in others, this is more complex. Let us take the case where you have several nodes, and you need to have a regular backup of a database that is replicated between all those nodes. You probably don’t want to run the backup across all the nodes; after all, they are pretty much the same and you don’t want to back up the exact same thing multiple times. On the other hand, you probably don’t want to assign this work statically; if you do, and the node that is responsible for the backup is down, you have no backup.

Another example of the problem can be seen when you have other processes that you would like to be sticky if possible, and only jump around if there is a failure. Bringing a new node online is a common thing to do in a cluster, and the ideal scenario in that case is that a single node will feed it all the data that it needs. If we have multiple nodes doing that, they are likely to overlap and they might very well overload the poor new server. So we want just one node to update its state, but if that node goes down midway, we need someone else to pick up the slack. For RavenDB, those sorts of tasks include things like ETL processes, subscriptions, backups, bootstrapping new servers and more. We keep discovering new things that can use this sort of behavior.

But how do we actually make this work?

One way of doing this is to take advantage of the fact that RavenDB is using Raft and have the leader do task assignment. That works quite well, but it doesn’t scale. What do I mean by that? I mean that as the number of tasks that we need to manage grows, the complexity of the task assignment portion of the code grows as well. Ideally, I don’t want to have to consider twenty different variables and considerations before deciding which operation should go on which server, and trying to balance that sort of work in one place has proven to be challenging.

Instead, we have decided to allocate work in the cluster based on simple rules. Each task in the cluster has a key (which is typically generated by hashing some of its parameters), and that task can be assigned to a group of servers. Given those two details, we can use Jump Consistent Hashing to spread the load around. However, that doesn’t handle failover. We have a heartbeat process that can detect and notify nodes that a sibling has gone down, so combining those two, we get the following code:
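The actual RavenDB code is more involved; what follows is a minimal sketch of the idea, using the standard Jump Consistent Hash function (Lamping & Veach) and a hypothetical isAlive callback standing in for the leader-maintained liveness information:

using System;
using System.Collections.Generic;
using System.Linq;

public static class TaskAssignment
{
    // Jump Consistent Hash: maps a 64-bit key to one of numBuckets buckets.
    public static int JumpConsistentHash(ulong key, int numBuckets)
    {
        long b = -1, j = 0;
        while (j < numBuckets)
        {
            b = j;
            key = key * 2862933555777941757UL + 1;
            j = (long)((b + 1) * ((double)(1L << 31) / ((key >> 33) + 1)));
        }
        return (int)b;
    }

    // Pick the node responsible for a task, skipping nodes that are known to be down.
    public static string WhoseTaskIsIt(IReadOnlyList<string> topology, ulong taskKey,
                                       Func<string, bool> isAlive)
    {
        var candidates = topology.ToList();
        while (candidates.Count > 0)
        {
            var index = JumpConsistentHash(taskKey, candidates.Count);
            var node = candidates[index];
            if (isAlive(node))
                return node;

            // The preferred owner is down: drop it from the topology and re-hash the key,
            // so the orphaned tasks spread over the remaining nodes instead of all
            // falling on the next node in the list.
            candidates.RemoveAt(index);
            taskKey = taskKey * 2862933555777941757UL + 1;
        }
        return null; // no live node to assign the task to
    }
}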

What we are doing here is relying on two different properties: Jump Consistent Hashing, which lets us know which node is responsible for what, and the Raft cluster leader, which keeps track of all the live nodes and lets us know when a node goes down. When we need to assign a task, we use its hashed key to find its preferred owner, and if it is alive, that is that. But if it is currently down, we do two things: we remove the downed node from the topology and re-hash the key with the new number of nodes in the cluster. That gives us a new preferred node, and so on until we find a live one.

The reason we re-hash on failover is that Jump Consistent Hashing will usually point to the same position in the topology (that is why we chose it in the first place, after all), so we re-hash to get a different position so it won’t all fall onto the next node in the list. The downed node’s tasks are fairly distributed among the remaining live cluster members.

The nice thing about this is that aside from keeping the live/down list up to date, the Raft cluster doesn’t really need to do anything. This is a consistent algorithm, so different nodes operating on the same data will arrive at the same result, which means that when a node goes down, one node will pick up the work of bringing the new server up to spec and another will start the backup process. And all of that logic is right where we want it, right next to where the task logic itself is written.

This allows us to reason much more effectively about the behavior of each independent task, and also allows each node to let you know where each task is executing.
