Oren Eini

CEO of RavenDB

a NoSQL Open Source Document Database

Get in touch with me:

oren@ravendb.net +972 52-548-6969

time to read 2 min | 332 words

Here is the original post, and now let us get down to solving it…

The key part of solving this issue is knowing that if you wait for the actual cluster change, there is very little that you can do. Technically speaking, if you have the keys, you can try to figure out all the ranges that fall under the new nodes in the cluster, and then query on all those ranges. It will work, but anyone who answers that is going to be hit when there are multiple concurrent additions to the cluster (adding 1 node, then 3, then 5, etc.). That is something that is incredibly common when you start scaling up, and if you haven't moved all the data yet, you don't want to have to wait until you do before you start responding to the current workload. Then there are things like "what happens if we reboot midway through?", etc.

A much simpler alternative is to move some of the work to write time for each data item. We already need to compute which node a particular item is going to reside on, after all. What we are also going to compute is when that is going to change, and we are going to record that. When the cluster size grows above that recorded size, we can simply query for all the data items that are going to be moved at the larger cluster size. This way, we gain a couple of interesting properties.

We don't need to worry about adding multiple nodes concurrently: just doing "WHERE NeedToMoveWhenSizeIsGreaterThan < :ClusterSize" is going to be enough to find all the data items that need to be moved. It is resilient to restarts / errors, and it can gracefully handle the multiple moves scenario.

Oh, and how do we find the next cluster size at which this particular data item is going to have to move? Well, I have to leave something for future candidates with googling skillz to actually handle.
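
To make the bookkeeping concrete, here is a minimal sketch in C, assuming the placement function is the Jump Consistent Hash mentioned in the original question. All of the names are illustrative, and the loop at the end is just one obvious way to fill in the part that is left for the candidates.

    #include <stdint.h>

    /* Jump Consistent Hash, as given in the Lamping & Veach paper: maps a
     * 64-bit key to a node in [0, num_nodes). */
    int32_t jump_consistent_hash(uint64_t key, int32_t num_nodes)
    {
        int64_t b = -1, j = 0;
        while (j < num_nodes) {
            b = j;
            key = key * 2862933555777941757ULL + 1;
            j = (int64_t)((double)(b + 1) *
                          ((double)(1LL << 31) / (double)((key >> 33) + 1)));
        }
        return (int32_t)b;
    }

    /* Computed once, at write time, and stored alongside the data item as
     * NeedToMoveWhenSizeIsGreaterThan. The item stays on its current node for
     * every cluster size up to and including the returned value, so the query
     * "WHERE NeedToMoveWhenSizeIsGreaterThan < :ClusterSize" finds exactly the
     * items that have to move, no matter how many nodes were added at once. */
    int32_t need_to_move_when_size_is_greater_than(uint64_t key, int32_t cluster_size)
    {
        int32_t current_node = jump_consistent_hash(key, cluster_size);
        int32_t size = cluster_size;
        /* walk forward until the assignment changes; in practice you would cap
         * this search at some reasonable maximum cluster size */
        while (jump_consistent_hash(key, size + 1) == current_node)
            size++;
        return size;
    }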

time to read 3 min | 522 words

So far, we looked at naïve options for data storage and for building indexes, and we found them lacking. The amount of complexity involved was just too much, and the performance costs were not conducive to good business.

In this post, I want to explore the Log Structured Merge (LSM) option. This is a pretty simple solution. Our file format remains pretty simple: it is just a flat list of records, but we add a very small twist. For each collection of data (we can call it a table, an index, or whatever), all the records are going to be sorted inside that file based on some criteria.

In other words, here is our file again:

[Image: the records file, with all of its records sorted]

But what about updates? As we mentioned, adding a user with the username ‘baseball’ will force us to move quite a lot of data. Well, the answer to that is that we are not going to modify the existing file. Indeed, in LSM, once a file has been written out, it can never be changed again. Instead, we are going to create a new file, with the new information.

When we query, we'll search the files in descending order, so newer files are checked first. That allows us to see the updated information. Such a system also relies on tombstone markers to delete values, and it is even possible to run range searches by scanning multiple files (merge sorting on the fly). Of course, over time, the number of files you are using is going to increase, so any LSM solution also has a merge phase (it is right there in the name), where the data from many files is merged together.

This leads to some interesting challenges. Scanning a file to see if a value is there can be expensive (seeks, again), so we typically use something like a bloom filter to skip that if possible. Merging files is expensive (a lot of I/O), so we want to be sure that we aren't doing it too often, and yet not doing it means that we have to do a lot more operations, so there are a lot of heuristics involved.
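
To make the read path concrete, here is a small sketch in C. It illustrates the general approach described above rather than any particular engine's code: segments are immutable sorted runs held newest-first, a NULL value stands in for a tombstone, and the bloom filter is reduced to a stub.

    #include <stdbool.h>
    #include <stddef.h>
    #include <string.h>

    /* One record inside an immutable, sorted run ("segment");
     * value == NULL marks a tombstone. */
    typedef struct {
        const char *key;
        const char *value;
    } record;

    typedef struct {
        const record *records;   /* sorted by key */
        size_t count;
    } segment;

    /* Stand-in for a real bloom filter: may report false positives,
     * never false negatives. */
    static bool might_contain(const segment *seg, const char *key)
    {
        (void)seg; (void)key;
        return true;
    }

    /* Binary search within one sorted segment. */
    static const record *find_in_segment(const segment *seg, const char *key)
    {
        size_t lo = 0, hi = seg->count;
        while (lo < hi) {
            size_t mid = lo + (hi - lo) / 2;
            int cmp = strcmp(seg->records[mid].key, key);
            if (cmp == 0)
                return &seg->records[mid];
            if (cmp < 0)
                lo = mid + 1;
            else
                hi = mid;
        }
        return NULL;
    }

    /* Search the segments newest-first; the first hit wins, so newer writes
     * and deletes (tombstones) shadow older values. */
    const char *lsm_get(const segment *segments, size_t segment_count, const char *key)
    {
        for (size_t i = 0; i < segment_count; i++) {
            if (!might_contain(&segments[i], key))
                continue;                    /* bloom filter: skip most segments cheaply */
            const record *found = find_in_segment(&segments[i], key);
            if (found != NULL)
                return found->value;         /* NULL here means "deleted" */
        }
        return NULL;                         /* never written at all */
    }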

LSM can be a particularly good solution for certain kinds of data stores. Lucene is actually able to do significant optimizations in the way it works as a result of LSM, because it clears internal data structures during the merge operation. Other databases which use LSM are LevelDB, RocksDB, Cassandra, etc.

Personally, I don't like LSM solutions very much; it seems that in pretty much any such solution I saw, the merge heuristics were incredibly capable of scheduling expensive merges just when I didn't want them to do anything. And there is quite a bit of complexity involved in managing a potentially large number of files. There is also another issue: it is pretty hard to have physical separation of the data using LSM; you typically have to use a separate file for each, which also doesn't help very much.

A much more elegant solution in my view is the B+Tree, but I’ll keep that for the next post. 

time to read 4 min | 734 words

In the previous post, I showed how we can use a simple text based format to store records, and what it means for reading / writing those records. Searching in such a system means that you have quite a bit of work to do, since you need to scan the whole file to do so.

That isn't really going to work for us. So we need to introduce indexes. An index is actually a really simple idea. Given that we have the original file, we'll have the following index for username:

[Image: the username index, with sorted usernames and their record numbers]

Basically, the first column in the index is the value of the username, and those values are sorted, so searching the index has an O(log N) complexity associated with it. Once we have done that, we have the record number, and we can use that to find the position in the file and read the whole record.

All in all, this is pretty basic, right? So why am I dedicating a whole post to this topic?

Well, indexes would be simple if we were planning on only building them once. But we want to update them, and that leads to certain issues. While the index above shows only a few records, the actual data size we are talking about here is a million records. That gives us a total index size of about 16MB. What happens if we need to add a new username? In this case, a die-hard fan whose username is 'baseball'?

Well, in order to maintain the order, we’ll have to put it between the first two entries. Which will require us to move the rest of the data by that same amount. Effectively, in order to add a single entry to the index, we’ll have to write 16MB.

In other words, because files don’t allow us to dynamically add / remove data from the middle of the file without a lot of expensive I/O, we can’t really use flat files for this. It isn’t a surprise, but I wanted to start from the bare minimum and walk us up in the complexity hierarchy as we discover why we need those things.

So we need to find a file format that will allow us to update things without having to shuffle the entirety of the file every time we do that. Looking at the literature, I can compare the flat file approach to having a sorted array, and it has the same problems. The typical solution for that is to use a binary search tree. In this way, we always have the root of the tree at the beginning of the file, and use offsets to jump around in the file according to where we need to go.

The code for searching in this kind of file looks like this:

[Image: the code for searching in this kind of file]
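
The screenshot did not survive, so the following is a rough reconstruction of the same idea in C, with all error handling removed as the post notes. The node layout (a fixed-size struct holding the key, the record number, and the file offsets of the two children, -1 meaning no child) is an assumption for illustration, not the post's actual format.

    #include <stdint.h>
    #include <stdio.h>
    #include <string.h>

    /* Assumed node layout: the root lives at offset 0, and each child pointer
     * is a file offset, or -1 when there is no child on that side. */
    typedef struct {
        char    key[12];
        int64_t record_number;
        int64_t left_offset;
        int64_t right_offset;
    } tree_node;

    /* Returns the record number for the key, or -1 if it isn't in the index. */
    int64_t search(FILE *index, const char *key)
    {
        int64_t offset = 0;
        tree_node node;
        for (;;) {
            fseek(index, (long)offset, SEEK_SET);   /* every step costs a seek */
            fread(&node, sizeof(node), 1, index);
            int cmp = strcmp(key, node.key);
            if (cmp == 0)
                return node.record_number;
            offset = (cmp < 0) ? node.left_offset : node.right_offset;
            if (offset < 0)
                return -1;                          /* no child on that side */
        }
    }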

Note that this is with all error handling removed, and while it gives us what we want, this solution has several issues.

To start with, if we want to have good performance, we’ll need to balance the tree on inserts / deletes, and that can be complex. Not only that, but it also means that the number of file operations that we’ll actually perform is pretty high here. This isn’t good, because files reside on (very slow) hardware, so we probably don’t want to use that.

Instead, we need to find something that will allow us to do a lot fewer seeks in the file (each of which takes about 10-20 ms on a standard HDD), and can handle concurrent work better (won't be fighting over the disk head position so much). We also need to make sure that the amount of work that we have to do when we modify something is minimal. In the case of a balanced tree, the amount of work to balance it can be extremely high, and the number of places you'll have to jump around (thus, seek) is incredible.

In the next post, we’ll discuss persistent algorithm options here, and what we should use to store the data.

time to read 6 min | 1099 words

Reading the comments in this Reddit thread gave me a bit of a pause. Mostly because of the level of knowledge about what I consider basic stuff. Then again, I can't do any UI that doesn't involve the <table/> tag, so I guess that makes sense. So I decided to do something about this lack of knowledge. In this series, I'm going to go over just those kinds of details that I consider essential to understanding how databases actually work.

Imagine that you had a program that needed to persist some state. Let us say that we need to keep track of users and how often they log into the system. You would typically use your database of choice to do so, and you’ll be flying away in no time.

But for this series, we are assuming that no such software exists, so we have to roll our own persistence solution. The simplest solution is to simply store the data as a CSV file. Here is what we end up doing:

[Image: the code for writing the users out as a CSV file]
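
Something along these lines, sketched here in C (the file name and the exact field layout are assumptions for illustration):

    #include <stdio.h>

    /* Append a user record to the CSV file: full name, user name, last login. */
    int append_user(const char *full_name, const char *user_name,
                    const char *last_login)
    {
        FILE *file = fopen("users.csv", "a");
        if (file == NULL)
            return -1;
        fprintf(file, "%s,%s,%s\n", full_name, user_name, last_login);
        return fclose(file);
    }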

Now, this has several advantages, simplicity being a key one, but it also suffers from pretty much every possible disadvantage.

Let us consider the following simple scenarios:

  • I want to update the last login date
  • I want to update the user name
  • I want to search by user name
  • I want to get records sorted by last login date
  • I want to add a record
  • I want to delete a record

Oh, and I want to do all of the above when I have a million records.

CSV is a great thing, if the number of records we have is small. To be fair, on modern hardware, a million records is small, but we’ll ignore this for reasons that will be explained in the following posts.

Let us look at how the file actually looks, shall we?

[Image: the raw contents of the CSV file]

Now, if we want to update the last login date for Oren, that is pretty easy. We can do that using the following sequence of operations:

[Image: the C API calls that update the last login date in place]
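
Since the screenshot is missing, here is a reconstruction of that sequence using the C file API; the file name, the offset, and the date value are illustrative only.

    #include <stdio.h>

    int update_last_login(void)
    {
        FILE *file = fopen("users.csv", "r+");   /* file name is an assumption */
        if (file == NULL)
            return -1;
        /* seek to the last login field of the first data row; the offset here
         * is illustrative, it depends on the actual header and field sizes */
        fseek(file, 19, SEEK_SET);
        /* overwrite the date with a new value of exactly the same size */
        fwrite("2016-06-15 05:20:00", 19, 1, file);
        return fclose(file);
    }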

I’m intentionally using C API here, because that is what actually is going on, and we need to understand and work with that.

Notice a few things that are going on in here. We open the file, seek to the date’s position on the first data row, and then we update the date with a value that has the same size. Finally, we close the file.

Our mission has met with success! Let us all retire somewhere for breakfast. (I'm writing this at 5:20 AM, after a "night" that had no darkness, while being at NDC Oslo. Looking out the window, I see clear skies, great visibility and absolutely no people. It feels like the opening scene in a zombie movie.)

Anyway, before breakfast can commence, let us try to update the user name. Arava, while being an awesome German Shepherdess, has chosen a bad username. No matter what I call her when she chews a tennis ball in the bed again. Let us see what we'll need to do to update her username to the more neutral "dog".

[Image: the C API calls that overwrite the username with a space-padded value]

Now, you might notice that I'm somewhat cheating here. The problem is that the sizes of the values are different, but there is no real way for us to just cut a part out of the file, so I'm abusing the fact that the CSV parser will ignore any whitespace at the beginning of the value to "trim" the value.
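
Reconstructed, the trick looks something like this; the offset and the old username's length are assumptions for illustration.

    #include <stdio.h>

    int rename_arava_to_dog(void)
    {
        FILE *file = fopen("users.csv", "r+");
        if (file == NULL)
            return -1;
        /* seek to the username field of Arava's row (offset is illustrative) */
        fseek(file, 68, SEEK_SET);
        /* the old username took 5 characters here, so pad "dog" to 5 with
         * leading spaces; the CSV parser will trim them away on read */
        fwrite("  dog", 5, 1, file);
        return fclose(file);
    }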

But what happens if I didn't want to shorten the username? What if I wanted the username to be shepherdess? Well, in this case, I wouldn't be able to do that. I don't have enough space, and if I tried, I would overwrite the next record, and probably corrupt the whole file.

Bummer.

Typically, at this point we have to abandon the simplicity of a CSV file and move over to a slightly better format. In this case, we can go with fixed size records. In other words, we’ll define the maximum size for each field. Here is what this looks like:

[Image: the same data rewritten with fixed-size fields]

This format is harder for humans to generate, because if we do this manually we have to count spaces, etc. But this is actually much easier for us to work with. We have the following format:

  • Full Name – Max 16 chars
  • User Name – Max 12 chars
  • Last Login – Exactly 19 chars

Computing where to write the last login date for a particular record is now simply the following math:

(16 + 12 + 19 + 2) * NumberOfRecords + 16 + 12 + 42

42 is the length in bytes of the first two rows, and the +2 is for the line terminator.
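
The same math expressed as code, reading NumberOfRecords as the zero-based index of the record we are after:

    #include <stddef.h>

    #define FULL_NAME_SIZE  16
    #define USER_NAME_SIZE  12
    #define LAST_LOGIN_SIZE 19
    #define LINE_TERMINATOR  2   /* two bytes, per the formula above */
    #define HEADER_SIZE     42   /* the first two rows */

    /* File offset of the last login field for a given (zero-based) record. */
    size_t last_login_offset(size_t record_number)
    {
        size_t record_size = FULL_NAME_SIZE + USER_NAME_SIZE +
                             LAST_LOGIN_SIZE + LINE_TERMINATOR;
        return record_size * record_number +
               FULL_NAME_SIZE + USER_NAME_SIZE + HEADER_SIZE;
    }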

So far so good, now we have a file format, and the ability to work with it. Which is great, and even quite convenient for us. But it isn’t really suitable for doing anything but full scans of the file. Here is the logic for searching a record by username:

[Image: the C code that scans the file for a record by username]
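
The screenshot is gone, so here is a rough reconstruction of that logic in C; the file name and the exact padding convention for the fields are assumptions.

    #include <stdio.h>
    #include <string.h>

    #define FULL_NAME_SIZE  16
    #define USER_NAME_SIZE  12
    #define LAST_LOGIN_SIZE 19
    #define LINE_TERMINATOR  2
    #define HEADER_SIZE     42
    #define RECORD_SIZE (FULL_NAME_SIZE + USER_NAME_SIZE + LAST_LOGIN_SIZE + LINE_TERMINATOR)

    /* Scan every record in turn; returns the zero-based record number of the
     * matching user, or -1 if there is no such username. */
    long find_by_username(const char *path, const char *username)
    {
        char record[RECORD_SIZE];
        FILE *file = fopen(path, "r");
        if (file == NULL)
            return -1;
        fseek(file, HEADER_SIZE, SEEK_SET);          /* skip the two header rows */
        long record_number = 0;
        while (fread(record, RECORD_SIZE, 1, file) == 1) {
            char user_name[USER_NAME_SIZE + 1];
            memcpy(user_name, record + FULL_NAME_SIZE, USER_NAME_SIZE);
            user_name[USER_NAME_SIZE] = '\0';
            /* assumed: fields are padded with trailing spaces, so trim them */
            char *end = user_name + USER_NAME_SIZE;
            while (end > user_name && end[-1] == ' ')
                *--end = '\0';
            if (strcmp(user_name, username) == 0) {
                fclose(file);
                return record_number;
            }
            record_number++;
        }
        fclose(file);
        return -1;                                   /* O(N): we had to scan it all */
    }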

By the way, I'm doing this in C also because it is fun; it has been a while since I did C, so it might be wrong.

As you can see, the cost of actually doing any sort of operation in this format is simply O(N). That isn't going to work for us.

In the next post in this series, I’m going to talk about indexes, and how they work. After that, we will start talking about the actual on disk format, and what it is impacted by.

time to read 3 min | 410 words

The following is likely to end up in the list of questions we’ll ask candidates to answer when they apply to Hibernating Rhinos.

Imagine a sharded database. A sharded database is one where the data is split among multiple nodes. To make things simple, we will assume that each datum in the database has a 64-bit key associated with it, and we are interested in distributing the information evenly among the nodes. This can be done using Jump Consistent Hashing (see paper for details), and can be implemented using the following simple C# function:
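
The embedded snippet did not make it into this page. For reference, this is the function from the Jump Consistent Hashing paper, sketched here in C rather than the C# the post showed:

    #include <stdint.h>

    /* Jump Consistent Hash (Lamping & Veach): maps a 64-bit key to a node
     * in the range [0, num_nodes). */
    int32_t jump_consistent_hash(uint64_t key, int32_t num_nodes)
    {
        int64_t b = -1, j = 0;
        while (j < num_nodes) {
            b = j;
            key = key * 2862933555777941757ULL + 1;
            j = (int64_t)((double)(b + 1) *
                          ((double)(1LL << 31) / (double)((key >> 33) + 1)));
        }
        return (int32_t)b;
    }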

This function is responsible for taking a key and (given how many nodes there are in the cluster) providing the node this key resides on.

So far, so good, and this makes quite a lot of things much simpler. This function ensures that roughly 1/N of the data items in the database will require movement to a new node when it is introduced, which is pretty much exactly what we want in a sharded environment. However, this function doesn't help us figure out what to move.

Assume that we already have a database that has 5 nodes, and ten billion data items spread across all 5 nodes according to the jump consistent hash function. Because of load, we need to add an additional 2 nodes to the cluster, and we need to move 2/7 of the cluster data (about 2.8 billion data items) to the new nodes. However, since moving the data items alone is going to be costly, we want to avoid scanning through all 10 billion items in all nodes to figure out which ones we need to ship to the new nodes, and which ones should remain with the current node.

Suggest a way that will allow the database to find out which data items need to be moved to the new nodes, without having to scan through all of them. In other words, anything that requires O(number of items in each node) is out.

You are rated on the impact of your suggestion on the overall cluster behavior. The cheaper your option, the better. You are free to store additional information (at cluster creation / modification, as data items are entered into the cluster / deleted, etc) if this can help you, but be aware that any impact on standard operations (reads & writes) should be minimal and well justified.

You only need to consider adding nodes to the cluster, removing nodes from the cluster is not required.

time to read 3 min | 578 words

For the actual question, see the original post.

So the first thing that we need to decide is what the data format of the trie will be. Since we have only 32KB to work with, we need to consider the actual storage.

32KB is small enough that any offset into it fits in an unsigned short, so all the references we'll use will be shorts. We also need to store a bit of metadata, so we'll use the first 4 bytes as the header for just that.

  • ushort SpaceUsed;
  • ushort LastAllocation;

Now that we have this, we need to decide how to store the actual data. To make things easy, we are going to define the following way to allocate memory:
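
The snippet from the original post is missing here; the following is a rough sketch, in C, of an allocator with the properties described next (a one-byte length prefix, allocations capped at 127 bytes, and a negative size as a delete marker). Everything beyond the two header fields is an assumption.

    #include <stdint.h>

    #define BUFFER_SIZE (32 * 1024)

    /* The 4-byte header from above, followed by the raw space we hand out. */
    typedef struct {
        uint16_t SpaceUsed;        /* bytes currently held by live allocations */
        uint16_t LastAllocation;   /* offset of the next unused byte in Data   */
        uint8_t  Data[BUFFER_SIZE - 4];
    } trie_buffer;

    /* Bump allocation with a one-byte length prefix; returns the position of
     * the usable bytes, or 0 when the caller has to defrag or give up. */
    uint16_t allocate(trie_buffer *buf, uint8_t size)
    {
        if (size > 127)
            return 0;                               /* allocations capped at 127 bytes */
        if (buf->LastAllocation + size + 1 > sizeof(buf->Data))
            return 0;                               /* ran off the end of the buffer   */
        uint16_t offset = buf->LastAllocation;
        buf->Data[offset] = size;                   /* positive length prefix = live   */
        buf->LastAllocation += (uint16_t)(size + 1);
        buf->SpaceUsed += (uint16_t)(size + 1);
        return (uint16_t)(offset + 1);
    }

    /* Deletion just flips the length prefix negative; the space is reclaimed
     * later by scanning for negative sizes or by a full defrag pass. */
    void release(trie_buffer *buf, uint16_t position)
    {
        int8_t size = (int8_t)buf->Data[position - 1];
        buf->Data[position - 1] = (uint8_t)(-size);
        buf->SpaceUsed -= (uint16_t)(size + 1);
    }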

This is about the simplest way that you can go about doing things. Note that we use a length-prefix value, and we limit allocations to a max of 127 bytes each. We use a negative size to indicate a delete marker.

So basically, now we have a pretty trivial way to allocate memory, and we can implement the trie as we would normally do. There are a few wrinkles, however.

Deleting memory doesn't actually make it eligible for reuse, and the buffer is quite likely to get fragmented easily. In order to handle that, we will track the amount of space that is in use, and if we get to the end of the buffer, we'll check the SpaceUsed value. If there still isn't enough room, we can abort; there is no available space here. However, if we have reached the end of the buffer but there is free space available, we can do the following:

  • Scan the buffer for available spots (find available locations that have negative size).
  • Failing that, we will copy the data to a temporary buffer, then re-add everything to the buffer from scratch. In other words, we defrag it.

Another issue we have is that the maximum size we can allocate is 127. This value is big enough that most actual strings can fit into it nicely, and a trie already has the property that a large string might be broken into pieces, so we'll just cap each node in the trie at a max size of 127. Actually, the max size is likely to be less than that, because there is also some information that we need to keep track of per entry.

  • byte NumberOfChildren;
  • byte Flags; // node type, (internal, leaf or both)
  • ushort ChildrenPosition;

So in practice we have about 123 bytes to work with for the key. Note that we don't store the length of the node's string value (we can get that from the allocation information), and that we store the actual children in an array that is kept separately. This allows us to easily add items to the trie as child nodes. If the node is a leaf node, we also need to store the actual value (which is 8 bytes); we store that information at the end of the allocation (giving us 115 bytes for that section of the value).
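
For illustration only, the per-entry bookkeeping described above might be laid out like this; the post specifies the three fields, while the packing and the trailing layout are assumptions.

    #include <stdint.h>

    #pragma pack(push, 1)
    typedef struct {
        uint8_t  NumberOfChildren;
        uint8_t  Flags;             /* node type: internal, leaf, or both          */
        uint16_t ChildrenPosition;  /* position of the separately stored children  */
        /* followed by the node's key bytes (up to 123, or 115 for a leaf), and,
         * for a leaf node, an 8-byte value at the end of the allocation; the key
         * length is recovered from the allocation's length prefix */
    } trie_node_header;
    #pragma pack(pop)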

All in all, there is going to be a bit of pointer arithmetic and bit counting, but it is likely to be a pretty simple implementation.

Note that additional optimizations would be to try to align everything so it fits into a cache line, to place nodes near their children (which are more likely to be followed), etc.

time to read 1 min | 145 words

The slides from the RavenDB Conference are now available; we'll post the videos in a few weeks, once we are done post-processing them.

Day 1:

Day 2:
