Oren Eini

CEO of RavenDB

a NoSQL Open Source Document Database

time to read 10 min | 1985 words

In the previous post, we looked into what it would take to reduce the cost of filtering negative numbers. We got into the assembly and analyzed exactly what was going on. For the direct approach, I don’t think that even hand-optimized assembly would take us much further. Let’s see if there are other options available to us to get better speed.

The first thing that pops to mind here is loop unrolling. After all, we have a very tight loop, and if we can do more work per loop iteration, we might get better performance, no? Here is my first version:
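
A minimal sketch of such an unrolled version (an illustration under my own assumptions, not the exact code from the post): process four items per iteration, still indexing the span directly, and keep a scalar tail loop for the remainder.

    // requires: using System;
    static int FilterCmp_Unroll(Span<long> items)
    {
        int output = 0;
        int i = 0;
        for (; i + 4 <= items.Length; i += 4)
        {
            long a = items[i + 0], b = items[i + 1], c = items[i + 2], d = items[i + 3];
            if (a >= 0) items[output++] = a;
            if (b >= 0) items[output++] = b;
            if (c >= 0) items[output++] = c;
            if (d >= 0) items[output++] = d;
        }
        for (; i < items.Length; i++)          // remainder
        {
            if (items[i] >= 0) items[output++] = items[i];
        }
        return output;
    }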

And here are the benchmark results:

| Method           | N        | Mean            | Error         | StdDev        | Ratio | Code Size |
|------------------|----------|-----------------|---------------|---------------|-------|-----------|
| FilterCmp        | 23       | 274.6 ns        | 0.40 ns       | 0.35 ns       | 1.00  | 411 B     |
| FilterCmp_Unroll | 23       | 257.5 ns        | 0.94 ns       | 0.83 ns       | 0.94  | 606 B     |
| FilterCmp        | 1047     | 748.1 ns        | 2.91 ns       | 2.58 ns       | 1.00  | 411 B     |
| FilterCmp_Unroll | 1047     | 702.5 ns        | 5.23 ns       | 4.89 ns       | 0.94  | 606 B     |
| FilterCmp        | 1048599  | 501,545.2 ns    | 4,985.42 ns   | 4,419.45 ns   | 1.00  | 411 B     |
| FilterCmp_Unroll | 1048599  | 446,311.1 ns    | 3,131.42 ns   | 2,929.14 ns   | 0.89  | 606 B     |
| FilterCmp        | 33554455 | 29,637,052.2 ns | 184,796.17 ns | 163,817.00 ns | 1.00  | 411 B     |
| FilterCmp_Unroll | 33554455 | 29,275,060.6 ns | 145,756.53 ns | 121,713.31 ns | 0.99  | 606 B     |

That is quite a jump; 6% – 11% savings is no joke. Let’s look at what is actually going on at the assembly level and see if we can optimize this further.

As expected, the code size is bigger, 264 bytes versus the 55 we previously got. But more importantly, we got the range checks back, and a lot of them:

The JIT isn’t able to reason about our first for loop and see that all our accesses are within bounds, which leads to doing a lot of range checks, and likely slows us down. Even with that, we are still showing significant improvements here.

Let’s see what we can do with this:
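
Here is a sketch of what that version might look like, assuming the same shape as before but going through a ref to the start of the span, so the JIT has nothing to bounds-check:

    // requires: using System; using System.Runtime.CompilerServices; using System.Runtime.InteropServices;
    static int FilterCmp_Unroll(Span<long> items)
    {
        ref long itemsRef = ref MemoryMarshal.GetReference(items);
        int output = 0;
        int i = 0;
        for (; i + 4 <= items.Length; i += 4)
        {
            long a = Unsafe.Add(ref itemsRef, i + 0);
            long b = Unsafe.Add(ref itemsRef, i + 1);
            long c = Unsafe.Add(ref itemsRef, i + 2);
            long d = Unsafe.Add(ref itemsRef, i + 3);
            if (a >= 0) Unsafe.Add(ref itemsRef, output++) = a;
            if (b >= 0) Unsafe.Add(ref itemsRef, output++) = b;
            if (c >= 0) Unsafe.Add(ref itemsRef, output++) = c;
            if (d >= 0) Unsafe.Add(ref itemsRef, output++) = d;
        }
        for (; i < items.Length; i++)          // remainder
        {
            long cur = Unsafe.Add(ref itemsRef, i);
            if (cur >= 0) Unsafe.Add(ref itemsRef, output++) = cur;
        }
        return output;
    }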

With that, we expect to have no range checks and still be able to benefit from the unrolling.

| Method           | N        | Mean            | Error         | StdDev        | Ratio | RatioSD | Code Size |
|------------------|----------|-----------------|---------------|---------------|-------|---------|-----------|
| FilterCmp        | 23       | 275.4 ns        | 2.31 ns       | 2.05 ns       | 1.00  | 0.00    | 411 B     |
| FilterCmp_Unroll | 23       | 253.6 ns        | 2.59 ns       | 2.42 ns       | 0.92  | 0.01    | 563 B     |
| FilterCmp        | 1047     | 741.6 ns        | 5.95 ns       | 5.28 ns       | 1.00  | 0.00    | 411 B     |
| FilterCmp_Unroll | 1047     | 665.5 ns        | 2.38 ns       | 2.22 ns       | 0.90  | 0.01    | 563 B     |
| FilterCmp        | 1048599  | 497,624.9 ns    | 3,904.39 ns   | 3,652.17 ns   | 1.00  | 0.00    | 411 B     |
| FilterCmp_Unroll | 1048599  | 444,489.0 ns    | 2,524.45 ns   | 2,361.38 ns   | 0.89  | 0.01    | 563 B     |
| FilterCmp        | 33554455 | 29,781,164.3 ns | 361,625.63 ns | 320,571.70 ns | 1.00  | 0.00    | 411 B     |
| FilterCmp_Unroll | 33554455 | 29,954,093.9 ns | 588,614.32 ns | 916,401.59 ns | 1.01  | 0.04    | 563 B     |

That helped, by quite a lot it seems, for most cases. The 32M items case, however, was slightly slower, which is quite a surprise.

Looking at the assembly, I can see that we still have branches, like so:

And here is why this is the case:

Now, can we do better here? It turns out that we can, by using a branchless version of the operation. Here is another way to write the same thing:
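
A sketch of what the body of each unrolled slot might become (the exact bit-trick used here is my own choice for illustration, not the original snippet):

    long cur = Unsafe.Add(ref itemsRef, i + 0);
    Unsafe.Add(ref itemsRef, output) = cur;        // write the value unconditionally
    output += (int)((ulong)~cur >> 63);            // +1 only when cur >= 0, with no branch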

What happens here is that we are unconditionally setting the value in the array, but only incrementing the output index if the value is greater than or equal to zero. That saves us branches and will likely result in less code. In fact, let’s see what sort of assembly the JIT will output:

What about the performance? I decided to pit the two versions (normal and branchless) head to head and see what this will give us:

| Method                       | N        | Mean            | Error         | StdDev        | Ratio | Code Size |
|------------------------------|----------|-----------------|---------------|---------------|-------|-----------|
| FilterCmp_Unroll             | 23       | 276.3 ns        | 4.13 ns       | 3.86 ns       | 1.00  | 411 B     |
| FilterCmp_Unroll_Branchleses | 23       | 263.6 ns        | 0.95 ns       | 0.84 ns       | 0.96  | 547 B     |
| FilterCmp_Unroll             | 1047     | 743.7 ns        | 9.41 ns       | 8.80 ns       | 1.00  | 411 B     |
| FilterCmp_Unroll_Branchleses | 1047     | 733.3 ns        | 3.54 ns       | 3.31 ns       | 0.99  | 547 B     |
| FilterCmp_Unroll             | 1048599  | 502,631.1 ns    | 3,641.47 ns   | 3,406.23 ns   | 1.00  | 411 B     |
| FilterCmp_Unroll_Branchleses | 1048599  | 495,590.9 ns    | 335.33 ns     | 297.26 ns     | 0.99  | 547 B     |
| FilterCmp_Unroll             | 33554455 | 29,356,331.7 ns | 207,133.86 ns | 172,966.15 ns | 1.00  | 411 B     |
| FilterCmp_Unroll_Branchleses | 33554455 | 29,709,835.1 ns | 86,129.58 ns  | 71,922.10 ns  | 1.01  | 547 B     |

Surprisingly enough, it looks like the branchless version is very slightly slower. I would have expected reducing the branches to be more efficient.

Looking at the assembly of those two, the branchless version is slightly bigger (10 bytes, not that meaningful). I think that the key here is that there is a 0.5% chance of actually hitting the branch, which is pretty low. That means that the branch predictor can likely do a really good job and we aren’t going to see any big benefits from the branchless version.

That said… what would happen if we tested that with 5% negatives? That difference in behavior may cause us to see a different result. I tried that, and the results were quite surprising. In the case of the 1K and 32M items, we see a slight cost for the branchless version (an additional 1% – 4%), while for the 1M entries there is an 18% reduction in latency for the branchless version.

I ran the tests again with a 15% chance of negatives, to see what would happen. In that case, we get:

| Method                       | N        | Mean             | Error           | StdDev          | Ratio | RatioSD | Code Size |
|------------------------------|----------|------------------|-----------------|-----------------|-------|---------|-----------|
| FilterCmp_Unroll             | 23       | 273.5 ns         | 3.66 ns         | 3.42 ns         | 1.00  | 0.00    | 537 B     |
| FilterCmp_Unroll_Branchleses | 23       | 280.2 ns         | 4.85 ns         | 4.30 ns         | 1.03  | 0.02    | 547 B     |
| FilterCmp_Unroll             | 1047     | 1,675.7 ns       | 29.55 ns        | 27.64 ns        | 1.00  | 0.00    | 537 B     |
| FilterCmp_Unroll_Branchleses | 1047     | 1,676.3 ns       | 16.97 ns        | 14.17 ns        | 1.00  | 0.02    | 547 B     |
| FilterCmp_Unroll             | 1048599  | 2,206,354.4 ns   | 6,141.19 ns     | 5,444.01 ns     | 1.00  | 0.00    | 537 B     |
| FilterCmp_Unroll_Branchleses | 1048599  | 1,688,677.3 ns   | 11,584.00 ns    | 10,835.68 ns    | 0.77  | 0.01    | 547 B     |
| FilterCmp_Unroll             | 33554455 | 205,320,736.1 ns | 2,757,108.01 ns | 2,152,568.58 ns | 1.00  | 0.00    | 537 B     |
| FilterCmp_Unroll_Branchleses | 33554455 | 199,520,169.4 ns | 2,097,285.87 ns | 1,637,422.86 ns | 0.97  | 0.01    | 547 B     |

As you can see, we have basically the same cost under 15% negatives for small values, a big improvement on the 1M scenario and not much improvement on the 32M scenario.

All in all, that is very interesting information. Digging into the exact why and how of that means pulling out a CPU instruction profiler and starting to look at where we have stalls, which is a bit further than I want to invest in this scenario.

What if we try to rearrange the code a little bit? The code currently looks like this (load the value and call AddToOutput() immediately):

AddToOutput(ref itemsRef, Unsafe.Add(ref itemsRef, i + 0));

What if we split it a little bit, so the code will look like this:
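
A sketch of the split loop body (keeping the AddToOutput() helper from the snippet above and assuming the rest of the loop is unchanged):

    // first issue all four loads...
    long a = Unsafe.Add(ref itemsRef, i + 0);
    long b = Unsafe.Add(ref itemsRef, i + 1);
    long c = Unsafe.Add(ref itemsRef, i + 2);
    long d = Unsafe.Add(ref itemsRef, i + 3);
    // ...then consume them, giving the memory fetches more time to complete
    AddToOutput(ref itemsRef, a);
    AddToOutput(ref itemsRef, b);
    AddToOutput(ref itemsRef, c);
    AddToOutput(ref itemsRef, d);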

The idea here is that we are trying to get the JIT / CPU to fetch the items before they are actually needed, so there would be more time for the memory to arrive.

Remember that for the 1M scenario, we are dealing with 8MB of memory, and for the 32M scenario, we have 256MB. When we look at the loop prologue, we can see that it is indeed first fetching all the items from memory and then doing the work:

In terms of performance, that gives us a small win (1% – 2% range) for the 1M and 32M entries scenario.

The last thing that I wanted to test is unrolling the loop even further: what would happen if we did 8 items per loop iteration, instead of 4?

There is some improvement (4% in the 1K scenario, 1% in the 32M scenario), but also a slowdown (2% in the 1M scenario).

I think that this is probably roughly the end of the line as far as we can get for scalar code.

We already made quite a few strides in trying to parallelize the work the CPU is doing by just laying out the code as we would like it to be. We tried to control the manner in which it touches memory and in general, those are pretty advanced techniques.

To close this post, I would like to take a look at the gains we got. I’m comparing the first version of the code, the last version from the previous post, and the unrolled versions (both branchy and branchless) with 8 operations at once and memory prefetching.

| Method                        | N        | Mean            | Error         | StdDev        | Ratio | RatioSD | Code Size |
|-------------------------------|----------|-----------------|---------------|---------------|-------|---------|-----------|
| FilterCmp                     | 23       | 277.3 ns        | 0.69 ns       | 0.64 ns       | 1.00  | 0.00    | 411 B     |
| FilterCmp_NoRangeCheck        | 23       | 270.7 ns        | 0.42 ns       | 0.38 ns       | 0.98  | 0.00    | 397 B     |
| FilterCmp_Unroll_8            | 23       | 257.6 ns        | 1.45 ns       | 1.21 ns       | 0.93  | 0.00    | 672 B     |
| FilterCmp_Unroll_8_Branchless | 23       | 259.9 ns        | 1.96 ns       | 1.84 ns       | 0.94  | 0.01    | 682 B     |
| FilterCmp                     | 1047     | 754.3 ns        | 1.38 ns       | 1.22 ns       | 1.00  | 0.00    | 411 B     |
| FilterCmp_NoRangeCheck        | 1047     | 749.0 ns        | 1.81 ns       | 1.69 ns       | 0.99  | 0.00    | 397 B     |
| FilterCmp_Unroll_8            | 1047     | 647.2 ns        | 2.23 ns       | 2.09 ns       | 0.86  | 0.00    | 672 B     |
| FilterCmp_Unroll_8_Branchless | 1047     | 721.2 ns        | 1.23 ns       | 1.09 ns       | 0.96  | 0.00    | 682 B     |
| FilterCmp                     | 1048599  | 499,675.6 ns    | 2,639.97 ns   | 2,469.43 ns   | 1.00  | 0.00    | 411 B     |
| FilterCmp_NoRangeCheck        | 1048599  | 494,388.4 ns    | 600.46 ns     | 532.29 ns     | 0.99  | 0.01    | 397 B     |
| FilterCmp_Unroll_8            | 1048599  | 426,940.7 ns    | 1,858.57 ns   | 1,551.99 ns   | 0.85  | 0.01    | 672 B     |
| FilterCmp_Unroll_8_Branchless | 1048599  | 483,940.8 ns    | 517.14 ns     | 458.43 ns     | 0.97  | 0.00    | 682 B     |
| FilterCmp                     | 33554455 | 30,282,334.8 ns | 599,306.15 ns | 531,269.30 ns | 1.00  | 0.00    | 411 B     |
| FilterCmp_NoRangeCheck        | 33554455 | 29,410,612.5 ns | 29,583.56 ns  | 24,703.61 ns  | 0.97  | 0.02    | 397 B     |
| FilterCmp_Unroll_8            | 33554455 | 29,102,708.3 ns | 42,824.78 ns  | 40,058.32 ns  | 0.96  | 0.02    | 672 B     |
| FilterCmp_Unroll_8_Branchless | 33554455 | 29,761,841.1 ns | 48,108.03 ns  | 42,646.51 ns  | 0.98  | 0.02    | 682 B     |

The unrolled 8 version is the winner by far, in this scenario (0.5% negatives). Since that is the scenario we have in the real code, that is what I’m focusing on.

Is there anything left to do here?

My next step is to explore whether using vector instructions will be a good option for us.

time to read 7 min | 1247 words

While working deep on the guts of RavenDB, I found myself with a seemingly simple task. Given a list of longs, I need to filter out all negative numbers as quickly as possible.

The actual scenario is that we run a speculative algorithm: given a potentially large list of items, we check whether we can fulfill the request in an optimal fashion. However, if that isn’t possible, we need to switch to a slower code path that does more work.

Conceptually, this looks something like this:
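
A conceptual sketch (the names here are illustrative placeholders, except for RunManually(), which comes up again below):

    long[] Process(Span<long> ids)
    {
        if (TryRunOptimized(ids, out long[] results))   // speculative fast path
            return results;

        // fall back to the slower code path that does more work
        return RunManually(ids);
    }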

That is the setup for this story. The problem is that we now need to filter the results we pass to the RunManually() method.

There is a problem here, however. We marked the entries that we already used in the list by negating them. The issue is that RunManually() does not allow negative values, and its internal implementation is not friendly to ignoring those values.

In other words, given a Span<long>, I need to write the code that would filter out all the negative numbers. Everything else about the list of numbers should remain the same (the order of elements, etc).

From a coding perspective, this is as simple as:
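
Something along these lines (a sketch of the obvious, allocation-heavy approach):

    // requires: using System; using System.Linq;
    static int FilterNegatives(Span<long> items)
    {
        long[] results = items.ToArray()             // copy to an array...
                              .Where(x => x >= 0)    // ...filter with a lambda...
                              .ToArray();            // ...and materialize the result
        results.CopyTo(items);                       // keep the survivors in place
        return results.Length;
    }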

Please note, just looking at this code makes me cringe a lot. This does the work, but it has an absolutely horrible performance profile. It allocates multiple arrays, uses a lambda, etc.

We don’t actually care about the entries here, so we are free to modify them without allocating a new value. As such, let’s see what kind of code we can write to do this work in an efficient manner. Here is what I came up with:
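
A sketch along the lines of the FilterCmp method that shows up in the benchmarks below (my reconstruction, not the original code):

    // requires: using System;
    static int FilterCmp(Span<long> items)
    {
        int output = 0;
        for (int i = 0; i < items.Length; i++)
        {
            long cur = items[i];
            if (cur >= 0)
                items[output++] = cur;   // compact the non-negative values toward the start
        }
        return output;
    }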

The way this works is that we scan through the list, skipping the negative values, so we effectively “move down” all the non-negative values on top of the negative ones. This has a cost of O(N) and will modify the entire array; the final output is the number of valid items that we have there.

In order to test the performance, I wrote the following harness:
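
A sketch of what such a BenchmarkDotNet harness could look like (the parameter values come from the result tables below; the setup details and the sprinkling logic are assumptions, and only the comparison-based version is shown here):

    // requires: using System; using BenchmarkDotNet.Attributes;
    [DisassemblyDiagnoser]
    public class FilterBenchmarks
    {
        [Params(23, 1047, 1048599, 33554455)]
        public int N;

        private long[] _items;

        [GlobalSetup]
        public void Setup() => _items = new long[N];

        // the benchmark mutates the array, so the negatives must be re-sprinkled before each call
        [IterationSetup]
        public void SprinkleNegatives()
        {
            Array.Fill(_items, 1L);
            var rnd = new Random(2391);
            for (int i = 0; i < Math.Max(1, _items.Length / 200); i++)   // roughly 0.5% negatives
                _items[rnd.Next(_items.Length)] = -1;
        }

        [Benchmark(Baseline = true)]
        public int FilterCmp()
        {
            Span<long> items = _items;
            int output = 0;
            for (int i = 0; i < items.Length; i++)
            {
                long cur = items[i];
                if (cur >= 0)
                    items[output++] = cur;
            }
            return output;
        }
    }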

We compare 1K, 1M and 32M element arrays, each of which has about 0.5% negatives, randomly spread across the range. Because we modify the values directly, we need to sprinkle the negatives across the array before each call. In this case, I’m testing two options for this task, one that uses a direct comparison (shown above) and one that uses a bitwise or, like so:

I’m testing the cost of sprinkling negatives as well, since that has to be done before each benchmark call (since we modify the array during the call, we need to “reset” its state for the next one).

Given the two options, before we discuss the results, what would you expect to be the faster option? How would the size of the array matter here?

I really like this example, because it is simple, there isn’t any real complexity in what we are trying to do. And there is a very straightforward implementation that we can use as our baseline. That also means that I get to analyze what is going on at a very deep level. You might have noticed the disassembler attribute on the benchmark code, we are going to dive deep into that. For the same reason, we aren’t using exactly 1K, 1M, or 32M arrays, but slightly higher than that, so we’ll have to deal with remainders later on.

Let’s first look at what the JIT actually did here. Here is the annotated assembly for the FilterCmp function:

For the FilterOr, the code is pretty much the same, except that the key part is:

As you can see, the cmp option is slightly smaller, in terms of code size. In terms of performance, we have:

| Method    | N        | Mean            |
|-----------|----------|-----------------|
| FilterOr  | 1047     | 745.6 ns        |
| FilterCmp | 1047     | 745.8 ns        |
| FilterOr  | 1048599  | 497,463.6 ns    |
| FilterCmp | 1048599  | 498,784.8 ns    |
| FilterOr  | 33554455 | 31,427,660.7 ns |
| FilterCmp | 33554455 | 30,024,102.9 ns |

The costs are very close to one another, with Or being very slightly faster on the smaller sizes and Cmp being slightly faster on the larger ones. Note that the difference between them is basically noise; they have the same performance.

The question is, can we do better here?

Looking at the assembly, there is an extra range check in the main loop that the JIT couldn’t elide (the call to items[output++]). Can we do something about it, and would it make any difference in performance? Here is how I can remove the range check:
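
A sketch of the no-range-check version, under the assumption that it keeps the same loop but goes through a ref to the start of the span:

    // requires: using System; using System.Runtime.CompilerServices; using System.Runtime.InteropServices;
    static int FilterCmp_NoRangeCheck(Span<long> items)
    {
        ref long itemsRef = ref MemoryMarshal.GetReference(items);
        int output = 0;
        for (int i = 0; i < items.Length; i++)
        {
            long cur = Unsafe.Add(ref itemsRef, i);
            if (cur >= 0)
                Unsafe.Add(ref itemsRef, output++) = cur;   // no bounds check on the write
        }
        return output;
    }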

Here I’m telling the JIT: “I know what I’m doing”, and it shows.

Let’s look at the assembly changes between those two methods, first the prolog:

Here you can see what we are actually doing. Note the last 4 instructions: we have a range check for the items, and then we have another check for the loop. The first will get you an exception, the second will just skip the loop. In both cases, we test the exact same thing. The JIT had a chance to actually optimize that, but didn’t.

Here is a funny scenario where adding code may reduce the amount of code generated. Let’s do another version of this method:
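
A sketch of the guarded variant, assuming the only change is the check at the top:

    static int FilterCmp_NoRangeCheck(Span<long> items)
    {
        if (items.Length == 0)       // the new guard clause
            return 0;

        ref long itemsRef = ref MemoryMarshal.GetReference(items);
        int output = 0;
        for (int i = 0; i < items.Length; i++)
        {
            long cur = Unsafe.Add(ref itemsRef, i);
            if (cur >= 0)
                Unsafe.Add(ref itemsRef, output++) = cur;
        }
        return output;
    }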

In this case, I added a check to handle the scenario of items being empty. What can the JIT do with this now? It turns out, quite a lot. We dropped 10 bytes from the method, which is a nice result of our diet.  Here is the annotated version of the assembly:

A lot of the space savings in this case come from just not having to do a range check, but you’ll note that we still do an extra check there (lines 12..13), even though we already checked that. I think that the JIT knows that the value is not zero at this point, but has to consider that the value may be negative.

If we change the initial guard clause to items.Length <= 0, what do you think will happen? At this point, the JIT is smart enough to just elide everything; we are at 55 bytes of code and it is a super clean assembly (not a sentence I ever thought I would use). I’ll spare you going through more assembly listings, but you can find the output here.

And after all of that, where are we at?

| Method                 | N        | Mean            | Error         | StdDev        | Ratio | RatioSD | Code Size |
|------------------------|----------|-----------------|---------------|---------------|-------|---------|-----------|
| FilterCmp              | 23       | 274.5 ns        | 1.91 ns       | 1.70 ns       | 1.00  | 0.00    | 411 B     |
| FilterCmp_NoRangeCheck | 23       | 269.7 ns        | 1.33 ns       | 1.24 ns       | 0.98  | 0.01    | 397 B     |
| FilterCmp              | 1047     | 744.5 ns        | 4.88 ns       | 4.33 ns       | 1.00  | 0.00    | 411 B     |
| FilterCmp_NoRangeCheck | 1047     | 745.8 ns        | 3.44 ns       | 3.22 ns       | 1.00  | 0.00    | 397 B     |
| FilterCmp              | 1048599  | 502,608.6 ns    | 3,890.38 ns   | 3,639.06 ns   | 1.00  | 0.00    | 411 B     |
| FilterCmp_NoRangeCheck | 1048599  | 490,669.1 ns    | 1,793.52 ns   | 1,589.91 ns   | 0.98  | 0.01    | 397 B     |
| FilterCmp              | 33554455 | 30,495,286.6 ns | 602,907.86 ns | 717,718.92 ns | 1.00  | 0.00    | 411 B     |
| FilterCmp_NoRangeCheck | 33554455 | 29,952,221.2 ns | 442,176.37 ns | 391,977.84 ns | 0.99  | 0.02    | 397 B     |

There is a very slight benefit to the NoRangeCheck, but even when we talk about 32M items, we aren’t talking about a lot of time.

The question is, what can we do better here?

time to read 4 min | 723 words

Deep inside of the Corax indexing engine inside of RavenDB there is the notion of a posting list. A posting list is just an ordered set of entry ids that contains a particular term. During the indexing process, we need to add and remove items from that posting list. This ends up being something like this:

For fun, go and ask ChatGPT to write you the code for this task.

You can assume that there are no duplicates between the removals and additions, and that adding an existing item is a no-op (so just one value would be in the end result). Here is a quick solution for this task (not actually tested that much, mind, but sufficient to understand what I’m trying to do):
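
A sketch of what such a straightforward merge could look like (the spans are assumed to be sorted, and the result is written to a caller-supplied buffer):

    // requires: using System;
    static int MergePostingList(Span<long> existing, Span<long> additions, Span<long> removals, Span<long> output)
    {
        int e = 0, a = 0, r = 0, o = 0;
        while (e < existing.Length || a < additions.Length)
        {
            long candidate;
            if (a >= additions.Length || (e < existing.Length && existing[e] <= additions[a]))
            {
                candidate = existing[e++];
                if (a < additions.Length && additions[a] == candidate)
                    a++;                                  // adding an existing item is a no-op
            }
            else
            {
                candidate = additions[a++];
            }

            while (r < removals.Length && removals[r] < candidate)
                r++;
            if (r < removals.Length && removals[r] == candidate)
            {
                r++;                                      // this entry was removed, skip it
                continue;
            }
            output[o++] = candidate;
        }
        return o;
    }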

If you look at this code in terms of performance, you’ll realize that this is quite expensive. In terms of complexity, this is actually pretty good: we iterate over the arrays just once, and the number of comparisons is bounded by the lengths of the lists.

However, there is a big issue here, the number of branches that you have to deal with. Basically, every if and every for loop is going to add a tiny bit of latency to the system. This is because these are unpredictable branches, which are pretty nasty to deal with.

It turns out that the values that we put in the posting list are actually always a multiple of 4, so the bottom 2 bits are always cleared. That means that we actually have a different way to deal with it. Here is the new logic:
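
A sketch of that approach (written for clarity, like the original; the real code works over raw pointers and uses vxsort rather than the BCL sort, and the scratch buffer is assumed to be large enough to hold all three inputs):

    // requires: using System;
    static int MergeSorted(Span<long> existing, Span<long> additions, Span<long> removals, Span<long> scratch)
    {
        int n = 0;
        existing.CopyTo(scratch);
        n += existing.Length;
        additions.CopyTo(scratch[n..]);
        n += additions.Length;
        for (int i = 0; i < removals.Length; i++)
            scratch[n++] = removals[i] | 1;        // values are multiples of 4, so bit 0 is free

        Span<long> all = scratch[..n];
        all.Sort();                                // a removal marker sorts right after its value

        int outputIdx = 0;
        for (int i = 0; i < all.Length; i++)
        {
            long cur = all[i];
            long prev = outputIdx > 0 ? all[outputIdx - 1] : -1;
            // step back over the previous entry if this is a duplicate of it (or its removal marker)
            outputIdx -= (cur & ~1L) == prev ? 1 : 0;
            all[outputIdx] = cur & ~1L;
            // advance only if this entry is not a removal marker
            outputIdx += 1 - (int)(cur & 1);
        }
        return outputIdx;                          // surviving entries are at the start of scratch
    }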

This code was written with an eye to being able to explain the algorithm, mind, not performance.

The idea goes like this. We flag the removals with a bit, then concatenate all the arrays together, sort them, and then do a single scan over the whole thing, removing duplicates and removals.

In the real code, we are using raw pointers, not a List, so there are no access checks, etc.

From an algorithmic perspective, this code makes absolutely no sense at all. We concatenate all the values together, then sort them (an O(N log N) operation), then scan it again?!

How can that be faster than a single scan across all three arrays? The answer is simple, we have a really efficient sort primitive (vxsort) that is able to sort things really fast (GB/sec). There is a really good series of posts that explain how that is achieved.

Since we consider sorting to be cheap, the rest of the work is just a single scan on the list, and there are no branches at all there. The code plays with the offset that we write into, figuring out whether we need to overwrite the current value (duplicate) or go back (removal), but in general it means that it can execute very quickly.

This approach also has another really important aspect. Take a look at the actual code that we have in production. This is from about an hour’s worth of profiling a busy indexing session:

[profiler output: the production merge code]

And the more common code path:

[profiler output: the more common code path]

In both of them, you’ll notice something really important. There isn’t a call to sorting at all in here. In fact, when I search for the relevant function, I find:

[profiler output: the sort function]

That is 25 ms out of over an hour.

How can this be? As efficient as the sorting can be, we are supposed to be calling it a lot.

Well, consider one scenario, what happens if:

  • There are no removals
  • All additions happen after the last existing item in the list

In this case, I don’t need to do anything beyond concatenate the lists. I can skip the entire process entirely, just copy the existing and additions to the output and call it a day.

Even when I do have a lot of removals and complicated merge processes, the code structure means that the CPU can get through this code very quickly. This isn’t super friendly for humans to read, but for the CPU, this is chump change.

time to read 3 min | 479 words

At some point in any performance optimization sprint, you are going to run into a super annoying problem: The dictionary.

The reasoning is quite simple. One of the most powerful optimization techniques is to use a cache, which is usually implemented as a dictionary. Today’s tale is about a dictionary, but surprisingly enough, not about a cache.

Let’s set up the background, I’m looking at optimizing a big indexing batch deep inside RavenDB, and here is my current focus:

[profiler output: RecordTermsForEntries]

You can see that RecordTermsForEntries takes 4% of the overall indexing time. That is… a lot, as you can imagine.

What is more interesting here is why. The simplified version of the code looks like this:
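
The shape of the code is probably along these lines (a sketch with illustrative names, not the real RavenDB types; the “optimized method” mentioned below is illustrated here with CollectionsMarshal.GetValueRefOrAddDefault, which is the BCL way to fold a lookup and an add into one call):

    // requires: using System.Collections.Generic; using System.Runtime.InteropServices;
    static void RecordTermsForEntries(
        Dictionary<long, List<long>> termsPerEntry,
        IEnumerable<(long EntryId, long TermId)> recordedTerms)
    {
        foreach (var (entryId, termId) in recordedTerms)
        {
            // single lookup: get a ref to the value slot, adding a default entry if needed
            ref List<long> terms = ref CollectionsMarshal.GetValueRefOrAddDefault(termsPerEntry, entryId, out bool exists);
            if (!exists)
                terms = new List<long>();
            terms.Add(termId);
        }
    }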

Basically, we are registering, for each entry, all the terms that belong to it. This is complicated by the fact that we are doing the process in stages:

  1. Create the entries
  2. Process the terms for the entries
  3. Write the terms to persistent storage (giving them the recorded term id)
  4. Update the entries to record the term ids that they belong to

The part of the code that we are looking at now is the last one, where we already wrote the terms to persistent storage and we need to update the entries. This is needed so when we read them, we’ll be able to find the relevant terms.

At any rate, you can see that this method’s cost is absolutely dominated by the dictionary call. In fact, we are actually using an optimized method here to avoid doing a TryGetValue() and then an Add() in case the value is not already in the dictionary.

If we look at the metrics, this is actually kind of awesome. We are calling the dictionary almost 400 million times and it is able to do the work in under 200 nanoseconds per call.

That is pretty awesome, but that still means that we have over 2% of our total indexing time spent doing lookups. Can we do better?

In this case, absolutely. Here is how this works: instead of doing a dictionary lookup, we are going to store a list, and the entry will record the index of its item in the list. Here is what this looks like:
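
A sketch of the idea (all names here are illustrative assumptions): the list replaces the dictionary, and each entry carries the index of its slot instead of being used as a dictionary key.

    // requires: using System.Collections.Generic;
    sealed class IndexEntry
    {
        public int TermsListIndex;               // recorded when the entry is created
    }

    sealed class TermsRecorder
    {
        private readonly List<List<long>> _termsPerEntry = new();

        public IndexEntry CreateEntry()
        {
            var entry = new IndexEntry { TermsListIndex = _termsPerEntry.Count };
            _termsPerEntry.Add(new List<long>());
            return entry;
        }

        public void RecordTerm(IndexEntry entry, long termId)
        {
            // a direct index into the list, no dictionary lookup at all
            _termsPerEntry[entry.TermsListIndex].Add(termId);
        }
    }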

There isn’t much to this process, I admit. I was lucky that in this case, we were able to reorder things in such a way that skipping the dictionary lookup is a viable method.

In other cases, we would need to record the index at the creation of the entry (effectively reserving the position) and then use that later.

And the result is…

[profiler output: after the change]

That is pretty good, even if I say so myself. The cost went down from 3.6 microseconds per call to 1.3 microseconds. That is almost a threefold improvement.

time to read 2 min | 380 words

I was looking into reducing the allocation in a particular part of our code, and I ran into what was basically the following code (boiled down to the essentials):
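
A hypothetical reconstruction of the shape of that code (boiled down even further; the buffer is assumed to hold len characters that need to be written out as UTF-8):

    // requires: using System.IO; using System.Text;
    static void Write(Stream output, char[] buffer, int len)
    {
        string str = new string(buffer, 0, len);        // build a string from the chars
        byte[] bytes = Encoding.UTF8.GetBytes(str);     // encode it to UTF-8
        output.Write(bytes, 0, bytes.Length);
    }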

As you can see, this does a lot of allocations. The actual method in question was a pretty good size, and all those operations happened in different locations and weren’t as obvious.

Take a moment to look at the code, how many allocations can you spot here?

The first one, obviously, is the string allocation, but there is another one inside the call to GetBytes(). Let’s fix that first by allocating the buffer once (I’m leaving aside the allocation of the reusable buffer; you can assume it is big enough to cover all our needs):

For that matter, we can also easily fix the second problem, by avoiding the string allocation:

That is a few minutes of work, and we are good to go. This method is called a lot, so we can expect a huge reduction in the amount of memory that we allocated.

Except… that didn’t happen. In fact, the amount of memory that we allocate remained pretty much the same. Digging into the details, we allocate roughly the same number of byte arrays (how!) and instead of allocating a lot of strings, we now allocate a lot of character arrays.

I broke the code apart into multiple lines, which made things a lot clearer. (In fact, I threw that into SharpLab, to be honest). Take a look:

This code: buffer[..len] is actually translated to:

char[] charBuffer = RuntimeHelpers.GetSubArray(buffer, Range.EndAt(len));

That will, of course, allocate. I had to change the code to be very explicit about the types that I wanted to use:
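
A sketch of the explicit version (same assumed char buffer and reusable byte array as above):

    // requires: using System; using System.IO; using System.Text;
    static void Write(Stream output, char[] buffer, int len, byte[] reusableBuffer)
    {
        ReadOnlySpan<char> chars = buffer;                                    // explicit: a span from here on
        int written = Encoding.UTF8.GetBytes(chars[..len], reusableBuffer);   // slicing a span doesn't allocate
        output.Write(reusableBuffer, 0, written);
    }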

This will not allocate, but if you note the changes in the code, you can see that the use of var in this case really tripped me up: because of the number of overloads, the automatic coercion to the types I expected didn’t happen.

For that matter, note that any slicing on arrays will generate a new array, including this code:
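
For example (my own illustration):

    var copy = buffer[..];   // even a full-range slice of an array produces a new array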

This makes perfect sense when you realize what is going on, but it can still be a big surprise. I looked at the code a lot before I figured out what was happening, and that was with a profiler output that pinpointed the fault.

time to read 3 min | 533 words

Measuring the length of time that a particular piece of code takes is a surprisingly challenging task. There are two aspects to this: the first is how you ensure that the cost of getting the start and end times won’t interfere with the work you are doing. The second is how to actually get the time (potentially many times a second) in as efficient a way as possible.

To give some context, Andrey Akinshin does a great overview of how the Stopwatch class works in C#. On Linux, that is basically calling the clock_gettime system call, except that this is not actually a system call. It is a piece of code that the Kernel sticks inside your process, which will then integrate with other aspects of the Kernel to optimize this. The idea is that this call is so frequent that you cannot pay the cost of the Kernel mode transition. There is a good coverage of this here.

In short, that is a very well-known problem and quite a lot of brainpower has been dedicated to solving it. And then we reached this situation:

[profiler output: time spent gathering metrics]

What you are seeing here is us testing the indexing process of RavenDB under the profiler. This is indexing roughly 100M documents, and according to the profiler, we are spending 15% of our time gathering metrics?

The StatsScope.Start() method simply calls Stopwatch.Start(), so we are basically looking at a profiler output that says that Stopwatch is accounting for 15% of our runtime?

Sorry, I don’t believe that. I mean, it is possible, but it seems far-fetched.

In order to test this, I wrote a very simple program, which will generate 100K integers and test whether they are prime or not. I’m doing that to test compute-bound work, basically, and testing calling Start() and Stop() either across the whole loop or in each iteration.
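
A sketch of that kind of test (the structure and the numbers used here are my own assumptions):

    // requires: using System; using System.Diagnostics;
    var rnd = new Random(1234);
    var numbers = new int[100_000];
    for (int i = 0; i < numbers.Length; i++)
        numbers[i] = rnd.Next();

    // variant 1: no Stopwatch inside the loop
    var withoutStopwatch = Stopwatch.StartNew();
    int primes1 = 0;
    foreach (var n in numbers)
        if (IsPrime(n)) primes1++;
    withoutStopwatch.Stop();

    // variant 2: the same loop, with a Start()/Stop() pair around every iteration
    var withStopwatch = Stopwatch.StartNew();
    var inner = new Stopwatch();
    int primes2 = 0;
    foreach (var n in numbers)
    {
        inner.Start();
        if (IsPrime(n)) primes2++;
        inner.Stop();
    }
    withStopwatch.Stop();

    Console.WriteLine($"without: {withoutStopwatch.ElapsedMilliseconds} ms, per-iteration: {withStopwatch.ElapsedMilliseconds} ms");

    static bool IsPrime(int n)
    {
        if (n < 2) return false;
        for (int i = 2; i * i <= n; i++)
            if (n % i == 0) return false;
        return true;
    }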

I run that a few times and I’m getting:

  • Windows: 311 ms with Stopwatch per iteration and 312 ms without
  • Linux: 450 ms with Stopwatch per iteration and 455 ms without

On Linux, there is about a 5 ms overhead if we use a per-iteration stopwatch; on Windows, it is either the same cost or slightly cheaper with the per-iteration stopwatch.

Here is the profiler output on Windows:

[profiler output on Windows]

And on Linux:

[profiler output on Linux]

Now, that is what happens when we are doing a significant amount of work, what happens if the amount of work is negligible? I made the IsPrime() method very cheap, and I got:

[profiler output with a cheap IsPrime()]

So that is a good indication that this isn’t free, but still…

Comparing the costs, it is utterly ridiculous that the profiler says that so much time is spent in those methods.

Another aspect here may be the issue of the profiler impact itself. There are differences between using Tracing and Sampling methods, for example.

I don’t have an answer, just a lot of very curious questions.

time to read 9 min | 1766 words

FastPFor is an integer compression algorithm that was initially published in 2012. You can read the paper about it here: Decoding billions of integers per second through vectorization.

I’ve run into this algorithm many times in the past. You pretty much can’t be in the database arena and not run into that. It is an interesting paper, and it has a GitHub repository with the code, which is great. Except that I couldn’t figure out what was going on there.

I actually tried stepping through the code a bunch of times, and I always ended up getting lost. The code is in C++ and makes heavy use of templates, containers and non-trivial amounts of magic. To give some context, I gave up when I ran into these code snippets:

The paper itself describes the algorithm, but not in a way that actually made sense to me. I tried looking at the code, and I basically noped out. Too complex for me to understand, it seems.

But I kept running into this, and we recently had a strong need for something similar. So I basically took the time to meditate on this for a while.

On a personal level, I realized that I was trying to understand how the whole thing worked by just stepping through the code and reading the paper. The problem was that I was mixing several different layers and wasn’t able to create the proper separation between them.

FastPFor builds upon the previously discussed simdcomp, once you understand that, you can go through this in isolation. For this reason, I first talked about SIMD bit-packing and showed an example usage, before actually diving into how FastPFor itself works.

As a reminder, simdcomp provides routines for quick packing and unpacking of integers, nothing more. FastPFor is where the actual compression happens. Let’s see what is actually going on here. Consider the following list, which we previously discussed:

1871143144, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 7984, 4, 4, 4, 4

If we try to bit-pack this list, we’ll find out that we aren’t gaining much. The first value is 31 bits in length, after all, and that means that we aren’t able to save much. We talked about Frame of Reference (FOR) as a way to handle that. We can treat every 128-number block as a block that has its own reference point and compute the maximum number of bits for that location. That would actually save us a lot of space, but it isn’t ideal. In the case above, most entries in the list can be packed in just 3 bits, except two. That is where PFor comes into play, which stands for Patched Frame of Reference.

The key aspects of FastPFor are how it figures out what is the best size to use to pack the numbers, the way it detects what should be patched, and the manner in which it stores this.

The code boils down to a single function:

What is going on here? This function accepts an array of integers (whose size is 128) and computes the best number of bits to use to pack it.

How does that work? The first thing this does is compute how many numbers we have for each bit range. This is done using the asmbits() call, which is basically a LeadingZeroCount(). The end result is a frequency map that we can use to estimate how many bits will be required to store the items from this block.

We start from the maximum number of bits that we need to store all the numbers in the block, and we go down, checking what would be the best cost (in terms of space). Since we need to store exceptions somewhere, we compute the cost of that as we go down in the list.

This function gives us:

  • bestb – the number of bits needed to pack all the items in the block that would result in the smallest output size
  • bestcexception – the count of exceptions (numbers that do not fit into bestb bits)
  • maxb – the maximum number of bits that we need to store for this block
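
To make the idea concrete, here is a sketch in C# (the original is C++; the per-exception metadata cost used here is an assumption, just enough to make the sketch self-contained):

    // requires: using System; using System.Numerics;
    static (int BestB, int BestCExceptions, int MaxB) GetBestB(ReadOnlySpan<uint> block)
    {
        // frequency map: how many values in the block need exactly N bits
        Span<int> freq = stackalloc int[33];
        foreach (uint v in block)
            freq[32 - BitOperations.LeadingZeroCount(v | 1)]++;

        int maxB = 32;
        while (maxB > 0 && freq[maxB] == 0)
            maxB--;

        int bestB = maxB, bestC = 0;
        int bestCost = maxB * block.Length;
        int exceptions = 0;
        for (int b = maxB - 1; b >= 0; b--)
        {
            exceptions += freq[b + 1];                 // values that no longer fit in b bits
            // b bits per value, plus the leftover bits and some metadata per exception
            int cost = b * block.Length + exceptions * (maxB - b + 8);
            if (cost < bestCost)
            {
                bestCost = cost;
                bestB = b;
                bestC = exceptions;
            }
        }
        return (bestB, bestC, maxB);
    }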

That is the key function, the one that is responsible for getting the best compression ratio.

What I found that made FastPFor challenging to understand is that it has a lot of state that you need to keep track of. What I described so far is a single block, but FastPFor operates over sets of numbers, and so needs to tie all of them together.

At any given point, FastPFor has:

  • The block that is being outputted
  • The metadata about the entire process
  • 32 arrays of exceptions

The process interleaves all of them together in interesting ways, and I had a hard time figuring out how it all ties together.

Let’s talk about the way FastPFor processes a single block. We process the data in the array in blocks of 128 numbers at a time, like so:

The first thing that happens is the computation of the best bits widths for the current block. Then, we use the bc value to record the metadata about the block.

That is an array of bytes with the following format:

  1. Packed number of bits
  2. Number of exceptions

If there are exceptions for this block, we also add:

  1. Maximum number of bits
  2. Offset of exception in the block (repeated as the number of exceptions)

The metadata is shared for the entire encoding operation, across blocks.

You can see in the code that if bestcexcept (which holds the number of exceptions) is greater than zero, we find the appropriate exceptions buffer to use. That requires a better explanation.

The getBestBFromData() call gave us two important data points: the best number of bits to pack the numbers, and the maximum number of bits. We are going to pack all the numbers in the block to the output, but what about the exceptions? Where do they go?

It turns out that we need to store the exceptions, but we don’t actually need to store max bits, only the difference between the best number of bits and the maximum. This is what thisexceptioncontainer is holding, the remaining bits for exceptions. It’s important to understand that this is done across blocks, so the thisexceptioncontainer value holds exceptions from many different blocks. That will turn out to be very important in just a little bit. We then scan the block, write the remaining bits to the container, and write to the metadata the offset of the value in the block. Since we are using blocks of 128 numbers, this is ensured to fit inside a byte.

The last step that we do for the block is to call packblockupsimd(), which ends up calling simdpack() and packs all the numbers from the block to bestb bits in the output.

It’s really important to understand that we still have two data items that haven’t been written. The first is the metadata for the encoding process (bits in blocks, exceptions offsets, etc). The second is the set of exceptions themselves.

This process repeats itself for each block, until the end of the buffer. It is at this point that we need to write the remaining data to the output. Here is what the code looks like:

What is going on? This is dense, and it takes a while to unpack (pun intended) what is going on here.

First, we write the size of the metadata and then we copy the metadata that we have to the output. That is the description of all the blocks that we just went over. Then, we run over the set of exceptions that we gathered. Remember, the datatobepacked is an array that holds lists of exception data for each bit range that we have to store. We iterate over all the bit widths where we wrote exception data and generate a bitmap. That will help us later during the decoding process.

Finally, we run over the same exceptions and write them out. Note that we do that using the same simd bit-packing approach as the rest of the code.

The end result is that we have the following format:

[diagram: the FastPFor output format]

For decoding, we start from the metadata offset and jump to that. Then we read the exceptions bitmap. That tells us how many exceptions we have and how many bits we have in each one of them. In the image above, you can see that we have three exception lists, for 4 bits, 12 bits and 19 bits.

We use the simd bit-packing to decode those values into memory; we’ll need them shortly. Now, we start iterating over the metadata. Each block has an overhead of a minimum of two metadata bytes (number of bits and number of exceptions). Let’s assume that we don’t have any exceptions in the block. In that case, the process of decoding is simple: just do the unpacking from bits to numbers and you are done.

If we have exceptions, on the other hand, we have to deal with them. At that point, the next metadata byte would contain the maximum number of bits for the exceptions in this block. Using that and the normal number of bits, we can tell where the extra bits are located. Here is the relevant code:

Note that there is a special case here if the difference in the number of bits between the block bits and the maximum number of bits is one. In that case, we don’t need to store the data, we already know what the value is then (it’s one, after all). So we can compute it once and set it in the output.

For scenarios where we can’t tell from the bit width alone, we read the relevant exception array based on the difference between the bits in the block and the exception bits. That gives us an array that is shared across blocks. The idea is that the metadata contains the offset in the block, and we read from the relevant array one item at a time. This two-step process will result in us setting the right value and getting ready for the next call, which may be done in a different block.

Note that the exception bit buffers only care about the number of bits, not where they come from. So two blocks, which have (bestb: 4, maxb: 9) and (bestb: 7, maxb: 10), will both go to the 3 bits container.

Okay, this blog post is getting long enough, so I think that would be it. Hopefully, it will make it easier to understand exactly how the FastPFor format is working. I’m going to be talking more about FastPFor and the format implications in the next post, then move on to actually using that in C#.

time to read 3 min | 581 words

We care a lot about the performance of RavenDB, like a whole lot. To the point where we have a dedicated team that is continuously burning money (well, CPU cycles) testing out all sorts of scenarios with RavenDB. You can see the performance page on the website for some of their results. It got to the point where we stock NVMe drives at the office because we go through them often enough that we need them available for replacement. The benchmark must run, and the numbers must rise, after all.

But the story today isn’t about the costs we pay to reach our performance goals. Rather, it is about a curious little snafu that we ran into when looking at the results. Here are the benchmark results; I intentionally stripped out everything that would give context to this story. What you can see is the same scenario being run on two identical machines, with the difference being the disk that is being used to host the database.

[benchmark graph: io1 vs gp3 disks]

In this case, the blue line is io1 disk (high IOPS, low latency, and high costs) versus gp3 (reasonable IOPS, much higher latency, and lower costs). In this case, lower numbers are better.

If you look at the benchmark, you can see that it makes complete sense. RavenDB is a database product, we are running a benchmark, and we use the disk. It’s predictable that the disk latency will have an impact on the performance of the benchmark.

Except… in this case, we are looking at a benchmark that is read-only, and it is meant to run completely from memory. We wrote it so the data size is less than the amount of memory on the system. So why do we have an impact from the disk at all in this case?

We even have a warmup phase before we actually start measuring, to ensure that everything is in memory. Something here does not line up.

After investigating deeper, we discovered the following:

  • When running the automated benchmark, the performance was always correlated to the disk type.
  • When running the same benchmark, manually, there was much better performance and no slowdown related to the slower disk.

That is a really annoying bug, because the fact that we are observing it somehow makes it go away? What is going on?

After a while, we finally figured it out. The problem was the warmup phase. Basically, the warmup is just running the benchmark itself, discarding the results.

Can you guess what the problem was?

The warmup phase runs when the system is cold (naturally), and we were hitting the server with enough requests up front that it was unable to process them all (it was queuing on the disk). That meant that a very large portion of the requests in the warmup would time out or just fail. When we started the benchmark phase, significant portions of the system were still on the disk. So the benchmark became a disk-bound test, with predictable results.

When we ran it manually, we would always do that after the benchmark had already run, so our system would be warm (and fast, with no disk access).

The solution for the problem was to scale down the number of requests that the warmup phase is running, to allow gradual loading of the data to memory, without overloading the hardware.

A case where our expectations and what really happened did not line up, creating some… interesting surprises in the end result.

time to read 2 min | 221 words

This Wednesday I’m going to be doing a webinar about RavenDB & Sharding. This is going to be the flagship feature for RavenDB 6.0, and I’m really excited to finally be talking about it in public.

Sharding involves splitting your data into multiple nodes. Similar to having different volumes of a single encyclopedia.

RavenDB’s sharding implementation is something that we have spent the past three or four years working on. That has been quite a saga to get it out. The primary issue is that we want to achieve two competing goals:

  • Allow you to scale the amount of data you have to near infinite levels.
  • Ensure that RavenDB remains simple to use and operate.

The first goal is actually fairly easy and straightforward. It is the second part that made things complicated. After a lot of work, I believe that we have a really good solution at hand.

In the webinar, I’m going to be presenting how RavenDB 6.0 implements sharding, the behavior of the system at scale, and all the details you need to know about how it works under the cover.

I’m really excited to finally be able to show off the great work of the team! Join me, it’s going to be really interesting.
