Oren Eini

CEO of RavenDB

a NoSQL Open Source Document Database

Get in touch with me:

oren@ravendb.net +972 52-548-6969

Posts: 7,583
|
Comments: 51,213
Privacy Policy · Terms
filter by tags archive
time to read 8 min | 1514 words

Today I ran into this Reddit post, detailing how Moq is now using SponsorLink to encourage users to sponsor the project.

The idea is that if you are using the project, you’ll sponsor it for some amount, which funds the project. You’ll also get something like this:

lots of thanks from ThisAssembly

This has been rolled out for some projects for quite some time, it seems. But Moq is a far more popular project and it got quite a bit of attention.

It is an interesting scenario, and I gave some thought to what this means.

I’m not a user of Moq, just to note.

I absolutely understand the desire to be paid for Open Source work. It takes a lot of time and effort and looking at the amount of usage people get out of your code compared to the compensation is sometimes ridiculous.

For myself, I can tell you that I made 800 USD out of Rhino.Mocks directly when it was one of the most popular mocking frameworks in the .NET world. That isn’t a sale, that is the total amount of compensation that I got for it directly.

I literally cannot total the number of hours that I spent on it. But OpenHub estimates it as 245 man-years. I… disagree with that estimate, but I certainly put a lot of time there.

From a commercial perspective, I think that this direction is a mistake. Primarily because of the economies of software purchases. You can read about the implementation of SponsorLink here. The model basically says that it will check whether the individual user has sponsored the project.

That is… not really how it works. Let’s say that a new developer is starting to work on an existing project. It is using a SponsorLink project. What happens then? That new developer is being asked to sponsor the project?

If this is a commercial project, I certainly support the notion that there should be some payment. But it should not be on the individual developer, it should be on the company that pays for the project.

That leaves aside all the scenarios where this is being used for an open source project, etc. Let’s ignore those for now.

The problem is that this isn’t how you actually get paid for software. If you are targeting commercial usage, you should be targeting companies, not individual users. More to the point, let’s say that a developer wants to pay, and their company will compensate them for that.

The process for actually doing that is atrocious beyond belief. There are tax implications (if they sponsor with 5$ / month and their employer gives them a 5$ raise, that would be taxed, for example), so you need to submit a receipt for expenses, etc.

A far better model would be to have a way to get the company to pay for that, maybe on a per project basis. Then you can detect if the project is sponsored, for example, by looking at the repository URL (and accounting for forks).

Note that at this point, we are talking about the actual process of getting money, nothing else about this issue.

Now, let’s get to the reason that this caused much angst for people. The way SponsorLink works is that it fetches your email from the git configuration file and check wether:

  • You are registered as a SponsorLink sponsor
  • You are sponsoring this particular project

It does both checks using what appears to be: base62(sha256(email));

If you are already a SponsorLink sponsor, you have explicitly agreed to sharing your email, so not a problem there. So the second request is perfectly fine.

The real problem is the first check, when you check if you are a SponsorLink sponsor in the first place. Let’s assume that you aren’t, what happens then.

Well, there is a request made that looks something like this:

HEAD /azure-blob-storage/path/app/3uVutV7zDlwv2rwBwfOmm2RXngIwJLPeTO0qHPZQuxyS

The server will return a 404 if you are not a sponsor at this point.

The email hash above is my own, by the way. As I mentioned, I’m not a sponsor, so I assume this will return 404. The question is what sort of information is being provided to the server in this case?

Well, there is the hashed email, right? Is that a privacy concern?

It is indeed. While reversing SHA256 in general is not possible, for something like emails, that is pretty trivial. It took me a few minutes to find an online tool that does just that.

The cost is around 0.00045 USD / email, just to give some context. So the end result is that using SponsorLink will provide the email of the user (without their express or implied consent) to the server. It takes a little bit of extra work, but it most certainly does.

Note that practically speaking, this looks like it hits Azure Blob Storage, not a dedicated endpoint. That means that you can probably define logging to check for the requests and then gather the information from there.  Not sure what you would do with this information, but it certainly looks like this falls under PII definition on the GDPR.

There are a few ways to resolve this problem. The first would be to not use email at all, but instead the project repository URL. That may require a bit more work to resolve forks, but would alleviate most of the concerns regarding privacy. A better option would be to just check for an included file in the repository, to be honest. Something like: .sponsored.projects file.

That would include the ids of the projects that were sponsored by this project, and then you can run a check to see that they are actually sponsored. There is no issue with consent here, since adding that file to the repository will explicitly consent for the process.

Assuming that you want / need to use the emails still, the problem is much more complex. You cannot use the same process as k-Anonymity as you can use for passwords. The problem is that a SHA256 of an email is as good as the email itself.

I think that something like this would work, however. Given the SHA256 of the email, you send to the server the following details:

  • prefix = SHA256(email)[0 .. 6]
  • key = read(“/dev/urandom”, 32bytes)
  • hash = HMAC-SHA256(key, SHA256(email)

The prefix is the first 6 letters of the SHA256 hash. The key has cryptography strength of 32 random bytes.

The hash is taking the SHA256 and hashing it again usung HMAC with the provided key.

The idea is that on the server side, you can load all the hashes that you stored that match the provided prefix. Then you compute the keyed HMAC for all of those values and attempt to check if there is a match.

We are trying to protect against a malicious server here, remember. So the idea is that if there is a match, we pinged the server with an email that it knows about. If we ping the server with an email that it does not know about, on the other hand, it cannot tell you anything about the value.

The first 6 characters of the SHA256 will tell you nothing about the value, after all. And the fact that we use a random key to sending the actual hash to the server means that there is no point trying to figure it out.  Unlike trying to guess an email, guessing a hash of an email is likely far harder, to the point that it is not feasible.

Note, I’m not a cryptography expert, and I wouldn’t actually implement such a thing without consulting with one. I’m just writing a blog post with my ideas.

That would at least alleviate the privacy concern. But there are other issues.

The SponsorLink is provided as a closed-source obfuscated library. People have taken the time to de-obfuscate it, and so far it appears to be matching the documented behavior. But the mere fact that it is actually obfuscated and closed-source inclusion in an open-source project raises a lot of red flags.

Finally, there is the actual behavior when it detects that you are not sponsoring this project. Here is what the blog post states will happen:

A diagnostics warning in VS suggesting you install SponsorLink

It will delay the build (locally, on your machine, not on CI).

That… is really bad. I assume that this happens on every build (not sure, though). If that is the case, that means that the feedback cycle of "write a test, run it, write code, run a test", is going to hit significant slowdowns.

I would consider this to be a breaking point even excluding everything else.

As I previously stated, I’m all for paying for Open Source software. But this is not the way to do that, there is a whole bunch of friction and not much that can indicate a positive outcome for the project.

Monetization strategies for Open Source projects are complex. Open core, for example, likely would not work for this scenario. Nor would you be likely to get support contracts. The critical aspect is that beyond just the technical details, any such strategy requires a whole bunch of infrastructure around it. Marketing, sales, contract negotiation, etc. There is no easy solution here, I’m afraid.

time to read 1 min | 83 words

I’m going to QCon San Francisco and will be teaching a full day workshop where we’ll start from a C compiler and  an empty file and end up with a functional storage engine, indexing and more.

Included in the minimum requirements are implementing transactions, MVCC, persistent data structures, and indexes.

The workshop is going to be loosely based on the book, but I’m going to condense things so we can cover this topic in a single day.

Looking forward to seeing you there.

time to read 9 min | 1711 words

In my previous post I discussed how we could store the exact same information in several ways, leading to space savings of 66%! That leads to interesting questions with regard to actually making use of this technique in the real world.

The reason I posted about this topic is that we just gained a very significant reduction in memory (and we obviously care about reducing resource usage). The question is whether this is something that you want to do in general.

Let’s look at that in detail. For this technique to be useful, you should be using structs in the first place. That is… not quite true, actually. Let’s take a look at the following declarations:

We define the same shape twice. Once as a class and once as a structure. How does this look in memory?

Typelayoutfor'PersonClass'Size:32bytes.Paddings:2bytes%12ofObjectHeader8bytesMethodTablePtr8bytes1619:Int32Id4bytes2021:UInt16Kids2bytes2223:padding2bytes2432:DateTimeBirthday8bytesemptyspaceTypelayoutfor'PersonStruct'Size:24bytes.Paddings:10bytes%41of03:Int32Id4bytes47:padding4bytes815:DateTimeBirthday8bytes1617:UInt16Kids2bytes1823:padding6bytesemptyspace

Here you can find some really interesting differences. The struct is smaller than the class, but the amount of wasted space is much higher in the struct. What is the reason for that?

The class needs to carry 16 bytes of metadata. That is the object header and the pointer to the method table. You can read more about the topic here. So the memory overhead for a class is 16 bytes at a minimum. But look at the rest of it.

You can see that the layout in memory of the fields is different in the class versus the structure. C# is free to re-order the fields to reduce the padding and get better memory utilization for classes, but I would need [StructLayout(LayoutKind.Auto)] to do the same for structures.

The difference between the two options can be quite high, as you can imagine. Note that automatically laying out the fields in this manner means that you’re effectively declaring that the memory layout is an implementation detail. This means that you cannot persist it, send it to native code, etc. Basically, the internal layout may change at any time.  Classes in C# are obviously not meant for you to poke into their internals, and LayoutKind.Auto comes with an explicit warning about its behavior.

Interestingly enough, [StructLayout] will work on classes, you can use to force LayoutKind.Sequential on a class. That is by design, because you may need to pass a part of your class to unmanaged code, so you have the ability to control memory explicitly. (Did I mention that I love C#?)

Going back to the original question, why would you want to go into this trouble? As we just saw, if you are using classes (which you are likely to default to), you already benefit from the automatic layout of fields in memory. If you are using structs, you can enable LayoutKind.Auto to get the same behavior.

This technique is for the 1% of the cases where that is not sufficient, when you can see that your memory usage is high and you can benefit greatly from manually doing something about it.

That leads to the follow-up question, if we go about implementing this, what is the overhead over time? If I want to add a new field to an optimized struct, I need to be able to understand how it is laid out in memory, etc.

Like any optimization, you need to maintain that. Here is a recent example from RavenDB.

image

In this case, we used to have an optimization that had a meaningful impact. The .NET code changed, and the optimization now no longer makes sense, so we reverted that to get even better perf.

At those levels, you don’t get to rest on your laurels. You have to keep checking your assumptions.

If you got to the point where you are manually optimizing memory layouts for better performance, there are two options:

  • You are doing that for fun, no meaningful impact on your system over time if this degrades.
  • There is an actual need for this, so you’ll need to invest the effort in regular maintenance.

You can make that easier by adding tests to verify those assumptions. For example, verifying the amount of padding in structs match expectation. A simple test that would verify the size of a struct would mean that any changes to that are explicit. You’ll need to modify the test as well, and presumably that is easier to catch / review / figure out than just adding a field and not noticing the impact.

In short, this isn’t a generally applicable pattern. This is a technique that is meant to be applied in case of true need, where you’ll happily accept the additional maintenance overhead for better performance and reduced resource usage.

time to read 35 min | 6841 words

Consider a warehouse that needs to keep track of items. For the purpose of discussion, we have quite a few fields that we need to keep track of. Here is how this looks like in code:

And the actual Warehouse class looks like this:

The idea is that this is simply a wrapper to the list of items. We use a struct to make sure that we have good locality, etc.

The question is, what is the cost of this? Let’s say that we have a million items in the warehouse. That would be over 137MB of memory. In fact, a single struct instance is going to consume a total of 144 bytes.

That is… a big struct, I have to admit. Using ObjectLayoutInspector I was able to get the details on what exactly is going on:

Type layout for 'WarehouseItem'
    Size: 144 bytes. Paddings: 62 bytes (%43 of empty space)
    07:Int64ticks8bytes07:UInt64dateData8bytes07:UInt64dateData8bytes07:UInt64dateData8bytes015:Nullable`1ProductDimensions16bytes0:BooleanhasValue1byte13:padding3bytes415:Dimensionsvalue12bytes03:SingleLength4bytes47:SingleWidth4bytes811:SingleHeight4bytes1631:Nullable`1ExternalSku16bytes0:BooleanhasValue1byte17:padding7bytes815:Int64value8bytes3247:Nullable`1ShelfLife16bytes0:BooleanhasValue1byte17:padding7bytes815:TimeSpanvalue8bytes4855:Nullable`1AlcoholContent8bytes0:BooleanhasValue1byte13:padding3bytes47:Singlevalue4bytes5671:Nullable`1ProductionDate16bytes0:BooleanhasValue1byte17:padding7bytes815:DateTimevalue8bytes7279:Nullable`1RgbColor8bytes0:BooleanhasValue1byte13:padding3bytes47:Int32value4bytes8081:Nullable`1IsHazardous2bytes0:BooleanhasValue1byte1:Booleanvalue1byte8283:padding2bytes8491:Nullable`1Weight8bytes0:BooleanhasValue1byte13:padding3bytes47:Singlevalue4bytes9299:Nullable`1Quantity8bytes0:BooleanhasValue1byte13:padding3bytes47:Int32value4bytes100103:padding4bytes104119:Nullable`1ArrivalDate16bytes0:BooleanhasValue1byte17:padding7bytes815:DateTimevalue8bytes120121:Nullable`1Fragile2bytes0:BooleanhasValue1byte1:Booleanvalue1byte122127:padding6bytes128143:Nullable`1LastStockCheckDate16bytes0:BooleanhasValue1byte17:padding7bytes815:DateTimevalue8bytes

As you can see, there is a huge amount of wasted space here. Most of which is because of the nullability. That injects an additional byte, and padding and layout issues really explode the size of the struct.

Here is an alternative layout, which conveys the same information, much more compactly. The idea is that instead of having a full byte for each nullable field (with the impact on padding, etc), we’ll have a single bitmap for all nullable fields. Here is how this looks like:

If we look deeper into this, we’ll see that this saved a lot, the struct size is now 96 bytes in size. It’s a massive space-savings, but…

Type layout for 'WarehouseItem'
Size: 96 bytes. Paddings: 24 bytes (%25 of empty space)

We still have a lot of wasted space. This is because we haven’t organized the struct to eliminate padding. Let’s reorganize the structs fields to see what we can achieve. The only change I did was re-arrange the fields, and we have:

And the struct layout is now:

Typelayoutfor'WarehouseItem'Size:72bytes.Paddings:0bytes%0ofemptyspace011:DimensionsProductDimensions12bytes03:SingleLength4bytes47:SingleWidth4bytes811:SingleHeight4bytes1215:SingleAlcoholContent4bytes1623:Int64ExternalSku8bytes2431:TimeSpanShelfLife8bytes3239:DateTimeProductionDate8bytes4047:DateTimeArrivalDate8bytes4855:DateTimeLastStockCheckDate8bytes5659:SingleWeight4bytes6063:Int32Quantity4bytes6467:Int32RgbColor4bytes68:BooleanFragile1byte69:BooleanIsHazardous1byte7071:UInt16nullability2bytes

We have no wasted space, and we are 50% of the previous size.

We can actually do better, note that Fragile and IsHazarous are Booleans, and we have some free bits on _nullability that we can repurpose.

For that matter, RgbColor only needs 24 bits, not 32. Do we need alcohol content to be a float, or can we use a byte? If that is the case, can we shove both of them together into the same 4 bytes?

For dates, can we use DateOnly instead of DateTime? What about ShelfLife, can we measure that in hours and use a short for that (giving us a maximum of 7 years)?

After all of that, we end up with the following structure:

And with the following layout:

03:Int32dayNumber4bytes03:Int32dayNumber4bytes03:Int32dayNumber4bytesTypelayoutfor'WarehouseItem'Size:48bytes.Paddings:0bytes%0ofemptyspace011:DimensionsProductDimensions12bytes03:SingleLength4bytes47:SingleWidth4bytes811:SingleHeight4bytes1215:SingleWeight4bytes1623:Int64ExternalSku8bytes2427:DateOnlyProductionDate4bytes2831:DateOnlyArrivalDate4bytes3235:DateOnlyLastStockCheckDate4bytes3639:Int32Quantity4bytes4043:Int32rgbColorAndAlcoholContentBacking4bytes4445:UInt16nullability2bytes4647:UInt16ShelfLifeInHours2bytes

In other words, we are now packing everything into  48 bytes, which means that we are one-third of the initial cost. Still representing the same data. Our previous Warehouse class? It used to take 137MB for a million items, it would now take 45.7 MB only.

In RavenDB’s case, we had the following:

That is the backing store of the dictionary, and as you can see, it isn’t a nice one. Using similar techniques we are able to massively reduce the amount of storage that is required to process indexing.

Here is what this same scenario looks like now:

But we aren’t done yet , there is still more that we can do.

time to read 3 min | 417 words

A customer called us, quite upset, because their RavenDB cluster was failing every few minutes. That was weird, because they were running on our cloud offering, so we had full access to the metrics, and we saw absolutely no problem on our end.

During the call, it turned out that every now and then, but almost always immediately after a new deployment, RavenDB would fail some requests. On a fairly consistent basis, we could see two failures and a retry that was finally successful.

Okay, so at least there is no user visible impact, but this was still super strange to see. On the backend, we couldn’t see any reason why we would get those sort of errors.

Looking at the failure stack, we narrowed things down to an async operation that was invoked via DataDog. Our suspicions were focused on this being an error in the async machinery customization that DataDog uses for adding non-invasive monitoring.

We created a custom build for the user that they could test and waited to get the results from their environment. Trying to reproduce this locally using DataDog integration didn’t raise any flags.

The good thing was that we did find a smoking gun, a violation of the natural order and invariant breaking behavior.

The not so good news was that it was in our own code. At least that meant that we could fix this.

Let’s see if I can explain what is going on. The customer was using a custom configuration: FastestNode. This is used to find the nearest / least loaded node in the cluster and operate from it.

How does RavenDB know which is the fastest node? That is kind of hard to answer, after all. It checks.

Every now and then, RavenDB replicates a read request to all nodes in the cluster. Something like this:

The idea is that we send the request to all the nodes, and wait for the first one to arrive. Since this is the same request, all servers will do the same amount of work, and we’ll find the fastest node from our perspective.

Did you notice the cancellation token in there? When we return from this function, we cancel the existing requests. Here is what this looks like from the monitoring perspective:

image

This looks exactly like every few minutes, we have a couple of failures (and failover) in the system and was quite confusing until we figured out exactly what was going on.

time to read 3 min | 541 words

We got a support call from a client, in the early hours of the morning, they were getting out-of-memory errors from their database and were understandably perturbed by that. They are running on a cloud system, so the first inclination of the admin when seeing the problem was deploying the server on a bigger instance, to at least get things running while they investigate. Doubling and then quadrupling the amount of memory that the system has had no impact. A few minutes after the system booted, it would raise an error about running out of memory.

Except that it wasn’t actually running out of memory. A scenario like that, when we give more memory to the system and still have out-of-memory errors can indicate a leak or unbounded process of some kind. That wasn’t the case here. In all system configurations (including the original one), there was plenty of additional memory in the system. Something else was going on.

When our support engineer looked at the actual details of the problem, it was quite puzzling. It looked something like this:

System.OutOfMemoryException: ENOMEM on Failed to munmap at Sparrow.Server.Platform.Posix.Syscall.munmap(IntPtr start, UIntPtr length);

That error made absolutely no sense, as you can imagine. We are trying to release memory, not allocate it. Common sense says that you can’t really fail when you are freeing memory. After all, how can you run out of memory? I’m trying to give you some, damn it!

It turns out that this model is too simplistic. You can actually run out of memory when trying to release it. The issue is that it isn’t you that is running out of memory, but the kernel. Here we are talking specifically about the Linux kernel, and how it works.

Obviously a very important aspect of the job of the kernel is managing the system memory, and to do that, the kernel itself needs memory. For managing the system memory, the kernel uses something called VMA (virtual memory area). Each VMA has its own permissions and attributes. In general, you never need to be aware of this detail.

However, there are certain pathological cases, where you need to set up different permissions and behaviors on a lot of memory areas. In the case we ran into, RavenDB was using an encrypted database. When running on an encrypted database, RavenDB ensures that all plain text data is written to memory that is locked (cannot be stored on disk / swapped out).

A side effect of that is that this means that for every piece of memory that we lock, the kernel needs to create its own VMA. Since each of them is operated on independently of the others. The kernel is using VMAs to manage its own map of the memory. and eventually, the number of the items in the map exceeds the configured value.

In this case, the munmap call released a portion of the memory back, which means that the kernel needs to split the VMA to separate pieces. But the number of items is limited, this is controlled by the vm.max_map_count value.

The default is typically 65530, but database systems often require a lot more of those. The default value is conservative, mind.

Adjusting the configuration would alleviate this problem, since that will give us sufficient space to operate normally.

time to read 4 min | 631 words

On its face, we have a simple requirement:

  • Generate sequential numbers
  • Ensure that there can be no gaps
  • Do that in a distributed manner

Generating the next number in the sequence is literally as simple as ++, so surely that is a trivial task, right?

The problem is with the second requirement. The need to ensure that there are no gaps comes often when dealing with things like invoices. The tax authorities are really keen on “show me all your invoices”, and if there are gaps in the numbers, you may have to provide Good Answers.

You may think that the third one, running in a distributed environment, is the tough challenge, but that isn’t actually the case. If we are running in a single location, that is fairly easy. Run the invoice id generation as a transaction, and you are done. But the normal methods of doing that are usually wrong in edge cases.

Let’s assume that we use an Oracle database, which uses the following mechanism to generate the new invoice id:

invoice_seq.NEXTVAL

Or SQL Server with an identity column:

CREATE TABLE invoices ( invoice_id INT IDENTITY(1,1) PRIMARY KEY, ... );

In both cases, we may insert a new value to the invoices table, consuming an invoice id. At some later point in time, we may roll back the transaction. Care to guess what happens then?

You have INVOICE #1000 and INVOICE #1002, but nothing in between. In fact, no way to even tell what happened, usually.

In other words, identity, sequence, serial, or autonumber – regardless of what database platform you use, are not suitable for generating gapless numbers.

The reasoning is quite simple. Assume that you have two concurrent transactions, which generate two new invoices at roughly the same time. You commit the later one before the first one, and roll back the first. You now have:

  • Invoice #1
  • Invoice #2
  • Invoice #1000
  • Invoice #1002

However, you don’t have Invoice #1001, and you cannot roll back the sequence value there, because if you do so, it will re-generate the #1002 on the next call.

Instead, for gapless numbers, we need to create this as a dedicated part of the transaction. So there would be a record in our system that contains the NextInvoiceId, which will be incremented as part of the new invoice creation.

In order to ensure that there are no gaps, you need to ensure that the NextInvoideId record increment is handled as a user operation, not a database operation. In other words, in SQL Server, that is a row in a table, that you manually increment as part of adding a new invoice. Here is what this will look like:

As you can see, we increment the row directly. So it will be rolledback as well.

The downside here is that we can no longer create two invoices concurrently. The second transaction would have to wait on the lock on the row in the next_gapless_ids table.

All of that happens inside a single database server. What happens when we are running in a distributed environment?

The answer in this case, is the exact same thing. You need a transaction, a distributed one, using a consensus algorithm. Here is how you can achieve this using RavenDB’s cluster wide transactions, which use the Raft protocol behind the scenes:

The idea is simple, we have a transaction that modifies the gapless ids document and creates a new invoice at the same time. We have to handle a concurrency exception if two transactions try to create a new invoice at the same time (because they both want to use the same invoice id value), but in essence this is pretty much exactly the same behavior as when you are running on a single node.

In other words, to ensure the right behavior, you need to use a transaction. And if you need a distributed transaction, that is just a flag away with RavenDB.

time to read 2 min | 356 words

After this saga, I wanted to close the series with some numbers about the impact of this algorithm.

If you’ll recall, I started this whole series discussing variable-sized integers. I was using this list of numbers to compare the values. There are 44,956 items in the list.

Algorithm Size
Raw 359.648
Varint 224,780
Delta + Varint    45,970
Delta + Varint + Gzip             5,273
FastPFor      24,717

You can see some interesting details about the various options. Delta + Varint + Gzip is a really good setup, if you can assume that the output pattern has a high degree of repetition. In some cases, that is possible, but that isn’t a generic property. Aside from that, FastPFor is winning hands down in terms of the output size of the data.

There is also another important aspect. The size of the output here is too big to fit in a single 8KB buffer, so I’ll need to use multiple for that. This is not something that you can really do with GZip. Here is the cost across multiple buffers:

This particular list would fit into 4 pages:

  • 14,080 entries at 8,048 bytes
  • 14,080 entries at 8,063 bytes
  • 15,616 entries at 8,030 bytes
  •   1,180 entries at    644 bytes

But the compression ratio is only part of that. Let’s talk about the performance aspect. On my machine, you can run the encoding process in 1,115 ticks and decoding in just 462 ticks.

To give some context, that means that you can do the encoding at a rate of ~400,000,000 / second and decoding at a rate of about 1 billion per second.

My perf team also asked me to mention that they haven’t gotten the chance to look at the code yet, so things are likely to improve.

The entire premise of FastPFor inside of RavenDB relies on these fast decoding & encoding times. It makes it super cheap to iterate over a vast amount of numbers, and the B+Tree layout we have means that updating a posting list’s page is trivial. Read it, merge, and encode it back. It’s fast enough that there is really no other place for meaningful optimizations / complexity.

time to read 6 min | 1029 words

In the previous post, I discussed FastPFor encoding, now I’m going to focus on how we deal with decoding. Here is the decode struct:

image

Note that this is a struct for performance reasons. We expect that we’ll have a lot of usage here, so saving the allocation here ends up paying high dividends.  Before we’ll start, I want to pay special attention to the fields on the decoder:

Of particular interest is the _exceptionOffsets array. If you aren’t familiar with it, this is a fixed-size array on the struct.

Here is the constructor for the decoder:

We are being provided with the encoded buffer and its size. What is actually happening here?

We start by allocating memory for the exceptions buffer, then scan the exceptions bitmap and extract the exceptions data to the buffer. We use a single buffer for all of the exceptions, and the _exceptionsOffsets is an indication of where each bit width exception is currently at.

Finally, we set the _prev to be a vector of the baseline. That is the counterpoint to how we dealt with the values during the encoding. Like the encoding process, we have to deal with three states:

  • Big deltas (for the next block)
  • Variable integer (at the end)
  • Packed block of 256 numbers

We’ll start with dealing with the first two options:

If we have a big delta, we record that in the bigDeltaOffsets. Note that we are playing an annoying game here. I’m actually storing the metadata position in the delta offsets. I can cast it to long and then to pointer because I’m ensured that the data is fixed during the operation.

For the varint scenario, the last item in the batch, we don’t have to do much, we compute the running total of the values as we read them from the input, that’s about it.

Things get a lot more interesting when we start dealing with the actual compressed blocks:

Here we simply decode the values to the buffer, then we need to apply the exceptions. You can see that we saved some space by not storing exceptions of 1 bit (since we know what the value would be). For exceptions of different sizes, see how we consume the exception from _exceptionOffsets for each used exception. I’m using a ref variable here, so the offset++ operation will increment the value in the array. That ensures that we have to keep very little state in the decoding process as it moves forward. Remember that we require that the output buffer for the decoded numbers be at least 256 values, to ensure that we can fit them all in. That doesn’t mean that we have enough space to fit everything. So we may be called multiple times and need to resume from where we left off.

Finally, we set the expectedBufferIndex if we have a big delta offset. We’ll shortly see what we’ll do with this.

Remember that at this point, the buffer we use has int32 deltas in it, not the actual final numbers. In order to get the final values, we need to compute the sum of the deltas, this is handled here:

What do we do here? We load a vector of 8 integers (32 bits) and widen that to two vectors of 4 integers (64 bits) each.

We then check whether this is the expected buffer index for big delta offsets, if so, we’ll or the high parts of the number back to the vector. This is handled here:

There are quite a few things that are happening here all at once. We first read the current delta offset from the metadata, and read 16 bytes into a Vector128. We then translate that vector to a vector of 256 bits. That would basically add zeros to the vector. We set the next index to check and then we shuffle the vector on numbers to arrange it on the high parts of the vector as longs.

Let’s say that we had the numbers [1,2,3,4] in the metadata, we read them into the Vector128, like so:

Vector128 = [1,2,3,4]

We then extended that to a Vector256, like so:

Vector256 = [1,2,3,4,0,0,0,0]

Finally, we shuffle the values, (in this case, 7 is known to be zero) so we have:

highBitsDelta = [0,1,0,2,0,3,0,4]

We convert that to a vector of longs (with the same bits pattern) and then or that with the values from the packed deltas.

We have to do the expectedBufferIndex twice on each iteration, but we expect that to be rare, so the branch predictor is going to be really good in predicting this.

Finally, we have the part where we compute the prefix sum, here:

This looks like black magic, but let’s break it apart into individual pieces. 

Let’s assume that we started out with [25,27,30, 35] – using our approach, we have Baseline: 21 and the deltas: [4,2,3,5].

We start with prev being set to the baseline value on all elements in the vector:

prev = [21,21,21,21]

And cur is set to deltas:

cur = [4,2,3,5]

On line 5, we shuffle and BitwiseAnd the value with a mask, let’s see what we get:

Line 5 – shuffled: [4,4,2,3]

LIne 5- masked: [0,4,2,3]

We add that to cur, giving us:

cur = [4 + 0, 4 + 2, 3 + 2, 5 + 3] = [4,6,5,8]

On line 7, we do another shuffle & mask:

Line 7 – shuffled: [4,4,4,6]

Line 7 – masked: [0,0,4,6]

And adding that to cur, will give us:

cur = [4+0, 6+ 0, 5 + 4, 8+ 6] = [4,6,9,14]

We now add the prev vector, giving us:

cur = [4 + 21, 6 + 21, 9 + 21, 14 + 21] = [25, 27, 30, 35]

We computed the sum again, hurray!

We then set all elements of prev to the last value of cur:

prev = [35,35,35,35]

And now we are ready for the next element to compute.

And… this is pretty much it, to be honest. We keep reading until we are out of space in the buffer or consumed all the metadata. We are able to decode the data really fast, and it only took a 10 parts (so far) series of blog posts to explain.

In the next post, I’m going to discuss what we can see from actually using this with real numbers.

FUTURE POSTS

No future posts left, oh my!

RECENT SERIES

  1. Production postmorterm (2):
    11 Jun 2025 - The rookie server's untimely promotion
  2. Webinar (7):
    05 Jun 2025 - Think inside the database
  3. Recording (16):
    29 May 2025 - RavenDB's Upcoming Optimizations Deep Dive
  4. RavenDB News (2):
    02 May 2025 - May 2025
  5. Production Postmortem (52):
    07 Apr 2025 - The race condition in the interlock
View all series

Syndication

Main feed Feed Stats
Comments feed   Comments Feed Stats
}