Oren Eini

CEO of RavenDB

a NoSQL Open Source Document Database

time to read 3 min | 415 words

A customer called us, complaining that RavenDB doesn't support internationalization. That was a big term to unpack, but it boiled down to a simple issue: they were using Hebrew text in their system, consuming RavenDB from a Node.js client, and they observed that sometimes RavenDB would corrupt the data.

They would get JSON similar to this:

{ "Status": "�", "Logged": true }

That… is not good. And also quite strange. I'm a native Hebrew speaker, so I have thrown plenty of Hebrew text at RavenDB over the years. In fact, one of our employees built a library project for biblical texts, naturally all in Hebrew. Another employee maintained a set of Lucene analyzers for Hebrew. I think I can safely say that RavenDB and Hebrew have been thoroughly exercised together. But the problem persisted. What was worse, it was not consistent. Every time we tried to see what was going on, it worked.

We added code inside of RavenDB to try to detect what was going on, and there was nothing there. Eventually we looked into the Node.js RavenDB client, because we had exhausted everything else. It looked okay, and in our tests, it… worked.

So we sat down and thought about what it could be. Let’s consider the actual scenario we have on hand:

  • Hebrew characters in JSON are being corrupted.
  • RavenDB uses UTF-8 encoding exclusively.
  • That means that Hebrew characters are encoded as multi-byte sequences.

That line of thinking led me to consider that the problem is related to chunking. We read from the network in chunks, and if a chunk boundary happened to fall in the middle of a multi-byte character, maybe we would mangle it?

Once I started looking into this, the fix was obvious:

[image: the fix]

Here we go!

This bug is a great example of how a problem can fail to show up in practice for a really long time. In order to hit it, you need chunking to happen in just the wrong place, and if you are running locally (as we usually do when troubleshooting), the likelihood that you'll see it is far lower. Given that most JSON property names and values are in the ASCII set, you need a chunk of just the right size to see it. Once we knew about it, reproducing it was easy: just create a single string full of multi-byte characters (such as emoji) and make it long enough that it must be chunked.
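
To make the failure mode concrete, here is a minimal C# sketch (the actual fix was in the Node.js client, so this is only an illustration of the general idea): decoding each chunk on its own mangles a character that straddles the chunk boundary, while a stateful decoder carries the partial byte sequence over to the next chunk.

using System;
using System.Text;

byte[] bytes = Encoding.UTF8.GetBytes("שלום"); // 4 Hebrew characters, 8 bytes in UTF-8
int split = 3;                                 // a "chunk boundary" in the middle of a character

// Naive: decode each chunk independently - the split character turns into '�'
string broken = Encoding.UTF8.GetString(bytes, 0, split) +
                Encoding.UTF8.GetString(bytes, split, bytes.Length - split);

// Stateful: a Decoder keeps the trailing bytes of an incomplete sequence
// and prepends them to the next chunk
var decoder = Encoding.UTF8.GetDecoder();
var chars = new char[bytes.Length];
int n = decoder.GetChars(bytes, 0, split, chars, 0, flush: false);
n += decoder.GetChars(bytes, split, bytes.Length - split, chars, n, flush: false);

Console.WriteLine(broken);                  // replacement characters where the split landed
Console.WriteLine(new string(chars, 0, n)); // שלום, intact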

The fix was already merged and released.

time to read 5 min | 817 words

I often end up being pulled into various debugging sessions. I find that having a structured approach to debugging can help you figure out what is going on and fix even the nastiest of bugs. This is a story about a few different bugs that I ran into.

We have some new features coming in RavenDB 5.4 which touch the manner in which we manage some data. When we threw enough data at RavenDB, it died. Quite rudely, I have to say. I described the actual bug here, if you care. There wasn’t really much to finding this one.

The process crashed, and we didn't know why. A crashing process will usually give us good signs about what exactly is happening. In this case, attaching a debugger, running the code, and seeing where it failed was sufficient. I'll admit that I mostly pressed F10 and F11 until I found the offending function, and then I didn't believe it for some time.

Still in the same area, however, we began to run into a problem with impossible results. Invariants were broken, quite harshly. The problem was that this was something we could only reproduce after ~30 minutes or so of running a big benchmark. This does not make it easy to figure out what is going on. What is worse, there is a distinct gap between the time that the bug showed up and when it actually happened.

That made it very hard to figure out what is going on. This is where I think the structured approach shines.

Architecture – I hate concurrency debugging. I have roughly 20+ years of experience writing parallel programs, and it is still not enough to make debugging concurrency painless, I'm afraid. As a result, the design of RavenDB goes to great lengths to avoid having to do concurrency coordination. There are very well defined locations where we apply concurrency (handling requests) and places where we operate in a serial manner (modifying data). The key is that we are using multiple serial processes at the same time. Each index is bound to a thread, for example, but the indexes are independent of one another. The error in question was in an index, and I was able to narrow it down to a particular area of the code. Somewhere along the way, we messed things up.

Reproducing – I had a way to reproduce the issue, it was just utterly unhelpful for debugging purposes. But since the process is a serial one, it is also fairly consistent. So I added a bunch of printf() calls that logged the interesting operations we performed:

[image: the logging code]

And here is what this looked like:

[image: the logged operations]

Each line is a particular command that I’m executing. I wrote a trivial parser that would read from the file and perform the relevant operations.

For ease of explanation, imagine that I’m writing a hash table, and this set of operations prime it to expose the issue.

The file contained hundreds of millions of operations. Somewhere toward the end, we ran into the corrupted state. That is too much to debug, but I don't need to. I can write code to verify the state of the data structure.

There is one problem here: the cost. Verifying the data structure is an operation that is O(N log N + N) at least (probably O(N^2), to be honest; I didn't bother to run the proper analysis). I can't run it on each command. The good news is that I don't need to.

Here is what I did:
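
The snippet itself was shown as an image in the original post; a minimal sketch of the idea (the file name, command format, and helper names below are illustrative, not the actual Voron code) looks roughly like this:

using System.IO;

long startCheckingAt = 0;    // raised after every bisection round
long checkEvery = 100_000;   // tightened after every bisection round

long opNo = 0;
foreach (var command in File.ReadLines("operations.log"))
{
    opNo++;
    Replay(command);

    if (opNo >= startCheckingAt && opNo % checkEvery == 0)
        VerifyFullState(opNo);
}

// Stand-ins for the real operations:
void Replay(string command) { /* parse the logged command and apply it to the data structure */ }
void VerifyFullState(long at) { /* full O(N log N) consistency check; throw if an invariant is broken */ }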

That did a full verification every 100,000 operations, which meant I could be confident about roughly where the error was. Then I just repeated the process, raising the minimum start for the check and reducing the gap between checks.

This is a pretty boring part, to be honest. You run the code, do some emails, check it again, update a couple of numbers, and move on. This bisection approach yields great results, however. It means that I could very quickly narrow down to the specific action that caused the error.

That was in step 48,292,932, by the way. The actual error was discovered far later.

Fixing – that part I'm going to skip; it is usually fairly obvious once you know what is going on. In the case above, there was some scenario where we were off by 4 bytes, and that caused… havoc.

The key here is that for most of the debugging process, I don’t really need to do any sort of thinking. I figure out what information I need to gather, then I write the code that would do that for me.

And then I basically kept running things until I had narrowed the field down to the point where the problem was clear and I could fix it.

time to read 2 min | 249 words

Yesterday I presented a bug that killed the process in a particularly rude manner. This is a recursive function that guards against stack overflows using RuntimeHelpers.EnsureSufficientExecutionStack().

Because of how this function kills the process, it took some time to figure out what was going on. There was no StackOverflowException, just an exit code. Here is the relevant code:

This looks okay: we optimize for zero allocations on the common path (less than 2K items), but also handle the big ones.

The problem is that our math is wrong. More specifically, take a look at this line:

var sizeInBytes = o.Count / (sizeof(byte) * 8) + o.Count % (sizeof(byte) * 8) == 0 ? 0 : 1;

Let’s assume that your count is 10, what do you think the value of this is going to be?

Well, it looks like this should give us 2, right?

10 / 8 + 10 % 8 == 0 ? 0 : 1

The problem is in the operator precedence. I read this as:

(10 / 8) + (10 % 8 == 0 ? 0 : 1)

And the C# compiler read it as:

(10 / 8 + 10 % 8) == 0 ? 0 : 1

In other words, *#@*!*@!.

The end result is that we overwrite past our allocated stack. Usually that doesn’t do anything bad, since there is enough stack space. But sometimes, if the stack is aligned just right, we cross into the stack guard page and kill the process.

Oops, that was not expected.
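
The fix, presumably, is just to make the intended grouping explicit:

// Divide, then add one extra byte only when there is a remainder.
var sizeInBytes = o.Count / (sizeof(byte) * 8)
                + (o.Count % (sizeof(byte) * 8) == 0 ? 0 : 1);

Or sidestep the precedence question entirely with the usual rounding-up idiom: (o.Count + 7) / 8.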

time to read 1 min | 171 words

The following code is something that we ran into yesterday. Under some conditions, this code will fail with a stack overflow. More specifically, the process crashes and the return code is –1073740791 (or as it is usually presented: 0xC0000409).

At this point in my career I can look at that error code and just recall that this is the Windows status code for a stack overflow; to be more precise, it is STATUS_STACK_BUFFER_OVERRUN.

That… makes sense, I guess, this is recursive code, after all. Let's take a look:

Except that this code explicitly protects against this. Note the call to:

RuntimeHelpers.EnsureSufficientExecutionStack();

In other words, if we are about to run out of stack space, we ask the .NET framework to throw (just before we actually run out, basically).

This code doesn't fail often, and when we tried to push a deeply nested structure through it, we got an InsufficientExecutionStackException thrown.

Sometimes, however, when we run this code with a relatively flat structure (2 – 4 levels), it will just die with this error.

Can you spot the bug?

time to read 3 min | 592 words

A customer called the support hotline with a serious problem. They had a large database and wanted to add another replica node to it. This is a fairly standard thing to do, of course. The problem was that somewhere around the 70% mark, the replication process stalled. All the metrics were green, the mentor node and the new node had perfect connectivity, and there were no errors in the logs.

Typical reasons for a replication stall usually involve connectivity issues, but in this case there was no sign of that. In fact, the mentor node kept sending (empty) batches to the destination node. That shouldn't be the case, however: if we have nothing to send, there shouldn't be a batch sent over the wire. That was the only hint of something wrong.

We also looked into what information RavenDB could tell us about the system, and noticed that we have a performance hint about big documents. Some of them exceeded 32MB in size, which is… quite a lot.  That doesn’t really relate so much to replication, however. It would surely slow it down, but it should work.

Looking into the logs, we could see that the mentor node was attempting to send a batch, but it was sending zero documents. Digging deeper, we saw an entry about skipping documents, which was… strange. Cross-referencing the log statement with the source code revealed that RavenDB had decided it was sending too much in the batch and aborted it. But… it wasn't sending anything in the batch.

What is actually going on is that the database in question is an encrypted one. Encrypted databases in RavenDB are encrypted both on disk and in memory. The only time that we decrypt a document is when there is an active transaction reading it. During that time, we hold it in locked memory (so it won't be paged to disk). As a result, we try to limit the size of transactions in encrypted databases. When we replicate data between nodes, we open a read transaction on the source node, read the documents that we need to replicate and send them to the other side.

There is a small caveat here: each node in an encrypted database can use a different encryption key, so we aren't sending the encrypted data, but the plain text. Of course, the communication itself is encrypted, so nothing can peek into the data in the middle.

By default, we'll stop a replication batch in an encrypted database after we have locked more than 64 MB of memory. A replication batch of 64 MB is plenty big enough, after all. However… we didn't take into account a scenario where a single document may cause us to consume more than 64 MB of locked memory. And we put the check to close the replication batch very early in the process.

The sequence of operations was then:

  • Start a replication batch
  • Load the first document to send
  • Realize that we locked too much memory and close the batch
  • Send a zero length batch

Rinse and repeat, since we can’t make any forward progress.

The actual solution was to set the “Replication.MaxSizeToSendInMb” configuration option to a higher value, enough to send even the biggest documents the customer has. At that point, there was forward progress again in the system and the replication was completed successfully.

We still consider this a bug, and we’ll fix it so there won’t be a hang in the system, but I’m happy to see that we were able to do a configuration change and get everything up to speed so quickly.

time to read 4 min | 759 words

A typical production postmortem story is a tale of daring dives deep into the guts of your system. It is a journey into the intricacies of dependencies between multiple components, the delicate balance of distributed processes that got just the wrong level of alignment to cause some havoc. A production postmortem is a toil of mystery that can last for weeks. This isn't one of those tales, however. In this one, the entire thing was wrapped up within fifteen minutes. So what was the issue?

The initial premise was pretty straightforward. A customer was running RavenDB in production, but due to their topology, their RavenDB instances are not exposed to the outside world directly. Instead, they route the connection through Azure Web Application Firewall and Azure Front Door. I have no comment on the actual decision to route through those firewalls. The problem the customer had was that Azure Front Door doesn't support WebSockets, and the RavenDB Studio makes extensive use of them for a bunch of reasons; certain features also depend on them (such as aggressive caching, the Changes() API, etc.).

The customer wanted everything to work, and asked if RavenDB can support a long polling method, to avoid the issue entirely.

This is an XY Problem.

There was much confusion to be had between our support team, yours truly, and the customer's technical people. Here is the issue: the problem the customer experienced is simply not possible. There is absolutely no way that they could run into this issue.

Here is the deal:

  • RavenDB is a secured-by-default database, which assumes that it is always running in a hostile environment.
  • For security, RavenDB uses TLS 1.2 or higher to safeguard the data in transit.
  • For authentication, RavenDB uses mutual authentication for both client & server using X509 certificates.

Take those three together and you’ll realize that the very design of RavenDB forces you to do SSL termination (here I’m using TLS & SSL as interchangeable terms) at the RavenDB process directly. We have to do it in this manner, since otherwise we wouldn’t be able to validate the certificate from the client.
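
For reference, this is roughly what the client side of that mutual authentication looks like with the RavenDB .NET client (a sketch; the URL, database name, and certificate path are placeholders):

using System.Security.Cryptography.X509Certificates;
using Raven.Client.Documents;

// The client presents its X509 certificate during the TLS handshake, and that
// handshake has to terminate at the RavenDB process itself - anything that
// terminated SSL in the middle could not pass the client certificate through.
using var store = new DocumentStore
{
    Urls = new[] { "https://a.cluster.example.com" },             // placeholder URL
    Database = "Orders",                                          // placeholder database
    Certificate = new X509Certificate2("client.pfx", "p@ssw0rd")  // placeholder certificate
};
store.Initialize();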

The customer in this case was running in a secured mode, but was completely unable to use web sockets.

Again, that is not possible. Let me explain why.

If RavenDB is the entity that does SSL termination (in this case, doing the cryptographic handshake, authentication, etc) then anything in the middle between RavenDB and the client is dealing with an encrypted stream of bytes that are indistinguishable from random noise.

In other words, there shouldn’t be a way to not support web sockets, since any proxy in the middle shouldn’t be able to tell what the content of the request is.

This design by RavenDB also prevents you from forwarding requests, since the SSL stream must reach RavenDB directly (as-is). Otherwise, RavenDB will not be able to authenticate the client certificate.

When we looked at the actual server in question, it quickly became apparent what the issue was. The customer was accessing RavenDB using HTTPS, as is proper. However, RavenDB itself was not configured to run in a secured manner. In other words, the client was accessing RavenDB using HTTPS, but the proxies in the middle will then connect to RavenDB itself using HTTP (no security). That means that RavenDB talks to the proxy with no encryption and the proxy is able to see into the requests. That leads, of course, to the situation where the supported feature set of the proxy impacts what capabilities RavenDB can utilize.

This is a broken setup, I want to point out. It is also a highly misleading setup, because RavenDB is running in unsecured mode, but you are using HTTPS to access it. We intend to make this configuration setup raise an alert and block this from deployments. RavenDB goes to great lengths to ensure that you won’t have those pitfalls to stumble into. I have to admit that we have never actually considered this sort of setup as a scenario. I am strongly reminded of this.

RavenDB is amenable to running behind a proxy, of course. The key to doing so successfully is that the proxy is responsible for TCP traffic only, never interfering with the (encrypted) content that goes over the wire. As a result of this requirement, we don’t need to worry about the capabilities of the various proxies. As long as it is able to support TCP connections, all features of RavenDB will work.

time to read 3 min | 457 words

We ran into a really interesting bug. For some reason, the system was behaving in a totally unexpected manner for some parts of the data. For pretty much the same input, we would get the wrong result, and we couldn't figure out why.

Here is our source data:

This is some metrics data about servers, and you'll note that we report the CPU load for each core on the instance and that the results are sorted based on the actual load. Here is what this looks like; the image on the left is wrong, while the one on the right is right (pun intended).

[image: the wrong result (left) and the correct result (right)]

Why do we have this behavior?

Well, let’s look at the actual data, shall we?

[image: the two documents side by side]

They are… the same. Exactly the same, in fact. We can throw them into a diff engine and it will tell us that they are identical (except for the document id).

What is going on here?

Well, here is the issue: what you see is not what you get. Look at the JSON text that I have above, and compare that to the documents we see in the images. RavenDB shows the documents in a nicely formatted manner, and along the way, it messed up something pretty important.

In our case, we used an object to hold the various details about the instances, and we relied on the insertion order of the properties staying the same when reading the document. That is actually the case; RavenDB goes to great lengths to ensure it. However…

In order to prettify the document, we call JSON.parse() and JSON.stringify() (on the client side), which give us nicely formatted output. Along the way, however, we run into JavaScript and its "ideas" about how things should work. In particular, JavaScript treats properties whose key is a number differently from other keys: all the numeric properties are sorted according to their integer value, while non-numeric keys keep their insertion order. For example, { "10": "a", "2": "b", "name": "c" } comes back from JSON.parse() / JSON.stringify() as { "2": "b", "10": "a", "name": "c" }.

That only applies to documents that were modified in the Studio, however. The RavenDB server and the client API keep the properties in insertion order; only if you modified the document using the Studio would you get this behavior. But because we always show the documents in the same manner, it was invisible to us for a long while.

For that matter, it took an embarrassingly long time to debug this problem, because (naturally) whenever we viewed the data, we did so with formatting, which meant that we never actually saw the differences between the raw versions of the documents.

time to read 2 min | 294 words

I read this post and it took me very little time to spot a pretty nasty bug. Here is the relevant section:

The bug is on line #4. The code assumes that a call to read() will return less than the requested number of bytes only at the end of the file. The problem with that approach is that this is explicitly documented to not work this way:

It is not an error if the returned value n is smaller than the buffer size, even when the reader is not at the end of the stream yet. This may happen for example because fewer bytes are actually available right now (e. g. being close to end-of-file) or because read() was interrupted by a signal.

This is a super common error in many cases. And in the vast majority of the cases, that would work. Except when it wouldn’t.

The underlying implementation of File::read() will call read() or ReadFile(). ReadFile() (Windows) is documented to read as much as you requested, unless you hit the end of file. The read() call, on Unix, is documented to allow returning less than requested:

It is not an error if this number is smaller than the number of bytes requested

Aside from signals, the file system is free to do a partial read if it has some of the data in memory and some not. I’m not sure if this is implemented in this manner, but it is allowed to do so. And the results for the code above in this case are absolutely catastrophic (decryption will fail, encryption will emit partial information with no error, etc).
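
The same contract exists in .NET, by the way: Stream.Read() may also return fewer bytes than you asked for. The defensive pattern is the classic read loop (or Stream.ReadExactly() on .NET 7+); a sketch:

using System.IO;

// Keep reading until the buffer is actually full; Read() is allowed to return
// less than requested long before the end of the stream is reached.
static void ReadFully(Stream stream, byte[] buffer)
{
    int total = 0;
    while (total < buffer.Length)
    {
        int read = stream.Read(buffer, total, buffer.Length - total);
        if (read == 0)
            throw new EndOfStreamException($"Needed {buffer.Length} bytes, got {total}");
        total += read;
    }
}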

I'm writing this blog post because reading the code made the error jump out at me. I was bitten by this assumption too many times.

time to read 8 min | 1507 words

The topic of this post is a bug in RavenDB, a pretty serious one. The end result is that a user reported getting an error from RavenDB saying that it was unable to read a stored document. In some cases, RavenDB needs to read a document on startup, which means that it wasn't able to start up if that document was affected.

As you can imagine, this is one of those issues that gets our full and immediate attention. The error itself gave us a lot of information:

 Dictionary mismatch on Dic #375
   at Voron.Data.Tables.ZstdLib.AssertSuccess(UIntPtr v, CompressionDictionary dictionary)

This is related to RavenDB's document compression behavior. In order to get a great compression ratio for your documents, we train RavenDB on the recent documents that you have and generate a compression dictionary. The problem at hand is that the compression dictionary we have and the compression dictionary that was actually used are different. As you can see from the error, we are using zstd as the compression algorithm. When zstd generates a dictionary, it will (by default) generate an id for that dictionary that is mostly based on the xxhash64 of its content, rounded to 32 bits. You can see the relevant part here. This is pretty nice, since it means that there is a good chance that we'll detect the wrong dictionary.

So now we know what is going on, but we don’t understand why.

When we wrote this feature, we were quite aware that we would not be able to make any sort of sense of the documents if we didn't have the right dictionary. For that reason, we store the dictionaries three times: once inside RavenDB itself and twice in ancillary files, which we can use during recovery. This sort of error should be utterly impossible. And yet, we had run into it in production, so we had to dig deeper still.

The primary suspect was the dictionary training portion. One of the things that RavenDB does on a continuous basis is measure the compression ratio of the documents. If we aren't able to hit a good compression ratio, RavenDB will try to generate a new dictionary from the most recent documents and see if that new dictionary can do better. This can be very helpful in maintaining good compression rates. As your documents change, RavenDB will detect that, realize that it can do better, retrain on the recent data and compress even further. The problem is that this code path is also quite tricky: we first compress the document using the current dictionary, then we try generating a new dictionary and see if compressing with the new dictionary is better. If that is the case, we can install the new dictionary for future operations; otherwise, we need to discard it.

I suspected that the issue was somewhere around that area; we might not be handling the rejection of the new dictionary properly. So I went into the code and started digging, but I found absolutely nothing. The entire process is covered in tests and has been in production for close to 18 months, so this isn't something that obvious.

After spending quite a bit of time on the issue, I decided that the code was perfect: it handled everything properly and took all the right behaviors into account.

Clearly the fault was elsewhere. Before setting out to blame the nearest cat (you can never trust those), I had an idea: what if the problem wasn't during the training process, but afterward?

Well, that doesn't really matter, does it? RavenDB is a transactional database; if we had a failure after the training process, we would have to discard some of the data, for sure, but that would be about it. Unless… what if we have some state that isn't transactional? As part of looking at the compression training code, I ran into just such a scenario. Running the training to generate a new compression dictionary is an expensive proposition, so we don't want to do it often. As such, we'll only do it after roughly 1K document changes in which we exceed the desired compression ratio by over 10%. How do we know to act every 1K documents? Well, we have a counter that we increment on every change. That value is incremented using Interlocked.Increment() and isn't part of the transactional state. If the transaction is aborted, the value is still incremented. The actual value doesn't matter, mind, only that it is moving forward, so that isn't an issue.

I mentioned the dictionary id before, but I should clarify that this is zstd's dictionary id. Internally, RavenDB uses a different value. That value is simply the sequence number of the dictionary: RavenDB counts the number of generated dictionaries and gives the new dictionary the next available value. That value, by the way, is part of the transaction. If we roll back a transaction, we'll use the same dictionary id again. But that doesn't matter, of course.

When using compression dictionaries, we need to load them from a buffer. There is quite a bit of work involved in that: memory allocation, entropy tables to load, etc. In order to save repeated work, RavenDB caches the compression dictionaries (after all, their whole point is to be used repeatedly). That cache can be used by multiple transactions at the same time (two read transactions using the same dictionary will use the same instance).

Given all of this information, here is the sequence of events that we need to get the error in question:

  1. The user enabled documents compression.
  2. The user runs a transaction with at least four commands, which needs to satisfy the following conditions.
  3. A document write as the first action.
  4. Then a write to a document whose compression ratio exceeded the expected ratio by over 10%; as a result, RavenDB tried to train a new compression dictionary.
  5. That dictionary had a better compression ratio and was accepted as the new default compression dictionary.
  6. RavenDB persisted the new dictionary and used that to compress the new document.
  7. Another command (in the same transaction) stored a document in the same collection; now RavenDB will read the new dictionary and store it in a cache.
  8. A third command runs, but this one throws an error (such as optimistic concurrency violation).

At this point, RavenDB will roll back the entire transaction and return the error to the user. Let's say the user chose to submit the same two documents again, shall we?

For the first command, we'll again discover that the compression ratio (of the old compression dictionary) is insufficient. We will not generate a new compression dictionary, though. Why is that? Remember the counter that we increment using Interlocked? That one was not rolled back, so we'll need to wait for another 1K documents for the stars to align properly for us again. That doesn't impact correctness in any way, shape or form, however.

At this point, the stage is set, but everything is still okay. The problem will happen the next time we trigger a new dictionary. At that point, we'll again scan the most recent documents, build a dictionary, etc. However, the dictionary id that RavenDB will use will be identical to the dictionary id that we previously discarded. The data that dictionary was trained on, however, will almost certainly be different. We persist the new dictionary to disk and everyone is happy; the new document that we wrote will use the new compression dictionary and we are perfectly fine.

The next write for this collection, however, will run into a problem. It will need to use the current (new) dictionary to make the write. In order to do that, it will load the dictionary through the cache, but there is already a value for that dictionary id in the cache: the same dictionary that was discarded. At this point, RavenDB will start compressing documents using the in-memory dictionary while the on-disk dictionary is different.
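
A generic sketch of that hazard (this is not the actual RavenDB code; the type and member names are made up for illustration): a cache keyed by dictionary id keeps serving the instance that belonged to the rolled-back transaction once the id is reused.

using System;
using System.Collections.Concurrent;

class CompressionDictionary { /* entropy tables, locked memory, etc. */ }

class DictionaryCache
{
    private readonly ConcurrentDictionary<long, CompressionDictionary> _cache = new();

    // If dictionary id 42 was created, cached, and then its transaction rolled back,
    // the next dictionary assigned id 42 is shadowed by the stale entry: callers get
    // the old in-memory instance while the on-disk dictionary is different.
    public CompressionDictionary Get(long id, Func<long, CompressionDictionary> loadFromDisk)
        => _cache.GetOrAdd(id, loadFromDisk);

    // The fix described in the post: evict anything cached by a rolled-back
    // transaction, so the next access reloads the real dictionary from disk.
    public void EvictOnRollback(long id) => _cache.TryRemove(id, out _);
}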

If you try to access the document which triggered the new dictionary, you'll get an error, but documents that were modified later will continue working with no issue. Until you restart, of course.

On restart, we'll read the dictionary from disk, where we wrote the new dictionary; at this point, all those documents that we wrote will give us the error above. Note that the sequence of events has to be very exact: you need dictionary training as part of a multi-command transaction that failed after the dictionary training succeeded and then wrote additional documents. In a year and a half of production usage under very heavy load, that happened only a couple of times, it seems.

The issue has been fixed, of course, and we'll be rolling it out to both users and cloud customers. We'll now roll back such in-memory state on a transaction rollback as well, avoiding this issue entirely. It is amazing to me that despite very careful planning, it wasn't the code itself that caused the problem, but a sequence of independent operations and failure modes that we had never even considered.
