In the previous post, I talked about a massive amount of effort (2+ months of work) and about 25,000 lines of code changes. The only purpose of this task was to remove two locks from the system. During high load, we spent huge amounts of time contending for these locks, so removing them was well worth the effort.
During this work, I essentially found myself in the guts of Voron (RavenDB’s storage engine) and mostly dealing with old code. I’m talking about code that was written between 10 and 15 years ago. I wrote a blog post about it at the time. Working with old code is an interesting experience, especially since most of this code was written by me. I can remember some of my thoughts from the time I wrote it.
Old code is working code, and old code is also something that was built upon. Other parts of the codebase make assumptions about the way the system behaves. And the longer a piece of code goes without changing, the more likely its behavior is to ossify. Changing old code is hard because of the many ways such dependencies can express themselves.
I dug through all of this decade-plus old code and I realized something pretty horrible.
It turns out that I made a mistake in understanding how Windows implements buffering for memory-mapped files. I realized my mistake around mid-2024; see the related post for the actual details.
The TL;DR, however, is that when using unbuffered file I/O alongside memory-mapped files on Windows, you cannot expect the mapped memory to reflect the data written using the file I/O API. Windows calls this property coherence, and it was quite confusing when I first realized what the problem was. It turns out that this applies only to unbuffered I/O; there is no such problem with buffered I/O.
The scenario I needed to work with can use buffered I/O, however, which came as a profound shock to me. Large portions of Voron's architecture are actually shaped by this limitation.
Because I thought that you couldn’t use both file I/O and memory-mapped files at the same time in Windows and get a consistent view of the data (the documentation literally says that, I have to add), RavenDB used memory-mapped I/O to write to the data file. That is a choice, certainly, but not one that I particularly liked. It was just that I was able to make things work and move on to other things.
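To make the combination concrete, here is a minimal sketch (purely illustrative, not RavenDB code; it assumes a 4,096-byte sector size and skips most error handling) of the scenario the documentation warns about: the same file opened for unbuffered file I/O and also memory-mapped.
#include <windows.h>
#include <malloc.h>   // _aligned_malloc / _aligned_free
#include <string.h>
#include <stdio.h>
int main(void)
{
    // Unbuffered I/O requires sector-aligned offsets, lengths, and buffers;
    // 4096 bytes is assumed here for the sake of the example.
    const DWORD sectorSize = 4096;
    HANDLE file = CreateFileW(L"coherence-test.bin", GENERIC_READ | GENERIC_WRITE,
                              FILE_SHARE_READ, NULL, CREATE_ALWAYS,
                              FILE_FLAG_NO_BUFFERING, NULL);
    if (file == INVALID_HANDLE_VALUE) return 1;
    // Give the file one sector of 'A's so there is something to map.
    char* buffer = (char*)_aligned_malloc(sectorSize, sectorSize);
    memset(buffer, 'A', sectorSize);
    DWORD written = 0;
    if (!WriteFile(file, buffer, sectorSize, &written, NULL)) return 1;
    // Map the same file so the data can also be read through memory.
    HANDLE mapping = CreateFileMappingW(file, NULL, PAGE_READONLY, 0, 0, NULL);
    if (mapping == NULL) return 1;
    const char* view = (const char*)MapViewOfFile(mapping, FILE_MAP_READ, 0, 0, 0);
    if (view == NULL) return 1;
    // Overwrite the first sector with 'B's through the unbuffered handle...
    memset(buffer, 'B', sectorSize);
    OVERLAPPED at = { 0 }; // offset 0
    if (!WriteFile(file, buffer, sectorSize, &written, &at)) return 1;
    // ...and the mapped view is not guaranteed to show the new bytes.
    printf("view[0] = %c (may well still be 'A')\n", view[0]);
    UnmapViewOfFile(view);
    CloseHandle(mapping);
    CloseHandle(file);
    _aligned_free(buffer);
    return 0;
}
Drop FILE_FLAG_NO_BUFFERING, and the mapped view and WriteFile do stay coherent - which is exactly the realization that triggered this whole piece of work.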
This is another tale of accidental complexity, actually. I had a problem and found a solution to it, which at the time I considered quite clever. Because I had a solution, I never tried to dig any deeper into it and figure out whether this is the only solution.
This choice of using only memory-mapped I/O to write to the data file had consequences. In particular, it meant that:
- We had to map the data using read-write mode.
- There was no simple way to get an error if a write failed - since we just copied the data to memory, there was no actual write to fail. An error writing to disk would show up as a memory access violation (segmentation fault!) or just not show up at all (see the sketch after this list).
- Writing to a page that isn’t in memory may require us to read it first (even if we are overwriting all of it).
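To illustrate the second point, here is a hedged sketch (a hypothetical helper, not Voron's code) of what it takes on Windows to even notice that a write through a mapped view has gone wrong:
#include <windows.h>
#include <string.h>
// Copying into a writable mapped view can fail inside the memcpy itself, if the
// page has to be brought in from disk and that I/O fails. The failure surfaces
// as EXCEPTION_IN_PAGE_ERROR, not as a return code - and a failure to flush the
// dirty page later may not surface in the application at all.
int copy_to_mapped_view(void* dest, const void* src, size_t len)
{
    __try
    {
        memcpy(dest, src, len);
        return 0;
    }
    __except (GetExceptionCode() == EXCEPTION_IN_PAGE_ERROR
                  ? EXCEPTION_EXECUTE_HANDLER
                  : EXCEPTION_CONTINUE_SEARCH)
    {
        return -1; // the only hint we get that the underlying I/O failed
    }
}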
I accepted those limitations because I thought that this was the only way to do things. When I realized that I was wrong, that opened up so many possibilities. As far as the refactoring work, the way Voron did things changed significantly. We are now mapping the data file as read-only and writing to it using file I/O.
That means we have a known point of failure if we fail to write. That probably deserves some explanation. Failure to write to the disk can come in a bunch of ways. In particular, successfully writing to a file is not enough to safely store data, you also need to sync the file before you can be assured that the data is safe. The key here is that write + sync ensures that you’ll know that this either succeeded or failed.
Here is the old way we wrote to the data file. Conceptually, it looks like this:
auto mem = EnsureFileSize(pagesToWrite[pagesToWriteLength - 1].EndPosition);
for(auto i = 0; i < pagesToWriteLength; i++)
{
    auto page = pagesToWrite[i];
    memcpy(mem + page.Number * 8192, page.Buffer, page.Length);
}
// some time later
if(FAILED(msync(mem)))
    return SYNC_FAILURE;
And here is the first iteration of using the file I/O API for writes.
fallocate_if_needed(pagesToWrite[pagesToWriteLength - 1].EndPosition);
for(auto i = 0; i < pagesToWriteLength; i++)
{
    auto page = pagesToWrite[i];
    if(FAILED(pwrite(page.Number * 8192, page.Buffer, page.Length)))
        return WRITE_FAILURE;
}
// some time later
if(FAILED(fdatasync(file)))
    return SYNC_FAILURE;
Conceptually, this is just the same, but notice that we respond immediately to write failures here.
When we started testing this feature, we realized something really interesting. The new version was much slower than the previous one, and it also generated a lot more disk writes.
I moved mountains for this?
Sometimes you get a deep sense of frustration when you look at benchmark results. The amount of work invested in this change is… pretty high. And from an architectural point of view, I’m loving it. The code is simpler, more robust, and allows us to cleanly do a lot more than we used to be able to.
The code also should be much faster, but it wasn’t. And given that performance is a critical aspect of RavenDB, that may cause us to scrap the whole thing.
Looking more deeply into the issue, it turned out that my statement about old code and the state of the system was spot on. Take a look at the two code snippets above and consider how they look from the point of view of the operating system. In the case of the memcpy() version, there is a strong likelihood that the kernel isn't even involved (the pages are already paged in), and the only work done here is marking them as dirty (done by the CPU directly). That means that the OS will figure out that it has stuff to write to the disk either when we call msync() or when its own timer fires. On the other hand, when we call pwrite(), we involve the OS at every stage of the process, making it far more likely that it will start the actual write to the disk earlier. That means that we are wasting batching opportunities.
In other words, because we used memory-mapped writes, we (accidentally, I might add) created a situation where we tried very hard to batch those writes in memory as much as possible. Another aspect here is that we are issuing a separate system call for each page, which means we are also paying the system call overhead once per page.
The good thing about this is that we now have a good way to deal with those issues. The pwrite() code above was simply the first version used to test things out. Since we now have the freedom to choose, we can use whatever file I/O we want.
In particular, RavenDB 7.1 now supports the notion of write modes, with the following possible options:
- mmap - exactly like previous versions, uses a writable memory map and memcpy() to write the values to the data file.
- file_io - uses pwrite() to write the data, one page at a time, as shown above.
- vectored_file_io - uses pwritev() to write the data, merging adjacent writes to reduce the number of system calls we use (POSIX only, since Windows has strict limits on this capability). There is a sketch of this approach below.
- io_ring - uses HIORING (Windows) / io_uring (Linux) to submit the whole set of work to the kernel as a single batch of operations.
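To give a concrete feel for the vectored_file_io mode, here is a minimal sketch (illustrative only, with hypothetical names and no handling of short writes) of merging adjacent pages into a single pwritev() call:
#include <sys/types.h>
#include <sys/uio.h>   // pwritev
#define PAGE_SIZE 8192
#define MAX_BATCH 64   // stay well under IOV_MAX
// Hypothetical page descriptor, mirroring the fields used in the snippets above.
struct Page { long Number; void* Buffer; size_t Length; };
// Writes pages (sorted by position in the file) using one pwritev() per run of
// adjacent pages. Returns 0 on success, -1 on the first failed write.
int write_pages(int fd, struct Page* pages, int count)
{
    struct iovec iov[MAX_BATCH];
    int i = 0;
    while (i < count)
    {
        off_t start = (off_t)pages[i].Number * PAGE_SIZE;
        off_t end = start;
        int n = 0;
        // Gather consecutive pages that form one contiguous region of the file.
        while (i < count && n < MAX_BATCH &&
               (off_t)pages[i].Number * PAGE_SIZE == end)
        {
            iov[n].iov_base = pages[i].Buffer;
            iov[n].iov_len = pages[i].Length;
            end += (off_t)pages[i].Length;
            n++;
            i++;
        }
        if (pwritev(fd, iov, n, start) < 0)
            return -1; // errno holds the reason for the failure
    }
    return 0;
}
The merging is what wins back the batching that the memory-mapped version used to get for free from the page cache.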
RavenDB will select the appropriate mode for the system on its own, usually selecting io_ring for modern Linux and Windows machines, and vectored_file_io for Mac. You can control that using the RAVEN_VORON_WRITER_MODE environment variable, but that is provided only because we want to have an escape hatch, not something that you are meant to configure.
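Just to give a sense of the shape of the io_ring mode, here is a rough sketch of the Linux side using liburing (an illustration under assumed names, not the actual RavenDB implementation):
#include <liburing.h>
#define PAGE_SIZE 8192
// Hypothetical page descriptor, as above.
struct Page { long Number; void* Buffer; unsigned Length; };
// Submits all the page writes plus a trailing fdatasync as one io_uring batch.
// Assumes count is small enough for a single ring; a real implementation would
// submit in chunks and handle short writes.
int write_pages_uring(int fd, struct Page* pages, int count)
{
    struct io_uring ring;
    if (io_uring_queue_init(count + 1, &ring, 0) < 0)
        return -1;
    for (int i = 0; i < count; i++)
    {
        struct io_uring_sqe* sqe = io_uring_get_sqe(&ring);
        io_uring_prep_write(sqe, fd, pages[i].Buffer, pages[i].Length,
                            (__u64)pages[i].Number * PAGE_SIZE);
    }
    // Queue the sync as well; IOSQE_IO_DRAIN makes it wait for the writes above.
    struct io_uring_sqe* sync_sqe = io_uring_get_sqe(&ring);
    io_uring_prep_fsync(sync_sqe, fd, IORING_FSYNC_DATASYNC);
    io_uring_sqe_set_flags(sync_sqe, IOSQE_IO_DRAIN);
    // A single system call hands the whole batch to the kernel.
    io_uring_submit(&ring);
    // Reap every completion and check its result.
    int rc = 0;
    for (int done = 0; done < count + 1; done++)
    {
        struct io_uring_cqe* cqe = NULL;
        if (io_uring_wait_cqe(&ring, &cqe) < 0)
        {
            rc = -1;
            break;
        }
        if (cqe->res < 0)
            rc = -1; // this particular write (or the sync) failed
        io_uring_cqe_seen(&ring, cqe);
    }
    io_uring_queue_exit(&ring);
    return rc;
}
Conceptually it is still the same write + sync contract as the pwrite() version, just handed to the kernel in one submission instead of one system call per page.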
With those changes, we are on a much better footing for overall performance, but we aren't done yet! I would love to give you the performance numbers, but we didn't actually run the full test suite with just these changes in place. That is because there is still more to come, and I'll cover that in the next post.