Oren Eini

CEO of RavenDB

a NoSQL Open Source Document Database

Get in touch with me:

oren@ravendb.net +972 52-548-6969

Posts: 7,520
|
Comments: 51,141
Privacy Policy · Terms
filter by tags archive
time to read 3 min | 580 words

rodentia-icons_fsguard-plugin-urgent-300pxI talked about the details of Voron in the previous posts, how it handles journaling, MVCC and cleaning up after itself. In this post, I want to focus on another aspect that needs to be considered, the various costs of running  Voron on production systems. In particular, the competing I/O requirements.

So what do we have with Voron?

  • A (potentially very large) memory mapped data file. Buffered writes and fsync once every 1 minute / 2GB.
  • Scratch files (small memory mapped files) marked as temporary and delete on close.
  • Journal files requiring durable writes.

In terms of priorities, we want to give high priority to the journal files, then to writing to the data file (so it will happen all the time, not just when we call fsync). Scratch files should only be written to disk under memory pressure, and we should strive to avoid that if possible.

On both Windows and Linux, there are ways to ask the system to start flushing the data to disk (Windows uses FlushViewOfFile, Linux uses sync_file_range), but in practice, when we flush the data to disk we need to also ensure durability, so we call FlushViewOfFile + FlushFileBuffers on Windows and msync(MS_SYNC) on Linux to ensure that. Technically speaking, we could do this in two stages, allowing the system some time to do this lazily, then calling FlushFileBuffers / fsync, but we haven’t found that to be advantageous in terms of complexity, and sync_file_range documentation is scary.

Another aspect that we need to consider is the fact that we are not along out there. A typical RavenDB database will have multiple Voron instances running, and a typical RavenDB server will have multiple RavenDB databases running. So we are talking about typically having dozens or more Voron instances in a single process. We need to avoid a conflict between all of those instance, each of which is trying to make use of all the system resources by itself. This kind of disharmony can kill the performance of the server, all the while giving the best performance in any benchmark where you are running a single instance.

We solved this by having a single actor responsible for scheduling the flushing of all the Voron instances inside a process. It accept flush requests and make sure that we aren’t loading the I/O system too much. This means that we might actually defer flushing to disk under load, but in practice, reducing the I/O competition is going to improve throughput anyway, so that is likely to be better in the end. At the same time, we want to take advantage of the parallelism inherit in many high end systems (RAID, cloud, etc) which can handle a lot of IOPS at the same time. So the policy is to give a certain number of Voron instance the chance to run in parallel, with adjustments depending on the current I/O load on the system.

Journal writes, however, happen immediately, have high priority and should take precedent over data file writes, because they have immediate impact on the system.

We are also experimenting with using the operation system I/O priorities, but that is a bit hard, because most of those are about reducing the I/O priorities. Which we sort of want, but not that much.

time to read 4 min | 751 words

imagebot

In the previous post, I talked about the usage of scratch files to enable MVCC and the challenges that this entails. In this post, I want to talk about the role the transaction journal files play in all of this. I talked a lot about how to ensure that transaction journals are fast, what goes into them, etc. But this post is how  they are used inside Voron.

The way Voron stores data inside the transaction journal is actually quite simple. We have a transaction header, which contains quite a bit of interesting information, and then we have all the pages that were modified in this transaction, compressed.

image

The fact that we are compressing pages can save on a lot of the amount of I/O we write. But the key aspect here is that a transaction is considered committed by the Voron when we complete the write of the entire thing to stable storage. See the post above to a complete discussion on why it matters and how to do this quickly and with the least amount of pain.

Typically, the transaction journal is only used during recovery, so it is write only. We let the journal files to grow to about 64MB in size, then we create new ones. During database startup, we check what is the last journal file and journal file position that we have synced (more on that later), and we start reading from there. We read the transaction header and compare its hash to the hash of the compressed data. If they match (as well as a bunch of other checks we do), then we consider this to be a valid commit, and then we decompress the data into a temporary buffer and we have all the dirty pages that were written in that transaction.

We can then just copy them to the appropriate location in the data file. We continue doing so until we hit the end of the last file or we hit a transaction which is invalid or empty. At that point we stop, consider this the end of the valid committed transactions, and complete recovery.

Note that at this point, we have written a lot of stuff to the data file, but we have flushed it. The reason is that flushing is incredibly expensive, especially during data recovery where we might be re-playing a lot of data. So we skip it.  Instead, we rely on the normal flushing process to do this for us. By default, this will happen within 1 minute of the database starting up, in the background, so it will reduce the interruption for regular operations. This gives us a very fast startup time. And our in memory state let us know where is the next place we need to flush from the log, so we don’t do the same work twice.

However, that does mean that if we fail midway through, there is absolutely no change in behavior. In recovery, we’ll write the same information to the same place, so replaying the journal file become idempotent operation that can fail and recover without a lot of complexity.

We do need to clear the journal files at some point, and this process happens after we synced the data file. At that point, we know that the data is safely stored in the data file, and we can update our persistent state on where we need to start applying recovery the next time. Once those two actions are done, we can delete the old (and now unused) journal files. Note that at each part of the operation, the failure mode is to simply retry the idempotent operation (copying the pages from the journal to the data file), and there is no need for complex recovery logic.

During normal operation, we’ll clear the journal files once it has been confirmed that all the data it has was successfully flushed to the disk and that this action has been successfully recorded in stable storage. So in practice, database restarts after recovery are typically very fast, only needing to reply the last few transactions before we are ready for business again.

time to read 5 min | 1000 words

In my previosus post, I talked about how Voron achieves MVCC. Instead of modifying data in place, we copy the page or pages we want to modify to a scratch buffer and modify that. When the write transaction completes, we are updating a Page Translation Table so any reference to the pages that were modified would go to the right place in the scratch file.

Note, Voron uses mmap files as scratch buffers. I use the term scratch buffer / scratch file to refer to the same thing.

That is all well and good, and if you are familiar with how virtual memory works, this is exactly the model. In effect, every transaction get a snapshot of the entire database as it was when it was opened. Read transactions don’t modify the data, and are ensured to have a stable snapshot of the database. The write transaction can modify the database freely, without worrying about locking or stepping over other transactions.

This is all pretty simple, and the sole cost that we have when committing the transaction is flushing all the dirty pages to disk, and then making an atomic pointer swap to update the Page Translation Table.

However, that is only part of the job, if all the data modifications happens on the scratch buffer, what is going on with the scratch files?

Voron has a background process that monitor the database activity, and based on certain policy (size, time, load factor, etc) it will routinely write the data from the scratch files to the data file. This is a bit of an involved process, because we can’t just do this blindly.

Instead, we start by seeing what is the oldest active transaction that is currently operating. We need to find that out to make sure that we aren’t writing any page that this transaction might visit (thus violating the snapshot isolation of the transaction). Once we have the oldest transaction, we gather all the pages from the Page Translation Table that came from older transactions and write them to the data file. There are a couple of tricks that we use here. It is very frequent for the same page to be modified multiple times (maybe we updated the record several times in different transactions), so we’ll have multiple copies of it. But we don’t actually need to copy all of them, we just need to copy the latest version (up to the oldest active transaction).

The process of copying all the data from the scratch file to the data file can happen concurrently with both read and write transactions. After the flush, we need to update the PTT again (so we open a very short write transactions to do that), and we are done. All the pages that we have copied from the scratch buffer are marked as free and are available for future transactions to use. 

Note, however, that we haven’t called fsync on the data file yet. So even though we wrote to the data file, we made a buffered write, which is awesome for performance, but not so much for safety. This is done intentionally, for performance reasons. In my next post, I’ll talk about recovery and safety at length, so I’ll just mention that we fsync the data file once a minute or one once every 2GB or so. The idea is that we give the OS the time to do the actual flush on the background, before we just in and demand that this will happen.

Another problem that we have with the scratch buffer is that, like any memory allocation routine, it has issues. In particular, it has to deal with fragmentation. We use power of two allocator to reduce fragmentation as much as possible, but certain workloads can fragment the memory in such a way that it is hard / impossible to deal with it. In order to deal with that issue, we keep track on not just the free sections in the scratch buffer, but also on the total amount of used memory. If a request cannot be satisfied by the scratch buffer because of fragmentation, but there is enough free space available, we’ll create a new scratch file and use that as our new scratch. The old one will eventually be freed when all read transactions are over and all the data has been flushed away.

Scratch files are marked as temporary and delete of close, so we don’t actually incur a high I/O cost when we create new ones, and it typically only when we have very high workload of both reads and writes that we see the need to create new scratch files.This tend to be drastically cheaper than trying to do compaction, and it actually work in all cases, while compaction can fail in many cases.

You might have noticed an issue with the whole system. We can only move pages from the scratch file to the data file if it was modified by a transaction that is older than the oldest current transaction. That means that a long running read transaction can stall the entire process. This typically is only a problem when we are seeing very high write usage as well as very long read transactions, which pushes the envelope on the size of the scratch buffer but at the same time doesn’t allow to clean it.

Indeed, using Voron, you are typically aware on the need to close transactions in a reasonable timeframe. Within RavenDB, there are very few places where a transaction can span a long time (streaming is pretty much the only case in which we’ll allow it, and it is documented that if you have a very long streaming request, that push memory usage on the server up because we can’t clean the transaction). In practice, even transactions that takes multiple minutes are fine under moderate write load, because there is enough capacity to handle it.

time to read 5 min | 804 words

I talked about the different aspects of building a database engine in detail in the past month or so. But I tried to talk about each topic independently, so it will make sense. The problem is that in the real world, there are actually quite a lot of related stuff that impact on one another. This series of posts is meant to tie everything together, so you’ll have a better understanding how the design decisions in one place being affected by the requirement in somewhere that seems utterly unrelated.

Before we can talk about the implementation details, let us see what we are trying to achieve. Voron is:

  • High performance.
  • Single write, multiple readers (MVCC)
  • Fully ACID

In this post, I’m not going to talk about the data model, or how we sort it, or anything like that. No, we are at a much lower level than that. We are at how we access the raw data pages and manage them.

There are actually multiple players involved here. We have the journal for durability of writes, we have the data file to store the data, the scratch file to implement Multi Versioning Concurrency Control and the Page Translation Tables to provide snapshot isolation for concurrent transactions.

The  design of Voron is immensely simplified by the fact that we choose to go with a single writer model. We share this design decision with other databases engines such as LMDB, LevelDB, RocksDB, etc. Concurrent write transactions are much more complex and require a lot more effort, and you still have the serialization at the journal level, although I explored multiple ways around it. With Voron, we decided to go with a single write transaction for the simplicity, and then implemented transaction merging on top of that, which give us a tremendous performance boost in high load scenarios.

But let us talk about MVCC. The idea is that we have concurrent versions of the data, so each transaction has a snapshot of the entire database and can operate on that without fear of write transactions modifying data while it is running. Let us explore how this works when the database starts.

The key to that is the notion of the page translation table, from now on, known as the PTT. When the database starts, we have an empty PTT, and the data file itself. We open a read transaction, which has the following data:

ReadTx-1:

  • PTT: [ /* no entries */
  • Data file

Whenever the read transaction need to read a page, it consults the PTT, find that there is nothing there, and read the page from the data file. We keep the read transaction open, and open a new write transaction. It also gets a PTT and the data file, but it also needs to keep track of a few other things:

WriteTx-2:

  • PTT: [/* no entries */]
  • Data file
  • Dirty pages

Now, we want to make a change to the database, which happens to fall on Page #3. Here we have problem, we can’t modify the data file directly, ReadTx-1 is still running, and it might want to read the data in Page #3 at some point. Instead of modifying the data directly, we copy the page into the scratch file.

The scratch file is just a temporary file that we use to store data copies. After we copy the data, we update the PTT. Now when we search for Page #3, we’ll find the location of the page in the scratch file. As far as the write transaction is concerned, this doesn’t matter. A page is a page is a page, and it doesn’t matter where it is at.

Committing the transaction means taking all of the dirty pages in the write transaction and writing them to the log. After which we atomically set the PTT for the write transaction as the global PTT. Now, all future read transactions will have the new PTT and when they will ask for Page #3, they will get the page from the scratch file.

A new write transaction that needs to (again) modify Page #3, will create another copy of the Page inside the scratch file.This ends up looking like this:

image

We have three copies of Page #3. One for the original read transaction (in the data file), one for the current read transactions (yellow in the scratch file) and the current modified page (orange in the scratch file) that we are writing to.

When the write transaction completes, we again flush the dirty pages to the journal and then publish our PTT so all future transactions can see the changes.

Of course, that is just one side of it, in my next post, I’ll discuss how we clear the scratch file and move data back to the data file.

time to read 3 min | 522 words

I mentioned that maintaining physical ids is important for performance reasons in my previous post, but I skipped on exactly why. The short answer is that if I have a physical ids, it is much easier to implement locality and much easier to implement parallel locality.

Let us imagine a database whose size is about 100GB, running on a machine that has 6 GB of RAM. You need to do run some sort of computation that traverse the graph, but doing so naively will likely cause us to trash quite a lot, as we page memory in and out of the disk, only to jump far away in the graph, paging even more, and effectively killing all your performance.

Instead, we can do something like this, let us imagine that you have a machine with 4 cores on it, and the previous mention setup. And then you start 4 threads (each marked with a different color on the image, and start processing nodes.

image

However, there is a trick here, each thread has a queue, and only ids that fall without the area of responsibility of the thread will arrive there. But we aren’t done, inside a thread we define additional regions, and route requests to process each region into each own queue.

Finally, within each thread, we process one region at a time. So the idea is that while we are running over a region, we may produce work that will need to run on other regions (or even other threads), but we don’t care, we queue that work and continue emptying the work that exists on our own region. Only once once we have completed all work in a particular region will we move to the next one. The whole task complete when, in all threads, there are no more regions with work to be done.

Note that the idea here is that each thread is working on one region at a time, and that region maps to a section of the database file that was memory mapped. So we keep that are of the page cache alive and well.

When we move between regions, we can hint to the memory manager that we are going to need the next region, etc. We can’t escape the need to process the same region multiple times, because processing in one region may lead us to processing in another, and then back, but assuming we run the regions using least recently accessed, we can take advantage on the stuff remaining in the page cache (hopefully) from the previous run and using that.

Which is why the physical location on disk is important.

Note that the actual query that we run is less important. Typical graph queries are fall into one of two categories:

  • Some sort of Breadth First Search or Depth First Search and walking through the graph. 
  • Finding a sub-graph in the larger graph that matches this criteria.

In both cases, we can process such queries using the aforementioned process, and the reduction in random work that the database has to do is big.

time to read 4 min | 647 words

A few posts ago, I talked about the problem of having unstable ids, in particular, ids that can be reused. That leads to quite a lot of complexity, as anyone who ever had to deal with Lucene documents ids knows.

So we are willing to pay something toward stable ids, the questions is what?

One way of doing that is to just store the physical id (unstable) and a virtual id (stable) in a B+Tree (actually, a pair of them, since you’ll need to refer to them back and forth). That means that for the most part, internally to the engine, we’ll use the physical id (with its nice property of having O(1) access time), but externally we’ll expose the stable virtual id (probably sequential numbering, since that is easiest).

Note that I still want to use the physical ids, I’ll discuss exactly why that is important in my next post, for now, let us just say that it is an important component of ensuring high performance for large datasets.

The problem with using B+Tree is that the cost of finding the virtual <—> physical id mapping is O(logN), which for 10 million nodes and 100 million edges is 23 & 24 respectively. Except that this isn’t the real cost function for B+Tree.

Assuming that we have 255 items per page, we actually would need to do 4 page lookups, and a total of 54 comparisons to find the right value. For the edges, we would need 5 page look ups and over 60 comparisons.  Note that this isn’t an issue on its own, but it is an issue when we are talking about having this kind of cost in the hot path of the application. And this is very likely going to be in the hot path.

Oh, there are ways around it, we can only translate back and forth at the edges of the database, so internally we’ll always use the physical address, and only translate it out when we are done. But that is hard to actually do properly, since you need the virtual address for a whole lot of stuff all over the place.

We can steal the idea of page translation tables from the processor. Something like this:

image

Effectively, we’ll lazy allocate segments of pages and pull them together into a hierarchy. So finding out the physical address of id 84 would involve looking at the root, finding the next page down with mod operation, and so forth until we find the right value and check there. This has the advantage of being simple, O(1) and obvious. It is also pretty good in terms of space saving, since the virtual id can be “stored” without taking any space (it is the position of the physical id in the “array” we created.

This has one drawback, there is no way to recover space. Because the indexer into this data structure is meaningful, we can’t just compact things. Once space is allocated, that is it.  Now, to be fair, the cost in size here for all 100 million edges is about 0.75 GB, so not meaningful in the long run, but if we have a busy database that always write and delete, we have no way to recover the space.

The “proper” answer, by the way, is to implement an external hash table. That has the property of O(1), can grow and shrink as the amount of data changes. I’m not presenting it here mostly because it is something that we haven’t yet had the need to implement in Voron, so it isn’t something that I can just show and move on. Beside, it is fun to explore all the wrong ways of doing something.

time to read 3 min | 542 words

A customer reported that under memory constrained system, a certain operation is taking all the memory and swapping hard. On a machine with just a bit more memory, the operation completed very quickly. It didn’t take long to figure out what was going on, we were reading too much, and we started swapping, and everything went to hell after that.

The problem is that we have code that is there specifically to prevent that, it is there to check that the size that we load from the disk isn’t too big, and that we aren’t doing something foolish. But something broke here.

Here is a sample document, it is simple JSON (without indentation), and it isn’t terribly large:

image

The problem happens when we convert it to a .NET object:

image

Yep, when we de-serialized it, it takes close to 13 times more space than the text format.

For fun, let us take the following JSON:

image

This generates a string whose size is less than 1KB.

But when parsing it:

image

The reason, by the way? It is the structure of the document.

The reason, by the way:

image

So each two bytes for object creation in JSON ( the {} ) are holding, we are allocating 116 bytes. No wonder this blows up so quickly.

This behavior is utterly dependent on the structure of the document, by the way, and is very hard to protect against, because you don’t really have a way of seeing how much you allocated.

We resolved it by not only watching the size of the documents that we are reading, but the amount of free memory available on the machine (aborting if it gets too low), but that is a really awkward way of doing that.  I’m pretty sure that this is also something that you can use to attack a server, forcing it to allocate a lot of memory by sending very little data to it.

I opened an issue on the CoreCLR about this, and we’ll see if there is something that can be done.

In RavenDB 4.0, we resolved that entirely by using the blittable format, and we have one-to-one mapping between the size of the document on disk and the allocated size (actually, since we map, there is not even allocation of the data, we just access it directly Smile).

time to read 2 min | 247 words

I talked about high level and low level data operations. So far, all we have imageseen are very low level operations (get node, get edges for, etc).

Let us see how we’ll deal with a bigger challenge. In this case, we want to implement a classic graph operation, doing a depth first search, filtering by both nodes and edges.

Here is how we can implement this:

In the real world, we’ll need quite a bit more. On each node (and edge) we’ll need to decide if to return it from the query, or just traverse through it, etc. And that is just to start with.

But I think this demonstrate the point of how to layer behavior on top of the lower level database operations.

There is one thing that we need to talk about still, this code will actually use a lot of individual transactions, one for each independent operation. That is quite expensive, we can open a single transaction and pass it to the functions we call, so there is just a single cost for the entire duration of the operation.

Other things we can do is to explicitly designate specific scenarios as important and change the design so we can answer them very quickly (as in the O(1) cost for accessing nodes/edge data).

time to read 2 min | 396 words

image

I  keep calling this a toy database, and it is true for more reasons than the code existing mostly as unconnected snippets. When defining the storage layer, I breezed through quite a lot of stuff because they didn’t really matter for the point I was trying to make.

We’ll start with talking about node IDs. When we save a node to the database, we get an int64 ID back. What is that? We know that it gives us an O(1) access time to the node (or the edge data), but that’s about it. Typically, we don’t expose the internal table ID in this manner. Because the ID we get back from the Insert corresponds to the location on the disk where that data exists. So the node ID is something that is arbitrary and doesn’t correlate to anything. It isn’t a sequential number, or something that the user defines, or anything like that.

That can lead to issues. In particular, if you want to look up a node by some property, you need to have a way to do so that will retrieve its ID, and only then you can do graph operations from there. The common solution is to use a Lucene index for searching by the properties and finding the root node(s) and proceed from there.

But what about deletes? Deleting a node is possible, and when you do that, the space that was reserved for that node will be freed, and can be reused, so you’ll have a different node with the same ID. This leads to some awkwardness (you can see that with the Lucene document IDs, which have the same problem, although for very different reasons).

Updates also pose a problem, because if you need to extend the size of the node, it might be relocated, which changes the ID. Deleting is a challenge, because you need to remove the deleted node ID from all the edges that reference it, instead of cleaning it up on the fly.

This leads us to either abstract the physical ID with a virtual one (and pay the O(logN) cost for resolving it) or find a way to deal with the above inside your API.

time to read 5 min | 802 words

imageSo now that we know how to store the data, in a way that allows efficient graph traversal, let’s compute some back of the envelope computations for storage costs.

Like any storage system, Voron needs to store some metadata about our data, and sometimes this can be very surprising to people.

Let’s look at each of the items that we store in turn.

  • Node data is stored in a table.
  • Edge data is stored in a table.
  • The edge itself is stored in a B+Tree containing fixed size trees.

A table does a bunch of stuff, including reserving some space on the disk, but we don’t have dynamic tables here, so those costs are relatively fixed.

The cost per item, however, depends on the size of the data. If the data size is less than 2036 bytes, then the cost of storing the data is a 4 bytes overhead. If, however, the size of the data item is higher than 2036, we round it up to 4KB section.

In other words, storing ten million nodes, which measure 1KB in size, will cost us about 40 MB  of overhead (compared to roughly 10 GB of data). However, if the size of the data is 2KB, we need to store them in a full page. The reason for this, by the way, is that we need to balance the cost of insert and the cost of update. So we only modify things on page boundary (in this case, 4KB). If the value is small, we pack multiples of them in a single page, but beyond a certain size, we just allocate dedicated pages for them, and live with a bit of free space in the end.

More interesting is the storage of the edge data, actually. A B+Tree costs a minimum of 4KB, and we have one of these per each of the edge types that we have. In practice, we don’t expect there to be a high number of edge types, and we can readily ignore this as fixed size costs. In most cases, I would be stunned to hear that there is more than a single 4KB page for all your edges types (should be enough for a hundred or so).

What isn’t fixed size is the number of fixed size tree (one per source node) and the number of entries in the fixed size trees (one per connection). The whole reason we have fixed size trees is that they allow us to make efficient use of storage by making assumptions. You can see this in their usage. A fixed size tree has an int64 as the key, and you need to specify upfront how much space you need for the values. That makes it very simple to write.

Fixed size trees actually have two forms, they can be embedded or they can be free floating. That mostly depends on their size. If they are embedded, they are stored inside the parent tree, but if they are too large, we store them in their own page. Embedded usage takes 6 bytes per fixed size tree, we have 8 bytes for the key, and the entry size itself (which in our case is also 8 bytes). So a total of 16 bytes per entry inside the fixed size tree.

What this means, in practice, is that up until the moment a node has more than 254 connections, it can be stored as embedded value. When it goes beyond that, we’ll spill over to a full page and start taking space at 4KB increments.

One thing that I didn’t mention is that we store the fixed size trees (keyed by their source node ID), inside a parent B+Tree. Here we need to track a lot more information (keys and values have different sizes, etc). The overhead cost per entry in a B+Tree is 14 bytes. Add to that the 8 bytes for the source node id, and it comes to 22 bytes per source node.

Given all of those numbers, if we had a graph with 10 million nodes and each node was connected to a 10 other nodes in average, and each node/edge was 0.5KB in size, we would have:

  • 5 GB – Nodes data – 10,000,000
  • 50 GB – Edges data – 100,000,000
  • 80 MB – overhead data for nodes & edges.
  • 1.75 GB – edges information about all nodes.

Note that in such a graph, we have 10 million nodes, but a hundred million edges. All of which can fit comfortably into RAM on a good server, and give you blazing good performance.

FUTURE POSTS

No future posts left, oh my!

RECENT SERIES

  1. Challenge (75):
    01 Jul 2024 - Efficient snapshotable state
  2. Recording (14):
    19 Jun 2024 - Building a Database Engine in C# & .NET
  3. re (33):
    28 May 2024 - Secure Drop protocol
  4. Meta Blog (2):
    23 Jan 2024 - I'm a JS Developer now
  5. Production Postmortem (51):
    12 Dec 2023 - The Spawn of Denial of Service
View all series

RECENT COMMENTS

Syndication

Main feed Feed Stats
Comments feed   Comments Feed Stats
}