Oren Eini

CEO of RavenDB

a NoSQL Open Source Document Database

Get in touch with me:

oren@ravendb.net +972 52-548-6969

Posts: 7,520
|
Comments: 51,141
Privacy Policy · Terms
filter by tags archive
time to read 4 min | 753 words

RavenDB allows you to query your data freely and cheaply. It is one of those things that makes or breaks a database, after all.

After over a decade of working with Lucene as our backend indexing engine, we built Corax, a new querying & indexing engine that offers far better performance.

Building an indexing engine is a humongous task. It took us close to ten years from the first line of code to Corax actually shipping. But I’m really happy with the way it turned out.

Building a query engine is a big task, and we focused primarily on making the most common queries fast. The issue at hand is that RavenDB has many features, and we don’t have infinite time. So for the less common features, we typically implemented them as a straightforward port from whatever Lucene is doing.

One such feature is facets. Let’s say that I want to buy a jacket. There are way too many choices, so I can use a faceted query to help me narrow it down.  Here is what this looks like in code:


from Products
where search(Description, "suit jacket")
select facet(Brand), 
       facet(Price < 200, 
             Price between 200 and 400, 
             Price between 400 and 800,
             Price > 800)

And here is what this looks like as a website:

I mentioned that we implemented some features as a straightforward port from Lucene, right?

We did that because RavenDB offers very rich querying semantics, and we couldn’t spend the time to craft every single bit upfront. The idea was that we would get Corax out the door and be faster in most common scenarios, and at least at parity with everything else.

It works for most scenarios, but not all of them. We recently got a query similar to the one above that was slower in Corax than in Lucene. That is usually good news since we have far more optimization opportunities in Corax. Lucene (and especially our usage of it) has already been through the wringer so many times that it is really hard to eke out any more meaningful performance gains. Corax’s architecture, on the other hand, gives us many more chances to do so.

In the case of facets, the way Lucene handles that is roughly similar to this:


def brand_facet(matches: List[int]):
  facet = dict()
  for term, docsForTerm in reader.terms("Brand"):
     facet[term] = count_intersect(matches,docsForTerm)

Given the results of the query, run over all the terms for a particular field and intersect the documents for every term with the matches for the query. Lucene is able to do that efficiently because it materializes all its data into managed memory. That has costs associated with it:

  • Higher managed memory usage (and associated GC costs)
  • Slower initial queries

The benefit of this approach is that many operations are simple, which is great. Corax, on the other hand, does not materialize all its data into managed memory. It uses persistent data structures on disk (leading to reduced memory usage and faster responses on the first query).

The advantage we have with Corax is that the architecture allows us to optimize a lot more deeply. In this case, however, it turned out to be unnecessary, as we are already keeping track of all the relevant information. We just needed to re-implement faceted search in a Corax-native manner.

You can see the changes here. But here is the summary. For a dataset with 10,000,000 records, with hundreds of brands to facet on, we get:

Yes, that isn’t a mistake. Corax is so fast here that you can barely observe it 🙂.

time to read 3 min | 481 words

A customer called us about some pretty weird-looking numbers in their system:

You’ll note that the total number of entries in the index across all the nodes does not match. Notice that node C has 1 less entry than the rest of the system.

At the same time, all the indicators are green. As far as the administrator can tell, there is no issue, except for the number discrepancy. Why is it behaving in this manner?

Well, let’s zoom out a bit. What are we actually looking at here? We are looking at the state of a particular index in a single database within a cluster of machines. When examining the index, there is no apparent problem. Indexing is running properly, after all.

The actual problem was a replication issue, which prevented replication from proceeding to the third node. When looking at the index status, you can only see that the entry count is different.

When we zoom out and look at the state of the cluster, we can see this:

There are a few things that I want to point out in this scenario. The problem here is a pretty nasty one. All nodes are alive and well, they are communicating with each other, and any simple health check you run will give good results.

However, there is a problem that prevents replication from properly flowing to node C. The actual details aren’t relevant (a bug that we fixed, to tell the complete story). The most important aspect is how RavenDB behaves in such a scenario.

The cluster detected this as a problem, marked the node as problematic, and raised the appropriate alerts. As a result of this, clients would automatically be turned away from node C and use only the healthy nodes.

From the customer’s perspective, the issue was never user-visible since the cluster isolated the problematic node. I had a hand in the design of this, and I wrote some of the relevant code. And I’m still looking at these screenshots with a big sense of accomplishment.

This stuff isn’t easy or simple. But to an outside observer, the problem started from: why am I looking at funny numbers in the index state in the admin panel? And not at: why am I serving the wrong data to my users.

The design of RavenDB is inherently paranoid. We go to a lot of trouble to ensure that even if you run into problems, even if you encounter outright bugs (as in this case), the system as a whole would know how to deal with them and either recover or work around the issue.

As you can see, live in production, it actually works and does the Right Thing for you. Thus, I can end this post by saying that this behavior makes me truly happy.

time to read 4 min | 618 words

We recently got a support request from a user in which they had the following issue:


We have an index that is using way too much disk space. We don’t need to search the entire dataset, just the most recent documents. Can we do something like this?


from d in docs.Events
where d.CreationDate >= DateTime.UtcNow.AddMonths(-3)
select new { d.CreationDate, d.Content };

The idea is that only documents from the past 3 months would be indexed, while older documents would be purged from the index but still retained.

The actual problem is that this is a full-text search index, and the actual data size required to perform a full-text search across the entire dataset is higher than just storing the documents (which can be easily compressed).

This is a great example of an XY problem. The request was to allow access to the current date during the indexing process so the index could filter out old documents. However, that is actually something that we explicitly prevent. The problem is that the current date isn’t really meaningful when we talk about indexing. The indexing time isn’t really relevant for filtering or operations, since it has no association with the actual data.

The date of a document and the time it was indexed are completely unrelated. I might update a document (and thus re-index it) whose CreationDate is far in the past. That would filter it out from the index. However, if we didn’t update the document, it would be retained indefinitely, since the filtering occurs only at indexing time.

Going back to the XY problem, what is the user trying to solve? They don’t want to index all data, but they do want to retain it forever. So how can we achieve this with RavenDB?

Data Archiving in RavenDB

One of the things we aim to do with RavenDB is ensure that we have a good fit for most common scenarios, and archiving is certainly one of them. In RavenDB 6.0 we added explicit support for Data Archiving.

When you save a document, all you need to do is add a metadata element: @archive-at and you are set. For example, take a look at the following document:


{
    "Name": "Wilman Kal",
    "Phone": "90-224 8888",
    "@metadata": {
        "@archive-at": "2024-11-01T12:00:00.000Z",
        "@collection": "Companies",
     }
}

This document is set to be archived on Nov 1st, 2024. What does that mean?

From that day on, RavenDB will automatically mark it as an archived document, meaning it will be stored in a compressed format and excluded from indexing by default.

In fact, this exact scenario is detailed in the documentation.

You can decide (on a per-index basis) whether to include archived documents in the index. This gives you a very high level of flexibility without requiring much manual effort.

In short, for this scenario, you can simply tell RavenDB when to archive the document and let RavenDB handle the rest. RavenDB will do the right thing for you.

time to read 4 min | 765 words

I’m currently deep in the process of modifying the internals of Voron, trying to eke out more performance out of the system. I’m making great progress, but I’m also touching parts of the code that haven’t even been looked at for a long time.

In other words, I’m mucking about with the most stable and most critical portions of the storage engine. It’s a lot of fun, and I’m actually seeing some great results, but it is also nerve-wracking.

We have enough tests that I’ve great confidence I would catch any actual stability issues, but the drive back toward a fully green build has been a slog.

The process is straightforward:

  • Change something.
  • Verify that it works better than before.
  • Run the entire test suite (upward of 30K tests) to see if there are any breaks.

The last part can be frustrating because it takes a while to run this sort of test suite. That would be bad enough, but some of the changes I made were things like marking a piece of memory that used to be read/write as read-only. Now any access to that memory would result in an access violation.

I fixed those in the code, of course, but we have a lot of tests, including some tests that intentionally corrupt data to verify that RavenDB behaves properly under those conditions.

One such test writes garbage to the RavenDB file, using read-write memory. The idea is to verify that the checksum matches on read and abort early. Because that test directly modifies what is now read-only memory, it generates a crash due to a memory access violation. That doesn’t just result in a test failure, it takes the whole process down.

I’ve gotten pretty good at debugging those sorts of issues (--blame-crash is fantastic) and was able to knock quite a few of them down and get them fixed.

And then there was this test, which uses encryption-at-rest. That test started to fail after my changes, and I was pretty confused about exactly what was going on. When trying to read data from disk, it would follow up a pointer to an invalid location. That is not supposed to happen, obviously.

Looks like I have a little data corruption issue on my hands. The problem is that this shouldn’t be possible. Remember how we validate the checksum on read? When using encryption-at-rest, we are using a mechanism called AEAD (Authenticated Encryption with Associated Data). That means that in order to successfully decrypt a page of data from disk, it must have been cryptographically verified to be valid.

My test results showed, pretty conclusively, that I was generating valid data and then encrypting it. The next stage was to decrypt the data (verifying that it was valid), at which point I ended up with complete garbage.

RavenDB trusts that since the data was properly decrypted, it is valid and tries to use it. Because the data is garbage, that leads to… excitement. Once I realized what was going on, I was really confused. I’m pretty sure that I didn’t break 256-bit encryption, but I had a very clear chain of steps that led to valid data being decrypted (successfully!) to garbage.

It was also quite frustrating to track because any small-stage test that I wrote would return the expected results. It was only when I ran the entire system and stressed it that I got this weird scenario.

I started practicing for my Fields medal acceptance speech while digging deeper. Something here had to be wrong. It took me a while to figure out what was going on, but eventually, I tracked it down to registering to the TransactionCommit event when we open a new file.

The idea is that when we commit the transaction, we’ll encrypt all the data buffers and then write them to the file. We register for an event to handle that, and we used to do that on a per-file basis. My changes, among other things, moved that logic to apply globally.

As long as we were writing to a single file, everything just worked. When we had enough workload to need a second file, we would encrypt the data twice and then write it to the file. Upon decryption, we would successfully decrypt the data but would end up with still encrypted data (looking like random fluff).

The fix was simply moving the event registration to the transaction level, not the file level. I committed my changes and went back to the unexciting life of bug-fixing, rather than encryption-breaking and math-defying hacks.