Oren Eini, CEO of RavenDB, a NoSQL Open Source Document Database

time to read 3 min | 481 words

A customer called us about some pretty weird-looking numbers in their system:

You’ll note that the total number of entries in the index across all the nodes does not match. Notice that node C has one fewer entry than the rest of the system.

At the same time, all the indicators are green. As far as the administrator can tell, there is no issue, except for the number discrepancy. Why is it behaving in this manner?

Well, let’s zoom out a bit. What are we actually looking at here? We are looking at the state of a particular index in a single database within a cluster of machines. When examining the index, there is no apparent problem. Indexing is running properly, after all.

The actual problem was a replication issue, which prevented replication from proceeding to the third node. When looking at the index status, you can only see that the entry count is different.

When we zoom out and look at the state of the cluster, we can see this:

There are a few things that I want to point out in this scenario. The problem here is a pretty nasty one. All nodes are alive and well, they are communicating with each other, and any simple health check you run will give good results.

However, there is a problem that prevents replication from properly flowing to node C. The actual details aren’t relevant (a bug that we fixed, to tell the complete story). The most important aspect is how RavenDB behaves in such a scenario.

The cluster detected this as a problem, marked the node as problematic, and raised the appropriate alerts. As a result of this, clients would automatically be turned away from node C and use only the healthy nodes.

From the customer’s perspective, the issue was never user-visible since the cluster isolated the problematic node. I had a hand in the design of this, and I wrote some of the relevant code. And I’m still looking at these screenshots with a big sense of accomplishment.

This stuff isn’t easy or simple. But to an outside observer, the problem surfaced as: why am I looking at funny numbers in the index state in the admin panel? And not as: why am I serving the wrong data to my users?

The design of RavenDB is inherently paranoid. We go to a lot of trouble to ensure that even if you run into problems, even if you encounter outright bugs (as in this case), the system as a whole would know how to deal with them and either recover or work around the issue.

As you can see, live in production, it actually works and does the Right Thing for you. Thus, I can end this post by saying that this behavior makes me truly happy.

time to read 4 min | 618 words

We recently got a support request from a user in which they had the following issue:


We have an index that is using way too much disk space. We don’t need to search the entire dataset, just the most recent documents. Can we do something like this?


from d in docs.Events
where d.CreationDate >= DateTime.UtcNow.AddMonths(-3)
select new { d.CreationDate, d.Content };

The idea is that only documents from the past 3 months would be indexed, while older documents would be purged from the index but still retained in the database.

The actual problem is that this is a full-text search index, and the actual data size required to perform a full-text search across the entire dataset is higher than just storing the documents (which can be easily compressed).

This is a great example of an XY problem. The request was to allow access to the current date during the indexing process so the index could filter out old documents. However, that is actually something that we explicitly prevent. The problem is that the current date isn’t really meaningful when we talk about indexing. The indexing time isn’t really relevant for filtering or operations, since it has no association with the actual data.

The date of a document and the time it was indexed are completely unrelated. I might update a document (and thus re-index it) whose CreationDate is far in the past. That would filter it out from the index. However, if we didn’t update the document, it would be retained indefinitely, since the filtering occurs only at indexing time.

Going back to the XY problem, what is the user trying to solve? They don’t want to index all data, but they do want to retain it forever. So how can we achieve this with RavenDB?

Data Archiving in RavenDB

One of the things we aim to do with RavenDB is ensure that we have a good fit for most common scenarios, and archiving is certainly one of them. In RavenDB 6.0 we added explicit support for Data Archiving.

When you save a document, all you need to do is add a metadata element: @archive-at and you are set. For example, take a look at the following document:


{
    "Name": "Wilman Kal",
    "Phone": "90-224 8888",
    "@metadata": {
        "@archive-at": "2024-11-01T12:00:00.000Z",
        "@collection": "Companies"
    }
}

This document is set to be archived on Nov 1st, 2024. What does that mean?

From that day on, RavenDB will automatically mark it as an archived document, meaning it will be stored in a compressed format and excluded from indexing by default.
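
If you want to set this from code rather than by editing the document directly, here is a minimal sketch using the ravendb Node.js client (the URL, database name, and document id are placeholders, and the exact client calls may differ slightly, so check the client documentation):


const { DocumentStore } = require('ravendb');

// Placeholder server URL and database name
const store = new DocumentStore('https://your-ravendb-server', 'YourDatabase');
store.initialize();

async function scheduleArchiving(documentId, archiveAtIsoDate) {
    const session = store.openSession();
    const doc = await session.load(documentId);

    // Setting @archive-at in the metadata tells RavenDB when to archive the document
    const metadata = session.advanced.getMetadataFor(doc);
    metadata['@archive-at'] = archiveAtIsoDate; // e.g. '2024-11-01T12:00:00.000Z'

    await session.saveChanges();
}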

In fact, this exact scenario is detailed in the documentation.

You can decide (on a per-index basis) whether to include archived documents in the index. This gives you a very high level of flexibility without requiring much manual effort.

In short, for this scenario, you can simply tell RavenDB when to archive the document and let RavenDB handle the rest. RavenDB will do the right thing for you.

time to read 3 min | 485 words

I was talking to a colleague about a particular problem we are trying to solve. He suggested that we solve the problem using a particular data structure from a recently published paper. As we were talking, he explained how this data structure works and how that should handle our problem.

The solution was complex and it took me a while to understand what it was trying to achieve and how it would fit our scenario. And then something clicked in my head and I said something like:

Oh, that is just an epoch-based, copy-on-write B+Tree with single-producer/concurrent-readers?

If this sounds like nonsense to you, it is fine. Those are very specific terms that we are using here. The point of such a discussion is that this sort of jargon serves a very important purpose. It allows us to talk with clarity and intent about fairly complex topics, knowing that both sides have the same understanding of what we are actually talking about.

The idea is that we can elevate the conversation and focus on the differences between what the jargon specifies and the topic at hand. This is abstraction at the logical level, where we can skip over a lot of details and still convey intent with high accuracy.

Being able to discuss something at this level is hugely important because we can convey complex ideas easily. Once I managed to put what he was suggesting in context that I could understand, we were able to discuss the pros and cons of this data structure for the scenario.

I do appreciate that the conversation basically stopped making sense to anyone who isn’t already well-versed in the topic as soon as we were able to (from my perspective) clearly and effectively communicate.

“When I use a word,” Humpty Dumpty said in rather a scornful tone, “it means just what I choose it to mean — neither more nor less.”

Clarity of communication is a really important aspect of software engineering. Being able to explain, hopefully in a coherent fashion, why the software is built the way it is and why the code is structured just so can be really complex. Leaning on existing knowledge and understanding can make that a lot simpler.

There is also another aspect. When using jargon like that, it is clear when you don’t know something. You can go and research it. The mere fact that you can’t understand the text tells you both that you are missing information and where you can find it.

For software, you need to consider two scenarios. Writing code today and explaining how it works to your colleagues, and looking at code that you wrote ten years ago and trying to figure out what was going on there.

In both cases, I think that this sort of approach is a really useful way to convey information.

time to read 2 min | 326 words

Take a look at this wonderful example of foresightedness (or hubris).

In a little over ten years, Let’s Encrypt root certificates are going to expire. There are already established procedures for how to handle this from other Certificate Authorities, and I assume that there will be a well-communicated plan for this in advance.

That said, I’m writing this blog post primarily because I want to put the URL in the notes for the meeting above. Because in 10 years, I’m pretty certain that I won’t be able to recall why this is such a concerning event for us.

RavenDB uses certificates for authentication, usually generated via Let’s Encrypt. Since those certificates expire every 3 months, they are continuously replaced. When we talk about trust between different RavenDB instances, that can cause a problem. If the certificate changes every 3 months, how can I trust it?

RavenDB trusts a certificate directly, as well as any later version of that certificate assuming that the leaf certificate has the same key and that they have at least one shared signer. That is to handle the scenario where you replace the intermediate certificate (you can go up to the root certificate for trust at that point).
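
As a rough sketch (this is not RavenDB’s actual implementation, just the rule described above expressed with Node’s built-in crypto module), the check for “same leaf key and at least one shared signer” could look like this:


const { X509Certificate } = require('node:crypto');

// trustedPem/candidatePem are the leaf certificates; the *ChainPems arrays hold
// the intermediate and root certificates for each chain.
function stillTrusted(trustedPem, candidatePem, trustedChainPems, candidateChainPems) {
    const trusted = new X509Certificate(trustedPem);
    const candidate = new X509Certificate(candidatePem);

    // Same public key on the leaf certificate?
    const sameKey = trusted.publicKey
        .export({ type: 'spki', format: 'der' })
        .equals(candidate.publicKey.export({ type: 'spki', format: 'der' }));

    // At least one signer shared between the two chains?
    const trustedSigners = new Set(
        trustedChainPems.map(pem => new X509Certificate(pem).fingerprint256));
    const sharedSigner = candidateChainPems.some(
        pem => trustedSigners.has(new X509Certificate(pem).fingerprint256));

    return sameKey && sharedSigner;
}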

Depending on the exact manner in which the root certificate will be replaced, we need to verify that RavenDB is properly handling this update process. This meeting is set for over a year before the due date, which should give us more than enough time to handle this.

Right now, if they are using the same key on the new root certificate, it will just work as expected. If they opt for cross-signing with another root certificate, we need to ensure that we can verify the signatures on both chains. That is hard to plan for because things change.

In short, future Oren, be sure to double-check this in time.

time to read 22 min | 4283 words

Our task today is to request (and obtain approval for) a vacation. But before we can make that request, we need to handle the challenge of building the vacation requesting system. Along the way, I want to focus a little bit on how to deal with some of the technical issues that may arise, such as concurrency.

In most organizations, the actual details of managing employee vacations are a mess of a truly complicated series of internal policies, labor laws, and individual contracts. For this post, I’m going to ignore all of that in favor of a much simplified workflow.

An employee may Request a Vacation, which will need to be approved by their manager. For the purpose of discussion, we’ll ignore all other aspects and set out to figure out how we can create a backend for this system.

I’m going to use a relational database as the backend for now, using the following schema. Note that this is obviously a highly simplified model, ignoring many real-world requirements. But this is sufficient to talk about the actual issue.
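
The schema diagram isn’t reproduced here, but judging by the columns the code below uses, it looks roughly like this (the types, keys, and defaults are my assumptions):


CREATE TABLE Employees (
    id      serial PRIMARY KEY,
    name    text NOT NULL,
    manager int REFERENCES Employees(id)
);

CREATE TABLE VacationRequests (
    id       serial PRIMARY KEY,
    empId    int REFERENCES Employees(id),
    approver int REFERENCES Employees(id),
    reason   text,
    status   text NOT NULL DEFAULT 'Pending'
);

CREATE TABLE VacationRequestDates (
    id        serial PRIMARY KEY,
    vacReqId  int REFERENCES VacationRequests(id),
    date      date NOT NULL,
    mandatory boolean,
    notes     text
);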

After looking at the table structure, let’s look at the code (again, ignoring data validation, error handling, and other rather important concerns).


app.post('/api/vacations/request', async (req, res) => {
    const { employeeId, dates, reason } = req.body;

    await pgsql.query(`BEGIN TRANSACTION;`);

    // Find the employee's manager, who will need to approve the request
    const managerId = (await pgsql.query(
      `SELECT manager FROM Employees WHERE id = $1;`,
      [employeeId])).rows[0].manager;

    const vacReqId = (await pgsql.query(
      `INSERT INTO VacationRequests (empId, approver, reason, status)
       VALUES ($1, $2, $3, 'Pending') RETURNING id;`,
       [employeeId, managerId, reason])).rows[0].id;

    // One row per requested vacation date
    for (const d of dates) {
        await pgsql.query(
          `INSERT INTO VacationRequestDates
           (vacReqId, date, mandatory, notes)
           VALUES ($1, $2, $3, $4);`,
          [vacReqId, d.date, d.mandatory, d.notes]);
    }

    await pgsql.query(`COMMIT;`);

    res.status(201).json({ requestId: vacReqId });
});

We create a new transaction, find who the manager for the employee is, and register a new VacationRequest for the employee with all the dates for that vacation. Pretty simple and easy, right? Let’s look at the other side of this, approving a request.

Here is how a manager is able to get the vacation dates that they need to approve for their employees.


app.get('/api/vacations/approval', async (req, res) => {
  const { whoAmI } = req.query;

  const vacations = (await pgsql.query(
    `SELECT VRD.id, VR.empId, VR.reason, VRD.date, E.name,
           VRD.mandatory, VRD.notes
    FROM VacationRequests VR
    JOIN VacationRequestDates VRD ON VR.id = VRD.vacReqId
    JOIN Employees E ON VR.empId = E.id
    WHERE VR.approver = $1 AND VR.status = 'Pending'`,
    [whoAmI])).rows;

  res.status(200).json({ vacations });
});

As you can see, most of the code here consists of the SQL query itself. We join the three tables to find the dates that still require approval.

I’ll stop here for a second and let you look at the two previous pieces of code for a bit. I have to say, even though I’m writing this code specifically to point out the problems, I had to force myself not to delete it. There was mental pressure behind my eyes as I wrote those lines.

The issue isn’t a problem with a lack of error handling or security. I’m explicitly ignoring that for this sort of demo code. The actual problem that bugs me so much is modeling and behavior.

Let’s look at the output of the previous snippet, returning the vacation dates that we still need to approve.

id   | empId | name | reason   | date
8483 | 391   | John | birthday | 2024-08-01
8484 | 321   | Jane | dentist  | 2024-08-02
8484 | 391   | John | birthday | 2024-08-02

We have three separate entries that we need to approve, but notice that even though two of those vacation dates belong to the same employee (and are part of the same vacation request), they can be approved separately. In fact, it is likely that the manager will decide to approve John for the 1st of August and Jane for the 2nd, denying John’s second vacation day. However, that isn’t how it works. Since the actual approval is for the entire vacation request, approving one row in the table would approve all the related dates.

When examining the model at the row level, it doesn’t really work. The fact that the data is spread over multiple tables in the database is an immaterial issue related to the impedance mismatch between the document model and the relational model.

Let’s try and see if we can structure the query in a way that would make better sense from our perspective. Here is the new query (the rest of the code remains the same as the previous snippet).


SELECT VR.id, VR.empId, E.name, VR.reason, VR.status,
    (
        SELECT json_agg(VRD)
        FROM VacationRequestDates VRD
        WHERE VR.id = VRD.vacReqId
    ) AS dates
FROM VacationRequests VR
JOIN Employees E ON VR.empId = E.id
WHERE VR.approver = $1 AND VR.status = 'Pending'

This is a little bit more complicated, and the output it gives is quite different. If we show the data in the same way as before, it is much easier to see that there is a single vacation request and that those dates are tied together.

id   | empId | name | reason   | status  | dates
8483 | 391   | John | birthday | Pending | 2024-08-01 and 2024-08-02
8484 | 321   | Jane | dentist  | Pending | 2024-08-02

We are going to ignore the scenario of partial approval because it doesn’t matter for the topic I’m trying to cover. Let’s discuss two other important features that we need to handle: how do we allow an employee to edit a vacation request, and how does the manager actually approve a request?

Let’s consider editing a vacation request by the employee. On the face of it, it’s pretty simple. We show the vacation request to the employee and add the following endpoint to handle the update.


app.post('/api/vacation-request/date', async (req, res) => {
  const { id, date, mandatory, notes, vacReqId } = req.body;

  if (typeof id === 'number') {
    // Existing date row - update it
    await pgsql.query(
      `UPDATE VacationRequestDates
       SET date = $1, mandatory = $2, notes = $3
       WHERE id = $4`,
      [date, mandatory, notes, id]);
  }
  else {
    // No id - this is a new date for the request
    await pgsql.query(
      `INSERT INTO VacationRequestDates (date, mandatory, notes, vacReqId)
       VALUES ($1, $2, $3, $4)`,
      [date, mandatory, notes, vacReqId]);
  }

  res.status(200).end();
});


app.delete('/api/vacation-request/date', async (req, res) => {
  const { id } = req.query;

  await pgsql.query(
    `DELETE FROM VacationRequestDates WHERE id = $1`,
    [id]);

  res.status(200).end();
});

Again, this sort of code is like nails on a chalkboard inside my head. I’ll explain why in just a bit. For now, you can see that we actually need to handle three separate scenarios: editing an existing request date, adding a new one, or deleting it. I’m not showing the code for updating the actual vacation request (such as the reason for the request) since that is pretty similar to the above snippet.

The reason that this approach bugs me so much is because it violates transaction boundaries within the solution. Let’s assume that I want to take Thursday off instead of Wednesday and add Friday as well. How would that be executed using the current API?

I would need to send a request to update the date on one row in VacationRequestDates and another to add a new one. Each one of those operations would be its own independent transaction. That means that either one can fail. While I wanted to have both Thursday and Friday off, only the request for Friday may succeed, and the edit from Wednesday to Thursday might not.

It also means that the approver may see a partial state of things, leading to an interesting problem and eventually an exploitable loophole in the system. Consider the scenario of the approver looking at vacation requests and approving them. I can arrange things so that while they are viewing the request, the employee will add additional dates. When the approver approves the request, they’ll also approve the additional dates, unknowingly.

Let’s solve the problem with the transactional updates on the vacation request and see where that takes us:


app.post('/api/vacation-request/update', async (req, res) => {
  const { vacReqId, datesUpdates } = req.body;
  await pgsql.query(`BEGIN TRANSACTION;`);

  for (const { op, id, date, mandatory, notes } of datesUpdates) {
    if (op === 'delete') {
      await pgsql.query(`DELETE FROM VacationRequestDates
        WHERE id = $1;`,
        [id]);
    }
    else if (op === 'insert') {
      await pgsql.query(`INSERT INTO VacationRequestDates
        (vacReqId, date, mandatory, notes)
        VALUES ($1, $2, $3, $4);`,
        [vacReqId, date, mandatory, notes]);
    }
    else {
      await pgsql.query(`UPDATE VacationRequestDates
        SET date = $1, mandatory = $2, notes = $3
        WHERE id = $4;`,
        [date, mandatory, notes, id]);
    }
  }

  await pgsql.query(`COMMIT;`);
  res.status(200).end();
});

That is… a lot of code to go through. Note that I looked into Sequelize as well to see what kind of code it would produce when using an OR/M; it wasn’t meaningfully simpler.

There is a hidden bug in the code above. But you probably won’t notice it no matter how much you look at it. The issue is code that isn’t there. The API code above assumes that the caller will send us all the dates for the vacation request, but it is easy to get into a situation where we edit the same vacation request from both the phone and the laptop, and get partial information.

In other words, our vacation request on the database has four dates, but I just updated three of them. The last one is part of my vacation request, but since I didn’t explicitly refer to that, the code above will ignore that. The end result is probably an inconsistent state.

In other words, to reduce the impedance mismatch between my database and the way I work with the user, I leaned too much toward exposing the database to the callers. The fact that the underlying database is storing the data in multiple tables has leaked into the way I model my user interface and the wire API. That leads to a significant amount of complexity.

Let’s go back to the drawing board. Instead of trying to model the data as a set of rows that would be visually represented as a single unit, we need to actually think about a vacation request as a single unit.

Take a look at this image, showing a vacation request form. That is how the business conceptualizes the problem: as a single cohesive unit encompassing all the necessary data for submitting and approving a vacation request.

Note that for real systems, we’ll require a lot more data, including details such as the actual vacation days taken, how they should be recorded against the employee’s leave allowance, etc.

The important aspect here is that instead of working with individual rows, we need to raise the bar and move to working with the entity as a whole. In modeling terms, this means that we won’t work with rows but with Root Aggregate (from DDD terminology).
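
To make that concrete, here is a rough illustration of a vacation request treated as a single aggregate, for instance as one document (the field names and values here are made up for illustration; the actual modeling is the topic of the next post):


{
    "EmployeeId": "employees/391",
    "Approver": "employees/9341",
    "Reason": "birthday",
    "Status": "Pending",
    "Dates": [
        { "Date": "2024-08-01", "Mandatory": true,  "Notes": null },
        { "Date": "2024-08-02", "Mandatory": false, "Notes": null }
    ]
}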

But I already have all of this code written, so let’s see how far I can push things before I even hit my own limits. Let’s look at the code that is required to approve a vacation request. Here is the first draft I wrote to do so.


app.post('/api/vacation-request/approve', async (req, res) => {
  const { vacReqId, approver, status } = req.body;

  const result = await pgsql.query(`UPDATE VacationRequests
    SET status = $1 WHERE id = $2 and approver = $3;`,
    [status, vacReqId, approver]);

  if (result.rowCount == 0) {
    return res.status(400)
      .send({ error: 'No record found or wrong approver' });
  }

  res.status(200).end();
});

Which will give me the vacation requests that I need to approve:

id   | empId | name | reason   | status  | dates
8483 | 391   | John | birthday | Pending | 2024-08-01 and 2024-08-02

And then I actually approve it using:


POST /api/vacation-request/approve
{"vacReqId": 8483, "approver": 9341, "status": "Approved"}

What is the problem now? Well, what happens if the employee modifies the vacation request between the two requests? The approver may end up approving the wrong details. How do we fix that?

You may think that you can use locking on the approve operation, but we actually have just a single statement executed, so that doesn’t matter. And given that we have two separate requests, with distinct database transactions between them, that isn’t even possible.

What we need to implement here is called Offline Optimistic Concurrency. In other words, we need to ensure that the version the manager approved is the same as the one that is currently in the database.

In order to do that, we need to modify our schema and add a version column to the VacationRequests table, as you can see in the image.
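
Since the image isn’t shown here, the schema change amounts to something like the following (the exact type and default are assumptions):


ALTER TABLE VacationRequests
    ADD COLUMN version int NOT NULL DEFAULT 0;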

Now, any time that I make any modification on the VacationRequest, I must also increment the value of the Version field and check that it matches my expected value.

Here is an example of what this looks like when the employee is adding a new date to the vacation request. I shortened the code we previously looked at to update a vacation request, so you can more clearly see the changes required to detect concurrent modifications between requests.


app.post('/api/vacation-request/insert-date', async (req, res) => {
  const { vacReqId, version, date, mandatory, notes } = req.body;
  await pgsql.query(`BEGIN TRANSACTION;`);

  // Bump the version, but only if the caller's version matches the current one
  const result = await pgsql.query(`UPDATE VacationRequests
    SET version = version + 1
    WHERE id = $1 and version = $2;`,
    [vacReqId, version]);

  if (result.rowCount == 0) {
    await pgsql.query(`ROLLBACK;`);
    return res.status(400)
      .send({ error: 'No record found or wrong version' });
  }

  await pgsql.query(`INSERT INTO VacationRequestDates
        (vacReqId, date, mandatory, notes)
        VALUES ($1, $2, $3, $4);`,
    [vacReqId, date, mandatory, notes]);

  await pgsql.query(`COMMIT;`);
  res.status(200).end();
});

And on the other side, approving the request is now:


app.post('/api/vacation-request/approve', async (req, res) => {
  const { vacReqId, approver, version, status } = req.body;

  const result = await pgsql.query(`UPDATE VacationRequests
    SET status = $1, version = version + 1
    WHERE id = $2 and approver = $3 and version = $4;`,
    [status, vacReqId, approver, version]);

  if (result.rowCount == 0) {
    return res.status(400)
      .send({
         error: 'No record found or wrong approver or version'
       });
  }

  res.status(200).end();
});

We need to send the version to the client when we read it, and when we approve it, we need to ensure that we send the version back, to verify that there have been no changes.
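
For that to work, the read side has to return the version as well. As a sketch, the earlier approval query only needs to also select it, so the client can echo it back on approve:


SELECT VR.id, VR.version, VR.empId, E.name, VR.reason, VR.status,
    (
        SELECT json_agg(VRD)
        FROM VacationRequestDates VRD
        WHERE VR.id = VRD.vacReqId
    ) AS dates
FROM VacationRequests VR
JOIN Employees E ON VR.empId = E.id
WHERE VR.approver = $1 AND VR.status = 'Pending'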

I have to say, given that I set out to do something pretty simple, I’m actually shocked at how complex this all turned out to be. The solution above also requires cooperation from all the code involved. If I’m ever writing some code that modifies the vacation requests or manages them manually (for maintenance purposes, debugging, etc.), I also need to remember to include the version updates.

When I started writing this blog post, I intended to also show you how you can model the same situation differently. But I think that this is quite long enough already, and I’ll complete the proper modeling concerns in the next post.

time to read 1 min | 103 words

A couple of months ago I had the joy of giving an internal lecture to our developer group about Voron, RavenDB’s dedicated storage engine. In the lecture, I’m going over the design and implementation of our storage engine.

If you ever had an interest in how RavenDB’s transactional and high-performance storage works, that is the lecture for you. Note that this is aimed at our developers, so we are going deep.

You can find the slides here and here is the full video.

time to read 1 min | 99 words

One of the most fun things that I do at work is share knowledge about how various things work. A few months ago I talked internally about how Certificates work. Instead of just describing the mechanism of that, I decided to actually walk our developers through the process of building the certificate infrastructure from scratch.

You can find the slides here, and the full video is available online; it’s just over an hour of both lecture and discussion.

time to read 1 min | 101 words

When Oren Eini originally developed RavenDB, he used the Lucene library to implement indexing. Eventually, his team encountered limitations with this strategy, so they created the Corax search engine, which improved query execution time significantly. Oren discusses the challenges involved in creating this engine and the approaches they took to overcome these challenges.

Part 1:

Part 2:

time to read 2 min | 399 words

One of the interesting components of RavenDB Cloud is status reporting. It turns out that when you offer X as a Service, people really care about your operational status.

For RavenDB Cloud, we have https://status.ravendb.net/, which will give you some insights into the overall health of the system. Here are some details from the status page:


The interesting thing about this page is that it shows global status, indicating issues affecting large swaths of users. For instance, Azure having issues in a whole region, as in the image above, is a great example of one such scenario. Regular maintenance, which we carry out over the span of days, is something that we report, but you’ll usually never notice it (due to the High Availability features of RavenDB).

It gets more complicated when we start talking about individual instances. There are many scenarios where the overall system health is great, but a particular database may suffer. The easiest example is if you run out of disk space. That affects that particular instance only.

For that scenario, we are reporting Production Monitoring Alerts within the RavenDB Cloud portal. Here is what this looks like:


As you can see, we report specific problems on those instances, raising that to your awareness. That was actually needed because, for the most part, RavenDB itself handles those sorts of things via High Availability, which means that even if there are issues, you’re likely to not feel them for a while.

Resilience at the cluster level means that even pretty severe problems are papered over and the system moves on. But there is only so much limping that you can do. If you are running at the bare edge of capacity, eventually you’ll trip over the line.

Those Production Monitoring Alerts allow you to detect and act upon those issues when they happen, not when they bring down production.

This aligns with our vision for RavenDB, the kind of system where you don’t need to have a full-time babysitter monitoring the system. Instead, if there is a problem that the database cannot solve on its own, it will explicitly notify you, in advance.

That leads to a system that is far healthier all around and means that you can focus on building your system, rather than managing database minutiae.

time to read 6 min | 1025 words

When we started working on Corax (10 years ago!), we had a pretty simple mission statement for that: “Lucene, but 10 times faster for our use case”. When we actually started implementing this in code (early 2020), we had a few more rules about the direction we wanted to take.

Corax had to be faster than Lucene in all scenarios, and 10 times faster for common indexing and querying scenarios. Corax’s design is meant for online indexing, not batch-oriented indexing like Lucene’s. We favor moving work to indexing time and ensuring that our data structures on disk can be used as-is, with no additional processing.

Lucene was created at a time when data size was much smaller and disks were far more expensive. It shows in the overall design in many ways, but one of the critical aspects is that the file design for Lucene is compressed, meaning that you need to read the data, decode that into the in-memory data structure, and then process it.

For RavenDB’s use case, that turned out to be a serious problem. In particular, the issue of cold queries, where you query the database for the first time and have to pay the initialization cost, was particularly difficult. Now, cold queries aren’t really that interesting from a benchmark perspective; you have to warm things up in any software (caches are everywhere, from your disk to your CPU). I like to say that even memory has caches (yes, plural: L1, L2, L3) because it is so slow.

With Lucene’s design, however, whenever it runs an indexing batch, it creates a new file, and to start querying after that means that you have a “cold start” for that file. Usually, those files are small, but every now and then Lucene needs to merge several files together and then we have to pay the cold start price for a large amount of data.

The issue is that this sometimes introduces a high latency spike (hitting us in the P999 targets), which is really hard to smooth over. We spent a lot of time and engineering resources ensuring that this doesn’t have a big impact on our users.

One of the design goals for Corax was to ensure that this doesn’t happen. That we are able to get consistent performance from the system without periodic maintenance tasks. That led us to a very different internal design. The persistent data structures that we use are meant to be used as is, without initial processing.

Everything has a cost, and in this case, it means that the size of Corax on disk is typically somewhat larger than Lucene. The big advantage is that the amount of memory being used by Corax tends to be significantly lower. And in today’s world, disks are far cheaper than memory. Corax’s cold start time is orders of magnitude faster than Lucene’s cold start time.

It turns out that there is a huge impact in another scenario as well, completely unexpected. We continuously run performance tests on our system, and we got some ridiculous results when testing query performance using encrypted databases.

When you use encryption at rest, RavenDB ensures that the only time that your data is decrypted is when there is an active transaction using the data. In other words, even in-memory buffers are encrypted. That applies to documents as well as indexes. It does not apply to the in-memory data that Lucene holds in its cache, though. For Corax, however, all of its state is encrypted.

When we run our benchmark on encrypted database queries, we expect to see either roughly the same performance between Corax and Lucene or see Lucene edging out Corax in this scenario, since it can use its cache without paying decryption costs.

Instead, we got really puzzling results. I tried showing them in bar chart format, but I literally couldn’t make the data fit in a reasonable size. The scenario is testing queries on an encrypted database, using an m5.xlarge instance on AWS. We are hitting the server with 500 queries/second, and testing for the 99.99 percentile performance.

Indexing Engine | 99.99% percentile (ms) | 99.99% percentile (seconds)
Lucene          | 40,210                 | 40.21
Corax           | 186                    | 0.18

Take a look at those numbers! Somehow Corax is absolutely smoking Lucene’s lunch. And I was quite surprised about that. I mean, I’m happy, I guess, that the indexing engine we spent so much time on is doing this well, but any time that we see a performance number that we cannot explain we need to figure out what is going on.

Here is the profiler output for this benchmark, using Lucene.

As you can see, the vast majority of the time is spent decrypting pages. And we are decrypting pages belonging to a stream. Those are the Lucene files, stored (encrypted in this case) inside of Voron. The issue is that the access pattern that Lucene is using forces us to touch large parts of the file. It usually reads a very small portion each time, but in various locations. Given that the data is encrypted, we have to decrypt each of those locations.

Corax, on the other hand, lays out its persistent data structures in such a way that we only need to access the specific pages we are interested in. That means that, in terms of the number of pages touched for this particular scenario, Lucene touches a lot more than Corax. You’ll usually not notice that, since Voron (our storage engine) is memory mapped and those accesses are cheap. When using encrypted storage, however, we need to decrypt the data first, so the difference became very noticeable.

It’s interesting to note that this also applies when there is memory pressure involved. Corax tends to touch a lot less memory and have a smaller working set, while Lucene will generate more page faults.

Really interesting results, and I’m both happy and amused that totally different design decisions have led to such a big impact in this scenario. In short, Corax is fast, really fast, and in many more scenarios than we initially thought.
