Ayende @ Rahien

filter by tags archive

architecture (612) rss
bugs (451) rss
challanges (123) rss
community (379) rss
databases (481) rss
design (895) rss
development (642) rss
hibernating-practices (71) rss
miscellaneous (592) rss
performance (397) rss
programming (1085) rss
raven (1449) rss
ravendb.net (533) rss
reviews (184) rss

2025
- June (6)
- May (10)
- April (10)
- March (10)
- February (7)
- January (12)
2024
- December (3)
- November (2)
- October (1)
- September (3)
- August (5)
- July (10)
- June (4)
- May (6)
- April (2)
- March (8)
- February (2)
- January (14)
2023
- December (4)
- October (4)
- September (6)
- August (12)
- July (5)
- June (15)
- May (3)
- April (11)
- March (5)
- February (5)
- January (8)
2022
- December (5)
- November (7)
- October (7)
- September (9)
- August (10)
- July (15)
- June (12)
- May (9)
- April (14)
- March (15)
- February (13)
- January (16)
2021
- December (23)
- November (20)
- October (16)
- September (6)
- August (16)
- July (11)
- June (16)
- May (4)
- April (10)
- March (11)
- February (15)
- January (14)
2020
- December (10)
- November (13)
- October (15)
- September (6)
- August (9)
- July (9)
- June (17)
- May (15)
- April (14)
- March (21)
- February (16)
- January (13)
2019
- December (17)
- November (14)
- October (16)
- September (10)
- August (8)
- July (16)
- June (11)
- May (13)
- April (18)
- March (12)
- February (19)
- January (23)
2018
- December (15)
- November (14)
- October (19)
- September (18)
- August (23)
- July (20)
- June (20)
- May (23)
- April (15)
- March (23)
- February (19)
- January (23)
2017
- December (21)
- November (24)
- October (22)
- September (21)
- August (23)
- July (21)
- June (24)
- May (21)
- April (21)
- March (23)
- February (20)
- January (23)
2016
- December (17)
- November (18)
- October (22)
- September (18)
- August (23)
- July (22)
- June (17)
- May (24)
- April (16)
- March (16)
- February (21)
- January (21)
2015
- December (5)
- November (10)
- October (9)
- September (17)
- August (20)
- July (17)
- June (4)
- May (12)
- April (9)
- March (8)
- February (25)
- January (17)
2014
- December (22)
- November (19)
- October (21)
- September (37)
- August (24)
- July (23)
- June (13)
- May (19)
- April (24)
- March (23)
- February (21)
- January (24)
2013
- December (23)
- November (29)
- October (27)
- September (26)
- August (24)
- July (24)
- June (23)
- May (25)
- April (26)
- March (24)
- February (24)
- January (21)
2012
- December (19)
- November (22)
- October (27)
- September (24)
- August (30)
- July (23)
- June (25)
- May (23)
- April (25)
- March (25)
- February (28)
- January (24)
2011
- December (17)
- November (14)
- October (24)
- September (28)
- August (27)
- July (30)
- June (19)
- May (16)
- April (30)
- March (23)
- February (11)
- January (26)
2010
- December (29)
- November (28)
- October (35)
- September (33)
- August (44)
- July (17)
- June (20)
- May (53)
- April (29)
- March (35)
- February (33)
- January (36)
2009
- December (37)
- November (35)
- October (53)
- September (60)
- August (66)
- July (29)
- June (24)
- May (52)
- April (63)
- March (35)
- February (53)
- January (50)
2008
- December (58)
- November (65)
- October (46)
- September (48)
- August (96)
- July (87)
- June (45)
- May (51)
- April (52)
- March (70)
- February (43)
- January (49)
2007
- December (100)
- November (52)
- October (109)
- September (68)
- August (80)
- July (56)
- June (150)
- May (115)
- April (73)
- March (124)
- February (102)
- January (68)
2006
- December (95)
- November (53)
- October (120)
- September (57)
- August (88)
- July (54)
- June (103)
- May (89)
- April (84)
- March (143)
- February (78)
- January (64)
2005
- December (70)
- November (97)
- October (91)
- September (61)
- August (74)
- July (92)
- June (100)
- May (53)
- April (42)
- March (41)
- February (84)
- January (31)
2004
- December (49)
- November (26)
- October (26)
- September (6)
- April (10)

Jun 07 2016

The Guts n’ Glory of Database Internals: Persisting information

time to read 6 min | 1099 words

Tweet Share Share 12 comments

Tags:

Reading the comments in this Reddit thread gave me a bit of a pause. Mostly because of the level of knowledge about what I consider basic stuff. Then again, I can’t do any UI that doesn’t involve the <table/> tag, so I guess that make sense. So I decided to do something about this lack of knowledge. In this series, I’m going to go over just those kind of details that I consider essential to understand how databases actually work.

Imagine that you had a program that needed to persist some state. Let us say that we need to keep track of users and how often they log into the system. You would typically use your database of choice to do so, and you’ll be flying away in no time.

But for this series, we are assuming that no such software exist, so we have to roll our own persistence solution. The simplest solution for this that is to simply store the data as a CSV file. Here is what we end up doing:

Now, this has several advantages, simplicity being a key one, but it also suffer from pretty much every possible disadvantage.

Let us consider the following simple scenarios:

I want to update the last login date
I want to update the user name
I want to search by user name
I want to get records sorted by last login date
I want to add a record
I want to delete a record

Oh, and I want to do all of the above when I have a million records.

CSV is a great thing, if the number of records we have is small. To be fair, on modern hardware, a million records is small, but we’ll ignore this for reasons that will be explained in the following posts.

Let us look at how the file actually looks like, shall we?

Now, if we want to update the last login date for Oren, that is pretty easy. We can do that using the following sequence of operations:

I’m intentionally using C API here, because that is what actually is going on, and we need to understand and work with that.

Notice a few things that are going on in here. We open the file, seek to the date’s position on the first data row, and then we update the date with a value that has the same size. Finally, we close the file.

Are mission has met success! Let all retire somewhere for breakfast (I’m writing this at 5:20 AM after a “night” that had no darkness while being in NDC Oslo. Looking out the window I see clear skies, great visibility and absolutely no people. It feels like the opening scene in a Zombie movie.

Anyway, before breakfast can commence, let us try to update the user name. Arava, while being an awesome German Shepherdess, has chosen a bad username. No matter what I call her when she chew a tennis ball in the bed again. Let us see what we’ll need to update her username to the more neutral “dog”.

Now, you might notice that I’m somewhat cheating here. The problem is that the size of the values are different, but there is no real way for us to just cut some part of the file, so I’m abusing the fact that the CSV parser will ignore any whitespace at the beginning of the value to “trim” the value.

But what happen if I didn’t want to shorten the username? What if I wanted the username to be shepherdess? Well, in this case, I wouldn’t be able to do that. I don’t have enough space, and if I tried, I would overwrite the next record, and probably corrupt the whole file.

Bummer.

Typically, at this point we have to abandon the simplicity of a CSV file and move over to a slightly better format. In this case, we can go with fixed size records. In other words, we’ll define the maximum size for each field. Here is what this looks like:

This format is harder for humans to generate, because if we do this manually we have to count spaces, etc. But this is actually much easier for us to work with. We have the following format:

Full Name – Max 16 chars
User Name – Max 12 chars
Last Login – Exactly 19 chars

Computing where to write the last login date for a particular record is now simply the following math:

(16 + 12 + 19 + 2) * NumberOfRecords + 16 + 12 + 42

42 is the length in bytes of the two first rows. And the +2 is for the line terminator.

So far so good, now we have a file format, and the ability to work with it. Which is great, and even quite convenient for us. But it isn’t really suitable for doing anything but full scans of the file. Here is the logic for searching a record by username:

By the way, I’m doing this in C also because it is fun, it has been a while since I did C, so it might be wrong.

As you can see, the cost of actually doing any sort of operation in this format is simple, O(N). That isn’t going to work for us.

In the next post in this series, I’m going to talk about indexes, and how they work. After that, we will start talking about the actual on disk format, and what it is impacted by.

Jun 06 2016

The high level interview question

time to read 3 min | 410 words

Tweet Share Share 24 comments

Tags:

The following is likely to end up in the list of questions we’ll ask candidates to answer when they apply to Hibernating Rhinos.

Imagine a sharded database. A sharded database is one where the data is split among multiple nodes. To make things simple, we will assume that each datum in the database has a 64 bits key associated with it, and we are interested in distributing the information evenly among the nodes. This can be done using Jump Consistent Hashing (see paper for details), and can be implemented using the following simple C# function:

This function is responsible for taking a key and (given how many nodes there are in the cluster) provide which node this key resides on.

So far, so good, and this make quite a lot of things much simpler. This function ensures that roughly 1/N of the data items in the databases will require movement to a new node when it is introduced. Which is pretty much exactly what we want in a sharded environment. However, this function doesn’t help us figure out what to move.

Assume that we already have a database that has 5 nodes, and ten billion data items spread across all 5 nodes, spread according to the consistent jump function. Because of load, we need to add additional 2 nodes to the cluster, and we need to move 2/7 (2.8 billion data items) of the cluster data to the new nodes. However, since moving the data items alone is going to be costly, we want to avoid scanning through all 10 billion items in all nodes to figure out which ones we need to ship to the new nodes, and which ones should remain with the current node.

Suggest a way that will allow the database to find out which data items need to be moved to the new nodes, without having to scan through all of them. In other words, anything that requires O(number of items in each node) is out.

You are rated on the impact of your suggestion on the overall cluster behavior. The cheaper your option, the better. You are free to store additional information (at cluster creation / modification, as data items are entered into the cluster / deleted, etc) if this can help you, but be aware that any impact on standard operations (reads & writes) should be minimal and well justified.

You only need to consider adding nodes to the cluster, removing nodes from the cluster is not required.

Jun 02 2016

Proposed solution to the low level interview question

time to read 3 min | 578 words

Tweet Share Share 2 comments

Tags:

For the actual question, see the original post.

So the first thing that we need to decide is what will be the data format on the tire. Since we have only 32KB to work with, we need to consider the actual storage.

32KB is small enough to fit in a unsigned short, so all the references we’ll used will be shorts. We also need to store a bit of metadata, so we’ll use the first 4 bytes as the header for just that.

ushort SpaceUsed;
ushort LastAllocation;

Now that we have this, we need to decide how to store the actual data. To make things easy, we are going to define the following way to allocate memory:

This is about the simplest way that you can go about doing things, note that we use a length prefix value, and we limit allocations to a max of 127 bytes each. We use a negative size to indicate a delete marker.

So basically, now we have a pretty trivial way to allocate memory, and we can implement the trie as we would normally do. There are a few wrinkles, however.

Deleting the memory doesn’t actually make it eligible for reuse, and it is quite likely to get fragmented easily. In order to handle that, we will track the amount of space that is used, and if we got to the end of the space, we’ll check the UsedSpace value. If this is still too little, we can abort, there is no available space here. However, if we go to the end of the buffer, but we have free space available, we can do the following:

Scan the buffer for available spots (find available locations that have negative size).
Failing that, we will copy the data to a temporary buffer, then re-add everything to the buffer from scratch. In other words, we defrag it.

Another issue we have is that the maximum size we can allocate is 127. This value is big enough so most actual strings can fit into it nicely, but a trie already has the property that a large string might be broken into pieces, we’ll just cut each node in the trie to a max size of 127. Actually, the max size is likely to be less than that, because there is also some information that we need to keep track per entry.

byte NumberOfChildren;
byte Flags; // node type, (internal, leaf or both)
ushort ChildrenPosition;

So in practice we have about 123 bytes to work with for the length. Note that we don’t store the string value of the node’s length (we can get that from the allocation information), and that we store the actual children in an array that is stored separately. This allows us to easily add items to the trie as child nodes. If the node is a leaf node, we also need to store the actual value (which is 8 bytes), we store that information at the end of the value (giving us 115 bytes for that section of the value).

All in all, there is going to be a bit of pointer arithmetic and bit counting, but is likely to be a pretty simple implementation.

Note that additional optimizations would be to try align everything so it would fit into a cache line, trying to place nodes near their children (which are more likely to be followed), etc.

Jun 01 2016

RavenDB Conference 2016–Slides

time to read 1 min | 145 words

Tweet Share Share 1 comments

Tags:

The slides from the RavenDB Conference are now available, we’ll post videos in a few weeks, once we are done post processing them.

Day 1:

Day 2:

May 31 2016

How to waste CPU and kill your disk by scaling 100 million inefficiently

time to read 5 min | 989 words

Tweet Share Share 22 comments

Tags:

I recently run into this blog post Scaling to 100M: MySQL is a Better NoSQL (from about 6 months ago) and cringed, hard. Go ahead and read it, I’ll wait. There are so much stuff going on here that I disagree with that I barely even know where to start.

I think that what annoys me the most about this post is that it attempts to explain a decision, but does that in a way that clearly shows a lack of depth in the decision making process.

I absolutely agree on the first section, you shouldn’t make your database choice based on hype, or by whatever it is “everyone” is doing. But saying that “if everyone jumps off the roof…” is generally a bad argument to make when literally everyone jumps off the roof (maybe it is on fire, maybe it is 1 meter drop, maybe it has a pool to jump into, etc). If this sounds ridiculous, this is because it is.

In particular, I take offense at:

This post will explain why we’ve found that using MySQL for the key/value use case is better than most of the dedicated NoSQL engines, and provide guidelines to follow when using MySQL in this way.

Then they go to list some of their requirements. I’m assuming that you read the post, so I’ll answer it directly.

The dataset they are talking about is about 210GB, and is composed of about 100 million records. In other words, you can fit that entire thing to memory in an AWS instance such as d2.8xlarge, at a cost of about 1.5$ / hour for a 3 year plan. Read this again, their dataset can actually fit in memory.

And even with that, they report a rate of 200K request per minute, which is funny, because the typical metric is looking at requests per second. At which point we are talking about around 3,400 req/second. But they have three database servers, so we are probably talking about around a thousand requests per second overall.

Oh, and they report an average of 1 – 1.5 ms latency numbers. Leaving aside the fact that averages means nothing (a percentiles summary would work much better), that is a really long time to process a single request.

I really liked this one:

Our existing system has scaling / throughput / concurrency / latency figures that are impressive for any NoSQL engine.

No, it isn’t. Just to give you some idea, assuming even distribution of the data, each site entry is about 2KB in size, so their throughput numbers are less than 10 MB / second.

Now, let us talk about the ways that their approach is actually broken. To start with, they have statements such as this one:

Serial keys impose locks… …Also notice that we are not using serial keys; instead, we are using varchar(50), which stores client-generated GUID values—more about that in the next section.

I mean, okay, so you have no idea how to generate serial keys without requiring locks, because things like that are so hard. I can think of several ways without even trying hard (Snowflake, HiLo, ranges, guid.comb, to name just a few). Now, why would you want want to take the time to do something like this? Because using a GUID is… how shall we say it, a horrible idea!

GUIDs are not sorted, which means that you are inserting (at a high rate) a lot of entries to the table, which forces a lot of page splits, which results in a bigger and deeper B+Tree, which result in a higher cost to find records, which is what you were trying to prevent in the first place.

Allowing sequential inserts can improve your insert performance (and afterward, the query speed) by orders of magnitude. So that is most certainly something that you really want to invest the 30 minutes it takes to code a sequential number solution from scratch, if you can use the literally dozens of ready made solutions.

But the thing that is really takes the cake is the fact that all of their queries take the following form:

So a sub-select is required to run this query (which with most reasonable query optimizers will be exactly equivalent to the query plan of an inner join), but the usage of TEXT data in the site information will mean at least another disk seek (to load the actual value) after the relevant row was located.

Now, it is possible that MySQL was a good decision for their use case, but this is:

Not an optimal usage of MySQL in the first place.
Small data set, can fit on one machine, can actually fit into memory
Inflexible system, very hard to change (needing another queryable field is now a major operation)
Low overall performance

That last one is very important. Just to give you some idea, for the size that they are talking about, we can probably handle the full 200,000 request per minute that they are talking about on their three way cluster using a single machine, and doing that in one second.

Assuming that I’m trying to find a dedicated solution to the problem (trie for the routing, simple memory mapped storage for the actual site data, where the routing trie will contain the position of the data). Of course, you would be crazy to do that. Just having a speedy solution for this is not enough, you also need to handle all of the rest of the associated costs of a database (operations, metrics, backup/restore, replication, etc).

But the provided solution is just Not Good.

May 30 2016

The worker pattern

time to read 2 min | 271 words

Tweet Share Share 2 comments

Tags:

In RavenDB 4.0, we are writing a lot of code that need to do something, then react to something, then do something, etc.

For example, an index need to index documents until it runs out, then it wait for more documents, and when they arrive, it index them, and then wait, etc.

Here are two such examples (note that the code is written just to demonstrate a point):

In the first example, we see how we handle indexing. The outer loop runs as long as we the database runs, and then we index until we run out of stuff to do. When we run out, we’ll wait. During that time, the thread is paused, and unless a new document comes for the collections that this index covers, there is nothing that needs to be done.

In the other case, we are actually handling a web socket connection, so there are some differences, but for the most part, this is pretty much the same. We use an async event, and we need to keep the connection alive, so if we have nothing to do, we’ll wake up every 5 seconds and just write a new line to the socket, keeping it alive.

Nitpicker corner: this isn’t new, and it is about the simplest and most obvious concurrency strategy you can have.

But because it is the simplest, it also have some major advantages. The code does very little actual concurrency. We can reason about the code quite easily. We keep having to dumb down our code to similar patterns, because this is much easier to maintain over the long run.

May 27 2016

The low level interview question

time to read 2 min | 234 words

Tweet Share Share 9 comments

Tags:

challanges

The following is likely to end up in the list of questions we’ll ask candidates to answer when they apply to Hibernating Rhinos.

We need to store information about keys and their location on disk. We need to implement a trie. We want the trie to store int64 size values and unbounded UTF8 string keys.

Given that, we have the following interface:

We need to implement that with the following properties:

The trie will be represented by single 32 kilobytes byte array.

You cannot store any additional information about the trie outside of the array.

The costs of searching in the trie in proportional to the size of the key.
There are no duplicates.
This is a single thread implementation.
If the array is full, it is fine to return false from the TryWrite()
Unsafe code is fine (but not required).
You cannot use any in memory structure other than the byte array. But it is fine to allocate memory during the processing of the operations (for example, to turn the string key into a byte array).

We will be looking at the following aspect in the implementation:

Correctness
Performance
Space used

The idea is that we can pack as much as possible into as small a space as possible, but at the same time, we can get great performance.

May 26 2016

The design of RavenDB 4.0The client side

time to read 5 min | 926 words

Tweet Share Share 0 comments

Tags:

We didn’t plan to do a lot of changes on the client side for RavenDB 4.0. We want to do most changes on the server side, and just adapt the client when we are doing something differently.

However, we run into an interesting problem. A while ago we asked Idan, a new guy at the company, to write a Python client for RavenDB as part of the onboarding process. Unlike with the JVM client, we didn’t want to do an exact duplication, with just the changes to match the new platform. We basically told him, go ahead and do this. And he went, and he did. But along the way he got to choose his own approach for the implementation, and he didn’t copy the same internal architecture. The end result is that the Python client is significantly simpler than the C# one.

That has a few reasons. To start with, the Python client doesn’t need to implement features such as async or Linq. But for the most part, it is because Idan was able to look at the entire system, grasp it, and then just implement the whole thing in one go.

The RavenDB network layer on the client side has gone through many changes over the years. In particular, we have the following concerns handled in this layer:

Actual sending of requests to the server.
High availability, Failover, load balancing and SLA matching.
Authentication of the requests
Caching server responses.

I’m probably ignoring a few that snuck in there, but those should be the main ones. The primary responsibility of the network layer in RavenDB is to send requests to the server (with all the concerns mentioned above) and give us the JSON object back.

The problem is that right now we have those responsibilities scattered all over the place. Each was added at a different time, and using several different ways, and the handling is done in very different locations in the code. This leads to complexity, and seeing everything in one place in the Python client is a great motivation to simplify things. So we are going to do that.

Having a single location where all of those concerned will make things simpler for us. But we can do better. Instead of just returning a JSON object, we can solve a few additional issues. Performance, stability and ease of deployment.

We can do that by removing the the internal dependency on JSON.Net. A while ago we got tired from conflicting JSON.Net versions, and we decided to just internalize it in our code. That led to much simpler deployment life, but does add some level of complexity if you want to customize your entities over the wire and in RavenDB (because you have to use our own copy of JSON.Net for that). And it complicates getting new JSON.Net updates, which we want.

So we are going to do something quite interesting here. We are going to drop the dependency on JSON.Net entirely. We already have a JSON parser, and one that is extremely efficient, using the blittable format. More importantly, it can writes directly to native code. That give us a very important property, it make it very easy to store the data, already parsed, with a well known size. Let me try to explain my reasoning.

A very important feature of RavenDB is the idea that it can cache requests for you. When you make a request to the server, the server returns the reply as well as an etag. Next time you make this request, you’ll use the same etag, and if the response hasn’t changed, the server can just tell you to use the value in the cache. Right now we are storing the full string of the request in the cache. That lead to a few issues, in particular, while we saved on the I/O of sending the request, we still need to parse the JSON, and we need to keep relatively large (sometimes very large) .NET strings around for a long time.

But if we use blittable data in the client cache, then we have no need to do additional parsing, and we don’t need to involve the GC on the client side to collect things, we can manage that explicitly and directly. So the 2nd time you make a request, we’ll hit the server, learn that the value is still relevant, and then just use it.

That is great for us. Except for one thing. JSON.Net is doing really great job in translating between JSON and .NET objects, and that isn’t really something that we want to do. Instead, we are going to handle blittable objects throughout the RavenDB codebase. We don’t actually need to deal with the user .NET’s types until the very top layers of the system. And we can handle that by having a convention, that will call into JSON.Net (the only place that will happen) and translate .NET objects to JSON and back. The nice thing about it is that since this is just a single location where this happens, we can do that dynamically, without having a hard dependency on JSON.Net version.

That, in turn, also expose the ability to do JSON <—> .Net objects using other parsers, such as Jil, for example, or whatever the end user decides.

May 25 2016

RavenDB 3.5 whirl wind tourGot anything to declare, ya smuggler?

time to read 1 min | 189 words

Tweet Share Share 0 comments

Tags:

raven

Here we have another aspect of making operations’ life easier. Supporting server-side import/export, including multiple databases, and using various options.

Leaving aside the UI bugs in the column alignment (which will be fixed long before you should see this post), there are a couple of things to note here. I have actually written about this feature before, although I do think that this is a snazzy feature.

What is more important is that we managed to get some early feedback on the released version from actual ops people and then noted that while this is very nice, what they actually want is to be able to script this. So this serves as both the UI for activating this feature, and also generating the curl script to execute it from a cron job.

As a reminder, we have the RavenDB Conference in Texas in a few months, where we’ll present RavenDB 3.5 in all its glory.

May 24 2016

The design of RavenDB 4.0Replication from server side

time to read 12 min | 2270 words

Tweet Share Share 2 comments

Tags:

Replication with RavenDB is one of our core features. Something that we had in the product from the very first release (although we did simplify things by several orders of magnitudes over the years). Replication is responsible for high availability, load balancing and several other goodies. For the most part, replication works quite well, and it is a lot less complex then some of the other things that grew over the years (LoadDocument, for example). That said, it doesn’t mean that it can’t be improved. And since this is such an important aspect of RavenDB, we spent quite a lot of time in seeing what we can do to improve it.

Here are the basic design guidelines:

RavenDB is going to remain a multi master system, where each node can accept writes and distribute it to its siblings.
- We intend to use Raft for dynamic leader selection, but that is a layer on top of the basic replication.
- That means that RavenDB is an AP system, and needs to handle conflicts.
We mostly deal with fully connected graphs of relatively small clusters (less than 10 nodes).
- Higher number of nodes are quite frequent, but they don’t use a mesh topology, but typically go for a hierarchy.

This post is going to focus solely on the server side aspects of replication, I’ll do another post about changes from the clients perspective.

Probably the first thing that we intend to change is how we track the replication history. Currently, we track the last 50 changes made on a document. This has several problems:

if there have been more than 50 changes on a document between replication batches, we’ll get a false conflict.
if the documents are small, in many cases the replication metadata is actually bigger than the document itself.

We are going to move to an explicit vector clock implementation. This is a bit complex, because there are multiple concepts that we need to track concurrently here.

Every time that a document changes, the server generate an etag for that change. This etag is an int64 number that is always increasing. This is used for optimistic concurrently, indexing, etc. The etag value is per server, and cannot be used across servers. Each server has a unique identifier. Joining the two together, whenever a document is changed on a server directly (not via replication), we’ll stamp it with the server id and the current document etag.

In other words, let us imagine the following set of operations happening in a three node cluster.

Users/1 is created on Server A, it gets an etag of 1 and a vector clock of {A:1}. Users/2 is created on Server A, it gets an etag of 2 and a vector clock of {A:2}. Users/3 is created on Server C, it gets etag 1 (because etags are local per server) and its vector clock is {C:1}. Servers A and C both replicate to Server B, and to each other, resulting in the following cluster wide setup:

	Server A	Server B	Server C
Users/1	etag 1, {A:1}	etag 1, {A:1}	etag 2, {A:1}
Users/2	etag 2, {A:2}	etag 3, {A:2}	etag 3, {A:2}
Users/3	etag 3, {C:1}	etag 2, {C:1}	etag 1, {C:1}

Note that the etags assigned for each document are not consistent across the different servers, but that they are temporally consistent with respect to the writes. In other words, Users/1 will always have a lower etag than Users/2.

Now, when we modify Users/3 on server B, we’ll get the following cluster wide picture:

	Server A	Server B	Server C
Users/1	etag 1, {A:1}	etag 1, {A:1}	etag 2, {A:1}
Users/2	etag 2, {A:2}	etag 3, {A:2}	etag 3, {A:2}
Users/3	etag 4, {B:4,C:1}	etag 4, {B:4,C:1}	etag 4, {B:4,C:1}

As I said, only changed on the server directly (and not via replication) will impact the document vector clock, but any modification (replication or directly on the node) will modify a document’s etag.

Using such vectors clocks, we gain two major features. First, it is very easy to see if we have conflicting changes. {C:1, B:4} is obviously a parent of {C:4,B:6}, while {C:2,A:6} is a conflict. The other is that we can now form a very easy view of the kind of changes that we have received. We do that using a server wide vector clock. In the case of the table above, the server wide vector clock would be {A:2,B4,C:1}. In other words, it will contain the latest etag seen from each server.

We’ll get to why exactly this is important for us in a bit. For now, just accept that it does, because the next part is about how we are going to actually do the replication. In previous versions of RavenDB, we did each replication batch through a separate REST call to the remote server. This has a few disadvantages. It meant that we had to authenticate every single time, and we couldn’t make any assumptions about the state of the remote server.

In RavenDB 4.0, we intend to move replication to use pure Websockets only usage. On startup, a node will connect to all its siblings, and stay connected to them (retrying the connection on any interruption). This has the nice benefit of only doing authentication once, of course, but far more interesting from our perspective is the fact that it means that we can rely on the state of the server on the other side. TCP has a few very interesting properties here for us. In particular, it guarantee that we’ll have ordered delivery of messages. Which means that we can assume that once we sent a message to a server on a TCP connection, it either got it, or the TCP connection will return an error at some point, forcing us to reconnect.

In other words, it isn’t just authentication that I can do just once, I can also query the remote server for its state (as it regards me), and since I’m the only person that can talk as myself, and I’m the one sending the details. As long as the connection lasts, I know what the other side knows about me. Confusing, isn’t it?

But basically it means that instead of having to ask on each batch what is the last document that the destination server saw of me, I can assume that the last document that I sent was received. That lasts until the connection breaks, in which case I can need to figure out what actually arrived. This seems like a small thing, but this will actually allow me to reduce the number of roundtrips for a batch by half. There are other aspects here that are really nice, I get to piggyback on TCP’s congestion protocol, so if the remote server is slow in accepting updates, it will (eventually) reflect as a blocking write on my end. That seems like a bad thing, right? But this is actually what I want.

Each destination server in RavenDB 4.0 is going to get its own dedicated thread. This thread will manage all outgoing communication with this server. That gives us several really important behaviors. It means that we can easily account for problems by just looking at the thread responsible (hm… I see that replication to node C is consuming a lot of CPU) and it also turn the entire replication process to a pretty simple single threaded operation. Because of the blittable format, we don’t need complex prefetching strategies or sharing of memory in the replication, and a slow node will not impact any other replication behavior. That, in turn, basically mean a thread per connection (see previous discussion on the expected number of nodes being relatively small) and a very simple programming / error handling / communication model.

The replication sending logic goes something like this:

Yes, my scratch pad language is still Boo (Python, if you aren’t familiar with it), and this is meant to convey how simple that thing is. All the complexity that we currently have to deal with is out. Of course, the real code will need to have error handling, reconnection logic, etc, but that is roughly all you’ll need.

Actually, that is a lie. The problem with the code above is that it doesn’t work well with multiple servers. In other words, it is perfect for two nodes, replicating to one another, but when you have multiple nodes, you don’t want a single document update to replication from each node to every other node. That is why we have the concept of vector clocks. At the document level, this serves as an easy way to detect conflicts and see what version of a document is casually later than another version of a document. But on the server level, we gather the latest writes from all nodes that we saw to get the server wide vector clock.

When a document is modified on a server, that server will immediately send that document to all its siblings. Because there is no way that they already have it. But if a document was replicated to a node, it will not start replicating right away. Instead, it will let a set amount of time go by (defaulting to once a minute) and then ask each sibling what is the latest server wide vector clock that it is aware of. If the remote vector clock is equal to or higher than the local server wide vector clock, then we know that they are up to date. In this case, the local server will let the remote server know that they are a match to the current etag on that server.

If, however, the local vector clock is smaller (or conflicting) from the remote server, then we need to send the relevant documents. We already know what is the last etag that the remote server has from us (we negotiated that when we established the connection, and we updated it every time we sent a document to the remote server. Since we have the current vector clock from the remote server, we aren’t going to just blindly send all documents after the last etag we sent to the remote server. Instead, we are going to check each of those to see if the vector clock for the document is larger (or conflicting) than the remote server vector clock. In this way, we can send the remote server only the documents that it doesn’t have.

What about delayed servers? If we had a new node in the cluster, and we just started replicating to it, what happens when a new document is being written. Above, I mentioned that the written to server will immediately write it to all its siblings, but that is an over simplification. An extremely important property of RavenDB replication is that documents are always replicated in the order the server saw them (either written to it directly, or the order they were replicated to it). If we allow a server to replicate documents directly to another server, that might break this guarantee. Looking at the code above, it will also require us to write a separate code path to handle such things. But that is the beauty in this design. All of this logic is actually encapsulated in WaitForMoreDocuments(). You can this of WaitForMoreDocuments() as a simple manual reset event. Whenever a document is written to a document directly, it will be set. But not when a document is replicated to us.

So WaitForMoreDocuments() will basically wait for a document to be written to us, or a timeout, in which case it will check with its sibling for new stuff that need to go over the wire because it was replicated to us. But the code is the same code, and the behavior is the same. If we are busy sending data to a new server? We’ll still set the event, but that will have no effect on the actual behavior. And when we are working with a fully caught up server, the act of writing a single document will immediately free the replication threads to start sending it to the sibling. All the desired behaviors, and very little actual complexity.

On the receiving end, we get just the documents we don’t have, as well as the last etag from that source server (which we’ll keep in persistent storage). Whenever we get a new document, we’ll check if it is conflicting. If so, we’ll mark the document as conflicting and allow the user to define default strategies to handle that (latest, resolve to remote, resolve to local). But we are also going to allow the user to define a Javascript function that will merge the conflicted documents directly. This way you can have your business logic for the resolution directly on the server, and you’ll never actually see any conflicts externally.

There are quite a lot of small details that I’m skipping, but this is already long enough, and should give you a pretty good idea about where we are headed.

Oren Eini

Oren Eini

CEO of RavenDB

The Guts n’ Glory of Database Internals: Persisting information

The high level interview question

Proposed solution to the low level interview question

RavenDB Conference 2016–Slides

How to waste CPU and kill your disk by scaling 100 million inefficiently

The worker pattern

The low level interview question

The design of RavenDB 4.0The client side

RavenDB 3.5 whirl wind tourGot anything to declare, ya smuggler?

The design of RavenDB 4.0Replication from server side

FUTURE POSTS

RECENT SERIES

RECENT COMMENTS

Syndication

Main feed
Comments feed