Ayende @ Rahien

filter by tags archive

architecture (623) rss
bugs (451) rss
community (382) rss
databases (481) rss
design (899) rss
development (654) rss
hibernating-practices (73) rss
miscellaneous (592) rss
performance (397) rss
programming (1104) rss
raven (1471) rss
ravendb.net (558) rss
reviews (184) rss

2025
- October (4)
- September (10)
- August (6)
- July (7)
- June (7)
- May (10)
- April (10)
- March (10)
- February (7)
- January (12)
2024
- December (3)
- November (2)
- October (1)
- September (3)
- August (5)
- July (10)
- June (4)
- May (6)
- April (2)
- March (8)
- February (2)
- January (14)
2023
- December (4)
- October (4)
- September (6)
- August (12)
- July (5)
- June (15)
- May (3)
- April (11)
- March (5)
- February (5)
- January (8)
2022
- December (5)
- November (7)
- October (7)
- September (9)
- August (10)
- July (15)
- June (12)
- May (9)
- April (14)
- March (15)
- February (13)
- January (16)
2021
- December (23)
- November (20)
- October (16)
- September (6)
- August (16)
- July (11)
- June (16)
- May (4)
- April (10)
- March (11)
- February (15)
- January (14)
2020
- December (10)
- November (13)
- October (15)
- September (6)
- August (9)
- July (9)
- June (17)
- May (15)
- April (14)
- March (21)
- February (16)
- January (13)
2019
- December (17)
- November (14)
- October (16)
- September (10)
- August (8)
- July (16)
- June (11)
- May (13)
- April (18)
- March (12)
- February (19)
- January (23)
2018
- December (15)
- November (14)
- October (19)
- September (18)
- August (23)
- July (20)
- June (20)
- May (23)
- April (15)
- March (23)
- February (19)
- January (23)
2017
- December (21)
- November (24)
- October (22)
- September (21)
- August (23)
- July (21)
- June (24)
- May (21)
- April (21)
- March (23)
- February (20)
- January (23)
2016
- December (17)
- November (18)
- October (22)
- September (18)
- August (23)
- July (22)
- June (17)
- May (24)
- April (16)
- March (16)
- February (21)
- January (21)
2015
- December (5)
- November (10)
- October (9)
- September (17)
- August (20)
- July (17)
- June (4)
- May (12)
- April (9)
- March (8)
- February (25)
- January (17)
2014
- December (22)
- November (19)
- October (21)
- September (37)
- August (24)
- July (23)
- June (13)
- May (19)
- April (24)
- March (23)
- February (21)
- January (24)
2013
- December (23)
- November (29)
- October (27)
- September (26)
- August (24)
- July (24)
- June (23)
- May (25)
- April (26)
- March (24)
- February (24)
- January (21)
2012
- December (19)
- November (22)
- October (27)
- September (24)
- August (30)
- July (23)
- June (25)
- May (23)
- April (25)
- March (25)
- February (28)
- January (24)
2011
- December (17)
- November (14)
- October (24)
- September (28)
- August (27)
- July (30)
- June (19)
- May (16)
- April (30)
- March (23)
- February (11)
- January (26)
2010
- December (29)
- November (28)
- October (35)
- September (33)
- August (44)
- July (17)
- June (20)
- May (53)
- April (29)
- March (35)
- February (33)
- January (36)
2009
- December (37)
- November (35)
- October (53)
- September (60)
- August (66)
- July (29)
- June (24)
- May (52)
- April (63)
- March (35)
- February (53)
- January (50)
2008
- December (58)
- November (65)
- October (46)
- September (48)
- August (96)
- July (87)
- June (45)
- May (51)
- April (52)
- March (70)
- February (43)
- January (49)
2007
- December (100)
- November (52)
- October (109)
- September (68)
- August (80)
- July (56)
- June (150)
- May (115)
- April (73)
- March (124)
- February (102)
- January (68)
2006
- December (95)
- November (53)
- October (120)
- September (57)
- August (88)
- July (54)
- June (103)
- May (89)
- April (84)
- March (143)
- February (78)
- January (64)
2005
- December (70)
- November (97)
- October (91)
- September (61)
- August (74)
- July (92)
- June (100)
- May (53)
- April (42)
- March (41)
- February (84)
- January (31)
2004
- December (49)
- November (26)
- October (26)
- September (6)
- April (10)

Couchbase vs RavenDB Performance at Rakuten Kobo Whitepaper

Apr 26 2013

Reviewing LevelDBPart XVIII–Summary

time to read 2 min | 349 words

Tweet Share Share 8 comments

Tags:

Well, I am very happy at the conclusion of this blog post series. Beside being one of the longest that I have done, this actually stretched my ability to count using roman numerals.

In summary, I am quite happy that I spent the time reading all of this code. The LevelDB codebase is really simple, when you grok what it actually does. There is nothing there that would totally stun a person. What there is there, however, is a lot of accumulated experience in building those sort of things.

You see this all over the place, in the format of the SST, in the way compaction is working, in the ability to add filters, write merging, etc. The leveldb codebase is a really good codebase to read, and I am very happy to have done so. Especially since doing this in C++ is way out of m comfort zone. It was also interesting to read what I believe is idiomatic C++ code.

Another observation about leveldb is that it is a hard core C++ product. You can’t really take that and use the same approach in a managed environment. In particular, efforts to port leveldb to java (https://github.com/dain/leveldb/) are going to run into hard issues with problems like managing the memory. Java, like .NET, has issues with allocating large byte arrays, and even from the brief look I took, working with leveldb on java using the codebase as it is would likely put a lot of pressure there.

Initially, I wanted to use leveldb as a storage engine for RavenDB. Since I couldn’t get it compiling & working on Windows (and yes, that is a hard requirement. And yes, it has to be compiling on my machine to apply), I thought about just porting it. That isn’t going to be possible. At least not in the trivial sense. Too much work is require to make it work properly.

Yes, I have an idea, no, you don’t get to hear it at this time Smile .

Apr 15 2013

Reviewing LevelDBPart XVII– Filters? What filters? Oh, those filters…

time to read 2 min | 243 words

Tweet Share Share 3 comments

Tags:

I was all set to finish this series, when I realized that I missed something important, I didn’t covered, and indeed, I don’t currently understand, what filters are, and how are they used.

As usual, once I actually got to the part of the code where this is actually handled (instead of ignoring it), it made a lot more sense:

And that, following with the rest of the code that I read makes a lot more sense.

A SST is a sorted table, essentially. We have the keys and the blocks, etc. And we can find a value very quickly. But what if you could do this even more cheaply?

Using a bloom filter is a good way to never get false negative, and it will reduce the amount of work we have to do if we can’t find the key in the SST drastically (only need to read the filter value, don’t even need to do any additional checks). Quite elegant, even if I say so myself.

There is one thing to pay attention to and that is that you can define your own comparator, in which case you must also define you own filter policy. If you use a comparator that ignore casing, you also need to provider a filter that ignore casing.

Apr 12 2013

Reviewing LevelDBPart XV–MemTables gets compacted too

time to read 2 min | 286 words

Tweet Share Share 0 comments

Tags:

As mentioned before, every 4 MB or so, we will move the current memtable and switch to a new one, turning the current one into an immutable memtable, which will then be compacted to disk. This post is about that process. The work there is done by the surprisingly name: DBImpl::CompactMemTable, which then calls to WriteLevel0Table.

That method build the actual table, using the same approach I already covered. Honestly, the code is pretty obvious now that I spent enough time reading this. The only thing of interest is DeleteObsoleteFiles() method, which looks interesting if only because it is one of the few places leveldb actually does real work with files. Mostly, however, ParseFileName is interesting:

It give me a good way to look at the actual files being used. This will be important when we will investigate how leveldb handles recovery.

Leveldb will consider files obsolete if:

They are log files and aren't of the current/previous log number.
If it is a manifest file, keep the current one (or later if there are any).
If it is a table file, keep it if it is in use. A table file may have been compacted away.
If it is a lock file, info log or the current db file.

All of this is interesting, but on a minor level. I think that we are nearing the end of the codebase now. The only thing that I really want to go into is the recovery parts, and that would pretty much be it.

Apr 11 2013

Reviewing LevelDBPart XVI–Recovery ain’t so tough?

time to read 3 min | 500 words

Tweet Share Share 1 comments

Tags:

This is how leveldb starts up a new db:

As you can imagine, this is quite useful to find out, since it means that everything we do on startup is recover. I do wonder why we have to take a lock immediately, though. I don't imagine that anything else can try to make use of the db yet, it hasn't been published to the outside world.

Recover is interesting (and yes, I know I wrote that a lot lately).

It starts by taking a lock on the db (File System lock, by holding to LOCK file).
If the db does not exists, we call NewDB(), which will create a default MANIFEST-00001 file and a CURRENT file pointing at that.
If the db exists, we call VersionSet::Recover(), this starts by reading the CURRENT file, which points to the appropriate MANIFEST file.
This then gives us the latest status of the db versions, in particular, it tells us what files belong to what levels and what they ranges are.
We check that the comparators we use is identical to the one used when creating the db.
The code make it looks like there might be multiple version records in the MANIFEST file, but I don't recall that. It might be just the way the API is structured, though. I just checked with the part that write it, and yes, it should have just one entry.
Once we have our versions, what next? We check that all of the expected files are actually there, because otherwise we might have a corrupted install.
The final thing we do is look for log files that are later than the latest we have in the MANIFEST. I am not sure if this indicates a recover from a crash or a way to be compatible with an older version. I guess that one way this can happen is if you crashed midway while committing a transaction.

When recovering, we are forcing checksum checks, to make sure that we don't get corrupt data (which might very well be the case, since the data can just stop at any point here. The relevant code here is leveldb::log:Reader, which takes care of reading potentially corrupt log file and reporting on its finding. I already went over how the log file is structured, the log reader just does the same thing in reverse, with a lot of paranoia. While reading from the log file, we build a memtable with the committed & safe changes. Here, too, we need to handle with memtable sizes, so we might generate multiple level 0 files during this process.

And... that is pretty much it.

I'll have another post summarizing this, and maybe another on what I intend to do with this information, but I'll keep that close to the chest for now.

Apr 10 2013

Reviewing LevelDBPart XIV– there is the mem table and then there is the immutable memtable

time to read 3 min | 488 words

Tweet Share Share 0 comments

Tags:

In leveldb we have the memtable, into which all of the current writes are going and the immutable memtable. So far, I haven't really checked what it actually means.

It looks like this happens on the write, by default, a mem table is limited to about 4 MB of space. We write to the memtable (backed by the tx log) until we get to the point where we go beyond the 4MB limit (note that again, if you have large values, you might go much higher than 4MB and then we switch to another memtable, change the existing memtable to be the immutable one, and move on.

Something that might trip you is that if you write 2 big values one after the other, in separate batches, the second one might need to wait for the compaction to complete.

Here is the test code:

   std::string big(1024 * 1024 * 5, 'a');
     
     for(int i=0; i < 10; i++ ){
        
         std::clock_t start = std::clock();
       
         std::ostringstream o;
         o << "test" << i;
         std::string key =  o.str();


         db->Put(writeOptions, key, big);
    
         std::clock_t end = std::clock();
   
         std::cout << i << ": " << 1000.0 *(end - start) / CLOCKS_PER_SEC << "ms " << std::endl;
   
     }

And the output is:

0: 20ms
1: 30ms
2: 50ms
3: 50ms
4: 50ms
5: 50ms
6: 40ms
7: 60ms
8: 50ms
9: 50ms

Not really that interesting, I'll admit. But when I tried it with 50 mb for each value, it felt like the entire machine grind to a halt. Considering the amount of times memory is copied around, and that it needs to be saved to at least two locations (log file & sst), that makes a lot of sense.

For references ,those were my timing, but I am not sure that I trust them.

0: 170ms
1: 310ms
2: 350ms
3: 460ms
4: 340ms
5: 340ms
6: 280ms
7: 530ms
8: 400ms
9: 1200ms

Apr 09 2013

Reviewing LevelDBPart XIII–Smile, and here is your snapshot

time to read 3 min | 462 words

Tweet Share Share 2 comments

Tags:

Okay, now that I know how data actually gets to the disk and from it, it is time to read how leveldb handles snapshots. Snapshots seems to be very simple on the surface. On every write, we generate a sequence number. We store the number locally and use the oldest still living snapshot as the oldest version that we have to retain when doing compaction work.

I had hard time figuring out how it worked out with regards to using this in memory. Consider the following code:

     leveldb::WriteOptions writeOptions;
     writeOptions.sync = true;
     
     db->Put(writeOptions, "test", "one");
     
     const leveldb::Snapshot* snapshot = db->GetSnapshot();
     
     db->Put(writeOptions, "test", "two");
  
    leveldb::ReadOptions readOptions;
    readOptions.snapshot = snapshot;
    std::string val;
    status = db->Get(readOptions, "test", &val);

This will properly give val == “one” when we execute it.

As it turned out, I missed something when I read the code earlier for MemTables. The value that is actually stored as the key is [key][tag]. And the tag is the sequence number + write type. And because of the way it is sorted (little endian, always), it means that higher values are sorted first. And what that means in turn is that unless you specify a specific snapshot number (which is what this tag contains, most of the time), you are going to get the latest version. But if you specify a snapshot number, you’ll get the value that was there as of that snapshot.

And that, in turn, means that we can write code like this:

Where key.memtable_key() contains the required snapshot value. So we can just skip all the ones larger than what we want.

That is really cool, but what about when we go to disk? Pretty much in the same way. The actual key value include the sequence & tag value. But the comparator knows to filter it out when needed. This is quite nice, and an elegant way to handle this situation.

Apr 08 2013

Reviewing LevelDBPart XII–Reading an SST

time to read 6 min | 1035 words

Tweet Share Share 1 comments

Tags:

After figuring out how the TableCache works, I now want to dig a bit deeper into what the Tables are. That means that we are starting out in Table::Open. And this starts with:

So we start by reading the footer. Then we read the index block:

Construct a new table, and we are pretty much set. I went back to look at the TableBuilder, and I think that I am getting things better. There is the BlockHandle, which is basically just the offset / size in the file.

Then we have the actual index, which is located at the end of the file. This one has the format of:

key, offset, size
key, offset, size
key, offset, size

The only problem is that I don't see yet where this is actually used. And here it is, inside Table::InternalGet.

So we seek into the block data. And that matches pretty closely what I would expect here. And then we have this:

Some notes about this code, it looks like we are going to be loading the entire block into memory. But what happens if we have a very large value? For that matter, what happens if we have a very large value next to a very small value on the same block, but we wanted the small value?

I think that I am wrong here. Looking at the code ,I found this comment, which I previously missed:

And that explains it really nicely with regards to how blocks works in general. The cool part happens inside Block::Iter::Seek, we first do a binary search for the prefixes that we have inside the block, and then a linear search inside the restart section. By default, there are 16 items per restart, so that is going to generate a really fast turnaround in general.

One key point I think that bear repeating is with regards to my previous comment about sizes. We don't actually ever copy the entire block to memory, instead, we rely on the OS to do so for us, because the files are actually memory mapped. That means that we can easily access them with very little cost, even if we have mixed big & small values on the same block. That is because we can just skip the large value and not touch it,and rely on the OS to page the right pages for us. That said, when reading a block, not just iterating it, we are going to allocate it in memory:

So I am not sure about that. I think that I was mistaken previously. The only problem is that the buf is actually used for scratch purposes. I think that this is right on both ends, looking at the code, we have PosixRandomAccessFile:

This is the standard approach, actually. Read it from the disk into the buffer, and go on with your life. But we also have PosixMMapReadableFile implementation:

And this is a lot more reasonable and in line with what I was thinking. We don't use the scratch at all.

However, we still need to allocate this scratch buffer on every single write. Looking at the code, the decision on what to allocate seems to be done here:

As you can see, it is mmap_limit_ that will make that determination. That is basically limiting us to no memory maps on 32 bits (make sense, the 2 GB virtual address space is really small) and to a max of 1,000 mapped files for 64 bits. Given that I am assuming that you are going to run this on 64 bits a lot more often than on 32 bits (at least for server apps), it would make more sense...

Stopping right here. Leveldb is being used in the browser as well, and presumably on mobile devices, etc. That means that you can't really assume/require 64 bits. And given that most of the time, you are going to have blocks of up to 4 KB in size (except if you have very large keys), I think that this is reasonable. I would probably have done away with allocating the buffer in the happy case, but that is beside the point, probably, since most of the time I assume that the value are small enough.

I am looking at this through the eyes of someone who deals with larger values all the time, so it triggers a lot of introspection for me. And this is how we actually read a value from disk, I actually managed to also figure out how we write, and in what format. All together good day's (actually, it was mostly at night) fun.

Apr 05 2013

Reviewing LevelDBPart XI–Reading from Sort String Tables via the TableCache

time to read 6 min | 1034 words

Tweet Share Share 2 comments

Tags:

In the previous post, I focused mostly on reading the code for writing a SST to disk. But I have to admit that I am not really following how you read them back in a way that would be easy to read.

In order to understand that, I think that the right place in the code would be the TableCache. The API for that is pretty slick, here are just the header (comments were stripped).

class TableCache {
 public:
  TableCache(const std::string& dbname, const Options* options, int entries);
  ~TableCache();

  Iterator* NewIterator(const ReadOptions& options,
                        uint64_t file_number,
                        uint64_t file_size,
                        Table** tableptr = NULL);

  Status Get(const ReadOptions& options,
             uint64_t file_number,
             uint64_t file_size,
             const Slice& k,
             void* arg,
             void (*handle_result)(void*, const Slice&, const Slice&));

  void Evict(uint64_t file_number);

 private:

  Status FindTable(uint64_t file_number, uint64_t file_size, Cache::Handle**);
};

There are a couple of interesting things here that show up right away. How do you know what is the number of entries. I guess that this is stored externally, but I am not sure where. I am going to figure that question out first.

And the answer is strange:

It isn't number of entries in the table, it is the number of files? That actually say something very important, since this means that the table cache is reading multiple SST files, rather than just one per cache. Looking at the rest of the API, it makes even more sense. We need to pass the file number that we are going to be reading. That one is going to be got from the Version we are using.

Side note: I just spend an hour or so setting up a tryout project for leveldb, so I can actually execute the code and see what it does. This had me learning cmake (I am using KDevelop to read the code) and I finally got it. Still haven't figure out how to step into leveldb's code. But that is a good step.
Also, the debug experience is about 2/10 compared to VS.
And damn, I am so not a C++ developer any longer. Then again, never was a real dev on linux, anyway.

The first thing the TableCache does is to setup a cache. This is interesting, and I followed the code, it create 16 internal caches and has between them. I think that this is done to get concurrency because all the internal cache methods looks like this:

Note the mutex lock. That is pretty much how it works for all of the cache work. Looking deeper into the code, I can see another reason why I should be grateful for staying away from C++. leveldb comes with its own hash table functionality. At any rate, the cache it using LRU to evict items, and it get the number of entries from the value that was passed in the ctor. That make me think that it hold the opened files.

Speaking of the cache, here is an example of the code using it:

The cache is also used to do block caching, this is why it takes a slice as an argument. I'm going to go over that later, because this looks quite interesting. The rest of this method looks like this:

So the only very interesting bit is the Table::Open. The rest is just opening the mmap file and storing it in the cache. Note that the actual file size is passed externally. I am not really sure what this means yet. I'll go over the table code later, right now I want to focus on the table cache.

Looking over the TableCache interface, we can see that we always get the file size from outside. That got me curious enough to figure out why. And the answer is that we always have it in the FileMetaData that we previously explored. I am not sure why that is so important, but I'll ignore it for now.

The rest of TableCache is pretty simple, although this made me grin:

More specifically, look at the RegisterCleanup, this is basically registering for the delete event, so they can unref the cache. Pretty nice, all around.

The rest of the code is just forwarding calls to the Table, so that is what I'll be reading next...

Apr 04 2013

Reviewing LevelDBPart X–table building is all fun and games until…

time to read 3 min | 502 words

Tweet Share Share 6 comments

Tags:

It seems like I have been trying to get to the piece that actually write to disk for a very long while.

The API from the header file looks to be pretty nice. I would assume that you would call it something like this:

TableBuilder builder(options,writableFile);
builder.Add(“one”, “bar”);
builder.Add(“two”, “foo”);
builder.Finish();

There are some things that looks funky. Like Flush(), which has a comment that I don't follow: Can be used to ensure that two adjacent entries never live in the same data block.

Oh well, maybe I need to actually read the code first?

And immediately we have this:

I think that this explain why you have Flush(), this forces different blocks. Most of the time, you let leveldb handle that, but you can also control that yourself if you really know what you are doing.

There is something called FilterBlockPolicy, but I don't get what it is for, ignoring for now.

Calling Add() does some really nice things with regards to the implementation. Let us say that we are storing keys with similar postfixes. That is very common.

When considering leveldb for RavenDB, we decided that we would store the data in this manner:

/users/ayende/data = {}
/users/ayende/metadata = {}

Now, when we want to add them, we add the first one and then the second one, but we have a chance to optimize things.

We can arrange things so we would output:

0,19,2
/users/ayende/data
{}
15, 9, 3
metadata
{}

Note that for the second key, we can reuse the first 15 characters of the previous entry. And we just saved some bytes on the disk.

Now, blocks are held in memory, and by default they are flushed every 4 KB or so.

Side note: looking at the code, it is really scary just how many times data is copied from one memory location to another in the leveldb codebase. Client code to WriteBatch, WriteBatch to MemTable, MemTable to TableBuilder, etc.

Okay, so now I am pretty sure that I am following what is going on when writing to disk. Still not sure how you search in an SST. The data is sorted, but that looks like it would require multiple seeks .There is metadata & index blocks in there, but I am not quite sure how they would work. Need to read the reader side of things to really get things. Well, I'll do that next time...

Apr 03 2013

Reviewing LevelDBPart IX- Compaction is the new black

time to read 4 min | 726 words

Tweet Share Share 2 comments

Tags:

After going over the VersionSet, I understand how leveldb decide when to compact and what it decide to compact. More or less, at least.

This means that I now mostly can figure out what this does:

Status DBImpl::BackgroundCompaction()

A user may force manual compaction of a range, or we may have reasons of our own to decide to compact, based on leveldb heuristics. Either way, we get the Compaction object, which tells us what files we need to merge.

There is a check there whatever we can do a trivial compaction, that is, just move the file from the current level to level+1. The interesting thing is that we avoid doing that if this is going to cause issues in level+2 (require more expensive compaction later on).

But the interesting work is done in DoCompactionWork, where we actually do compaction of complex data.

The code refers to snapshots for the first time that I see. We only merge values that are in a valid snapshot. So data doesn't “go away” for users. While holding a snapshot active.

The actual work starts in here:

This give us the ability to iterate over all of the entries in all of the files that need compaction.

And then we go over it:

But note that on each iteration we do a check if we need to CompactMemTable(); I followed the code there, and we finally write stuff to disk! I am quite excited about this, but I'll handle that in a future blog post. I want to see how actual compaction works.

We then have this:

This is there to stop a compaction that would force a very expensive compaction next time, I think.

As a side note, this really drive me crazy:

Note that we have current_output() and FileSize() in here. I don't really care what naming convention you use, but I would really rather that you had one. If there is one for the leveldb code base, I can't say that I figured it out. It seems like mostly it is PascalCase, but every now & then we have this_style methods.

Back to the code, it took me a while to figure it out.

Will return values in sorted key order, that means that if you have the same key in multiple levels, we need to ignore the older values. After this is happening, we now have this:

This is where we are actually writing stuff out to the SST file! This is quite exciting :-). I have been trying to figure that out for a while now.

The rest of the code in the function is mostly stuff related to stats book keeping, but this looks important:

This generate the actual VersionEdit, which will remove all of the files that were compacted and add the new file that was created as a new Version to the VersionSet.

Good work so far, even if I say so myself. We actually go to where we are building the SST files. Now it is time to look at the code that build those table. Next post, Table Builder...

Oren Eini

Oren Eini

CEO of RavenDB

Reviewing LevelDBPart XVIII–Summary

Reviewing LevelDBPart XVII– Filters? What filters? Oh, those filters…

Reviewing LevelDBPart XV–MemTables gets compacted too

Reviewing LevelDBPart XVI–Recovery ain’t so tough?

Reviewing LevelDBPart XIV– there is the mem table and then there is the immutable memtable

Reviewing LevelDBPart XIII–Smile, and here is your snapshot

Reviewing LevelDBPart XII–Reading an SST

Reviewing LevelDBPart XI–Reading from Sort String Tables via the TableCache

Reviewing LevelDBPart X–table building is all fun and games until…

Reviewing LevelDBPart IX- Compaction is the new black

FUTURE POSTS

RECENT SERIES

RECENT COMMENTS

Syndication

Main feed
Comments feed