Reviewing LevelDBPart III, WriteBatch isn’t what you think it is

time to read 26 min | 5198 words

One of the key external components of leveldb is the idea of WriteBatch. It allows you to batch multiple operations into a single atomic write.

It looks like this, from an API point of view:

   1: leveldb::WriteBatch batch;
   2: batch.Delete(key1);
   3: batch.Put(key2, value);
   4: s = db->Write(leveldb::WriteOptions(), &batch);

As we have learned in the previous post, WriteBatch is how leveldb handles all writes. Internally, any call to Put or Delete is translated into a single WriteBatch, then there is some batching involved across multiple batches, but that is beside the point right now.

I dove into the code for WriteBatch, and immediately I realized that this isn’t really what I bargained for. In my mind, WriteBatch was supposed to be something like this:

   1: public class WriteBatch
   2: {
   3:    List<Operation> Operations;
   4: }

Which would hold the in memory operations until they get written down to disk, or something.

Instead, it appears that leveldb took quite a different route. The entire data is stored in the following format:

   1: // WriteBatch::rep_ :=
   2: //    sequence: fixed64
   3: //    count: fixed32
   4: //    data: record[count]
   5: // record :=
   6: //    kTypeValue varstring varstring         |
   7: //    kTypeDeletion varstring
   8: // varstring :=
   9: //    len: varint32
  10: //    data: uint8[len]

This is the in memory value, mind. So we are already storing this in a single buffer. I am not really sure why this is the case, to be honest.

WriteBatch is pretty much a write only data structure, with one major exception:

   1: // Support for iterating over the contents of a batch.
   2: class Handler {
   3:  public:
   4:   virtual ~Handler();
   5:   virtual void Put(const Slice& key, const Slice& value) = 0;
   6:   virtual void Delete(const Slice& key) = 0;
   7: };
   8: Status Iterate(Handler* handler) const;

You can iterate over the batch. The problem is that we now have this implementation for Iterate:

   1: Status WriteBatch::Iterate(Handler* handler) const {
   2:   Slice input(rep_);
   3:   if (input.size() < kHeader) {
   4:     return Status::Corruption("malformed WriteBatch (too small)");
   5:   }
   6:  
   7:   input.remove_prefix(kHeader);
   8:   Slice key, value;
   9:   int found = 0;
  10:   while (!input.empty()) {
  11:     found++;
  12:     char tag = input[0];
  13:     input.remove_prefix(1);
  14:     switch (tag) {
  15:       case kTypeValue:
  16:         if (GetLengthPrefixedSlice(&input, &key) &&
  17:             GetLengthPrefixedSlice(&input, &value)) {
  18:           handler->Put(key, value);
  19:         } else {
  20:           return Status::Corruption("bad WriteBatch Put");
  21:         }
  22:         break;
  23:       case kTypeDeletion:
  24:         if (GetLengthPrefixedSlice(&input, &key)) {
  25:           handler->Delete(key);
  26:         } else {
  27:           return Status::Corruption("bad WriteBatch Delete");
  28:         }
  29:         break;
  30:       default:
  31:         return Status::Corruption("unknown WriteBatch tag");
  32:     }
  33:   }
  34:   if (found != WriteBatchInternal::Count(this)) {
  35:     return Status::Corruption("WriteBatch has wrong count");
  36:   } else {
  37:     return Status::OK();
  38:   }
  39: }

So we write it directly to a buffer, then read from that buffer. The interesting bit is that the actual writing to leveldb itself is done in a similar way, see:

   1: class MemTableInserter : public WriteBatch::Handler {
   2:  public:
   3:   SequenceNumber sequence_;
   4:   MemTable* mem_;
   5:  
   6:   virtual void Put(const Slice& key, const Slice& value) {
   7:     mem_->Add(sequence_, kTypeValue, key, value);
   8:     sequence_++;
   9:   }
  10:   virtual void Delete(const Slice& key) {
  11:     mem_->Add(sequence_, kTypeDeletion, key, Slice());
  12:     sequence_++;
  13:   }
  14: };
  15:  
  16: Status WriteBatchInternal::InsertInto(const WriteBatch* b,
  17:                                       MemTable* memtable) {
  18:   MemTableInserter inserter;
  19:   inserter.sequence_ = WriteBatchInternal::Sequence(b);
  20:   inserter.mem_ = memtable;
  21:   return b->Iterate(&inserter);
  22: }

As I can figure it so far, we have the following steps:

  • WriteBatch.Put / WriteBatch.Delete gets called, and the values we were sent are copied into our buffer.
  • We actually save the WriteBatch, at which point we unpack the values out of the buffer and into the memtable.

It took me a while to figure it out, but I think that I finally got it. The reason this is the case is that leveldb is a C++ application. As such, memory management is something that it needs to worry about explicitly.

In particular, you can’t just rely on the memory you were passed to be held, the user may release that memory after they called to Put. This means, in turn, that you must copy the memory to memory that leveldb allocated, so leveldn can manage its own lifetime. This is a foreign concept to me because it is such a strange thing to do in the .NET land, where memory cannot just disappear underneath you.

On my next post, I’ll deal a bit more with this aspect, buffers management and memory handling in general.

More posts in "Reviewing LevelDB" series:

  1. (26 Apr 2013) Part XVIII–Summary
  2. (15 Apr 2013) Part XVII– Filters? What filters? Oh, those filters…
  3. (12 Apr 2013) Part XV–MemTables gets compacted too
  4. (11 Apr 2013) Part XVI–Recovery ain’t so tough?
  5. (10 Apr 2013) Part XIV– there is the mem table and then there is the immutable memtable
  6. (09 Apr 2013) Part XIII–Smile, and here is your snapshot
  7. (08 Apr 2013) Part XII–Reading an SST
  8. (05 Apr 2013) Part XI–Reading from Sort String Tables via the TableCache
  9. (04 Apr 2013) Part X–table building is all fun and games until…
  10. (03 Apr 2013) Part IX- Compaction is the new black
  11. (02 Apr 2013) Part VIII–What are the levels all about?
  12. (29 Mar 2013) Part VII–The version is where the levels are
  13. (28 Mar 2013) Part VI, the Log is base for Atomicity
  14. (27 Mar 2013) Part V, into the MemTables we go
  15. (26 Mar 2013) Part IV
  16. (22 Mar 2013) Part III, WriteBatch isn’t what you think it is
  17. (21 Mar 2013) Part II, Put some data on the disk, dude
  18. (20 Mar 2013) Part I, What is this all about?