Reviewing LevelDB: Part VI, the Log is base for Atomicity
Here we are starting to get into the interesting bits. How do we actually write to disk? There are two parts to that. The first part is the log file. This is where all the recent values are stored, and it serves as an unsorted backup for the MemTable in case of crashes.
Let us see how this actually works. There are two classes involved here: leveldb::log::Writer and leveldb::WritableFile. I think that WritableFile is the leveldb file abstraction, so it is bound to be simpler. We'll take a look at that first.
Here is what it looks like:
    // A file abstraction for sequential writing.  The implementation
    // must provide buffering since callers may append small fragments
    // at a time to the file.
    class WritableFile {
     public:
      WritableFile() { }
      virtual ~WritableFile();

      virtual Status Append(const Slice& data) = 0;
      virtual Status Close() = 0;
      virtual Status Flush() = 0;
      virtual Status Sync() = 0;

     private:
      // No copying allowed
      WritableFile(const WritableFile&);
      void operator=(const WritableFile&);
    };
Pretty simple, overall. There is the buffering requirement, but that is pretty easy to satisfy. Note that this is a C++ interface. There are a bunch of implementations, but the one that I think is relevant here is PosixMmapFile. So much for it being simple. As I mentioned, this is POSIX code that I am reading, and I have to do a lot of lookups in the man pages. The implementation isn't that interesting, to be fair, and it is full of POSIX mmap minutiae, so I am going to skip it.
I wonder why the choice was made to use memory mapped files, since the API exposed here is pretty much perfect for streams. As you can imagine from the code, calling Append() just writes the values into the mmap'ed file, Flush() is a no-op, and Sync() actually asks the file system to write the values to disk and waits on that. I am guessing that the use of mmap files is related to the fact that mmap files are used extensively in the rest of the code base (for reads), and that it gives leveldb the benefit of using the OS memory manager as the buffer.
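To make that concrete, here is a little sketch of the idea. This is not the actual PosixMmapFile (which also has to deal with remapping, page sizes and error handling); base_ and dst_ are names I made up for the mapped region and the current write position:

    #include <sys/mman.h>
    #include <cstring>

    #include "leveldb/slice.h"
    #include "leveldb/status.h"

    // A toy illustration of a mmap-backed WritableFile, not the real one.
    class MmapSketchFile {
     public:
      leveldb::Status Append(const leveldb::Slice& data) {
        memcpy(dst_, data.data(), data.size());  // an append is just a memory copy
        dst_ += data.size();
        return leveldb::Status::OK();
      }
      leveldb::Status Flush() {
        return leveldb::Status::OK();  // nothing to do, the OS owns the buffer
      }
      leveldb::Status Sync() {
        // Ask the OS to write the dirty pages to disk, and wait for it.
        if (msync(base_, dst_ - base_, MS_SYNC) != 0) {
          return leveldb::Status::IOError("msync failed");
        }
        return leveldb::Status::OK();
      }

     private:
      char* base_;  // start of the mmap'ed region of the log file
      char* dst_;   // next byte to write into
    };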
Now that we know what a WritableFile looks like, let us see what the leveldb::log::Writer is like. In terms of the interface, it is pretty slick; it has a single public method:
    Status AddRecord(const Slice& slice);
As a reminder, those two are used together in the DBImpl::Write() method, like so:
    status = log_->AddRecord(WriteBatchInternal::Contents(updates));
    if (status.ok() && options.sync) {
      status = logfile_->Sync();
    }
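Note that the log::Writer does not own the file; it just writes through it. Here is a minimal sketch of how the two pieces are wired together (the file name and payload are invented for the example, but the calls mirror the setup that DBImpl does when opening the database):

    #include "db/log_writer.h"
    #include "leveldb/env.h"

    int main() {
      leveldb::WritableFile* file;
      leveldb::Status s =
          leveldb::Env::Default()->NewWritableFile("/tmp/000001.log", &file);
      if (!s.ok()) return 1;

      // The Writer wraps, but does not own, the WritableFile.
      leveldb::log::Writer log(file);
      s = log.AddRecord(leveldb::Slice("some batched write"));
      if (s.ok()) {
        s = file->Sync();  // this is the part that options.sync controls
      }

      delete file;
      return s.ok() ? 0 : 1;
    }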
From the look of the API, it appears that this is simply a matter of forwarding the call from one implementation to another. But a lot more is actually going on:
     1: Status Writer::AddRecord(const Slice& slice) {
     2:   const char* ptr = slice.data();
     3:   size_t left = slice.size();
     4:
     5:   // Fragment the record if necessary and emit it.  Note that if slice
     6:   // is empty, we still want to iterate once to emit a single
     7:   // zero-length record
     8:   Status s;
     9:   bool begin = true;
    10:   do {
    11:     const int leftover = kBlockSize - block_offset_;
    12:     assert(leftover >= 0);
    13:     if (leftover < kHeaderSize) {
    14:       // Switch to a new block
    15:       if (leftover > 0) {
    16:         // Fill the trailer (literal below relies on kHeaderSize being 7)
    17:         assert(kHeaderSize == 7);
    18:         dest_->Append(Slice("\x00\x00\x00\x00\x00\x00", leftover));
    19:       }
    20:       block_offset_ = 0;
    21:     }
    22:
    23:     // Invariant: we never leave < kHeaderSize bytes in a block.
    24:     assert(kBlockSize - block_offset_ - kHeaderSize >= 0);
    25:
    26:     const size_t avail = kBlockSize - block_offset_ - kHeaderSize;
    27:     const size_t fragment_length = (left < avail) ? left : avail;
    28:
    29:     RecordType type;
    30:     const bool end = (left == fragment_length);
    31:     if (begin && end) {
    32:       type = kFullType;
    33:     } else if (begin) {
    34:       type = kFirstType;
    35:     } else if (end) {
    36:       type = kLastType;
    37:     } else {
    38:       type = kMiddleType;
    39:     }
    40:
    41:     s = EmitPhysicalRecord(type, ptr, fragment_length);
    42:     ptr += fragment_length;
    43:     left -= fragment_length;
    44:     begin = false;
    45:   } while (s.ok() && left > 0);
    46:   return s;
    47: }
There is quite a lot going on here, and I don't yet know what all of it is for. At first glance, it appears that we are fragmenting the value into multiple records, and that we might want to emit zero-length records (no idea what those are for; maybe compactions?).
It appears that we write in blocks of 32KB at a time. Lines 12 – 21 deal with finalizing the block when there is no more space left in it (basically, filling the remainder with nulls).
Lines 26 – 40 figure out the type of the record we are about to write: a full record, all of which fits in a single block; a first record, which starts a sequence of fragments; or a middle / last record, which are self-explanatory.
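To see how this plays out, here is a toy rerun of the loop above, using the same constants as the real code, for a 100,000 byte record that starts four bytes before a block boundary (so it also exercises the trailer logic from lines 12 – 21):

    #include <cstdio>

    int main() {
      const int kBlockSize = 32768;  // 32KB blocks, as in log_format.h
      const int kHeaderSize = 7;

      size_t left = 100000;      // the record we want to write
      int block_offset = 32764;  // 4 bytes left in the block: forces a trailer
      bool begin = true;

      do {
        const int leftover = kBlockSize - block_offset;
        if (leftover < kHeaderSize) {
          // Same as lines 12 - 21: pad the tail of the block with zeros.
          if (leftover > 0) printf("trailer: %d zero bytes\n", leftover);
          block_offset = 0;
        }

        const size_t avail = kBlockSize - block_offset - kHeaderSize;
        const size_t fragment = (left < avail) ? left : avail;
        const bool end = (left == fragment);

        const char* type = (begin && end) ? "FULL"
                           : begin        ? "FIRST"
                           : end          ? "LAST"
                                          : "MIDDLE";
        printf("%-6s fragment of %zu bytes\n", type, fragment);

        block_offset += kHeaderSize + static_cast<int>(fragment);
        left -= fragment;
        begin = false;
      } while (left > 0);
      return 0;
    }

Running this prints a 4 byte trailer, then a FIRST fragment of 32,761 bytes, two MIDDLE fragments of the same size, and a LAST fragment of 1,717 bytes.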
And then we just emit the physical record to disk and move on. I am not really sure what the reasoning behind this fragmentation is. It may be to avoid having to read records that are far too big?
I looked at EmitPhysicalRecord to see what we have there, and it is nothing much: it writes the header, including the CRC computation, but that is pretty much it (a rough sketch follows below). So far, a lot of questions, but not a lot of answers. Maybe I'll get them when I start looking at the reading portion of the code. But that will be in another post.
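For reference, here is roughly what EmitPhysicalRecord does, reconstructed from the code and from doc/log_format.txt, so treat the details as my reading rather than gospel. The 7 byte header is a 32 bit CRC, a 16 bit little-endian length, and a one byte record type, and the CRC covers both the type and the payload:

    // Sketch of the body of EmitPhysicalRecord (type_crc_ is a per-type
    // precomputed CRC seed, EncodeFixed32 writes a little-endian uint32):
    char buf[kHeaderSize];
    buf[4] = static_cast<char>(length & 0xff);  // low byte of the length
    buf[5] = static_cast<char>(length >> 8);    // high byte of the length
    buf[6] = static_cast<char>(type);           // FULL / FIRST / MIDDLE / LAST
    uint32_t crc = crc32c::Extend(type_crc_[type], ptr, length);
    EncodeFixed32(buf, crc32c::Mask(crc));      // masked, so a CRC of a CRC is safe
    Status s = dest_->Append(Slice(buf, kHeaderSize));
    if (s.ok()) {
      s = dest_->Append(Slice(ptr, length));    // then the payload itself
      if (s.ok()) {
        s = dest_->Flush();
      }
    }
    block_offset_ += kHeaderSize + length;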
More posts in "Reviewing LevelDB" series:
- (26 Apr 2013) Part XVIII–Summary
- (15 Apr 2013) Part XVII– Filters? What filters? Oh, those filters…
- (12 Apr 2013) Part XV–MemTables gets compacted too
- (11 Apr 2013) Part XVI–Recovery ain’t so tough?
- (10 Apr 2013) Part XIV– there is the mem table and then there is the immutable memtable
- (09 Apr 2013) Part XIII–Smile, and here is your snapshot
- (08 Apr 2013) Part XII–Reading an SST
- (05 Apr 2013) Part XI–Reading from Sort String Tables via the TableCache
- (04 Apr 2013) Part X–table building is all fun and games until…
- (03 Apr 2013) Part IX- Compaction is the new black
- (02 Apr 2013) Part VIII–What are the levels all about?
- (29 Mar 2013) Part VII–The version is where the levels are
- (28 Mar 2013) Part VI, the Log is base for Atomicity
- (27 Mar 2013) Part V, into the MemTables we go
- (26 Mar 2013) Part IV
- (22 Mar 2013) Part III, WriteBatch isn’t what you think it is
- (21 Mar 2013) Part II, Put some data on the disk, dude
- (20 Mar 2013) Part I, What is this all about?
Comments
In case anyone else is interested, some of the reasons behind the log format are explained here http://leveldb.googlecode.com/git/doc/log_format.txt:
Some benefits over the recordio format:
(1) We do not need any heuristics for resyncing - just go to next block boundary and scan. If there is a corruption, skip to the next block. As a side-benefit, we do not get confused when part of the contents of one log file are embedded as a record inside another log file.
(2) Splitting at approximate boundaries (e.g., for mapreduce) is simple: find the next block boundary and skip records until we hit a FULL or FIRST record.
(3) We do not need extra buffering for large records.