Reverse engineering the Smaz compression library
Smaz is a small library for compressing small strings. It is strange to talk about reverse engineering such a project, because Smaz is open source and is available on GitHub. Nevertheless, the code on GitHub is only part of the story. It is dense, and it was generated by a script that isn't included. That means that when we read the code, we are missing some key components; in particular, we are looking at an end result without knowing how we got there.
Compression algorithms take advantage of repetitions in text to reduce the overall size. State of the art algorithms typically use Lempel-Ziv and Huffman coding to compress the data. But because they make no assumptions about the data, they need a fair amount of input before they can gather enough state to really pick up steam and start reducing the output length.
That means that for small strings (up to a hundred bytes or so), there is no real benefit in standard compression algorithms, and often you'll see an increase in the space taken. Smaz is meant to solve that for most common cases. It does so by using a prepared dictionary of common terms that can be easily compressed.
The first thing that you'll encounter when you read the Smaz code is a big table of escaped string constants.
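Here is a made-up excerpt to give the flavor (the entries below are invented for illustration; the real ones are produced by the missing generator script):

    static char *Smaz_cb[241] = {
        "\003the\001", "", "\002of\005\003and\007", "", "\001a\002",
        /* ...and so on, roughly 50 lines of similarly escaped strings... */
    };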
This goes on for about 50 lines, and at first glance, it looks utterly incomprehensible. We'll ignore this for now, and read further. The next items of interest are the two code book tables, Smaz_cb and Smaz_rcb. The cb suffix stands for code book, and rcb stands for reverse code book.
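The reverse code book sits right next to the hash table in the source; its shape is simple (a sketch, with the contents elided):

    /* one plain term per code byte; 254 entries, since the byte values 254 and 255
     * are reserved as markers, as we'll see later */
    static char *Smaz_rcb[254];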
It took me a while to figure it out, but the code book is actually a hash table. Each string in the array is a slot that can hold multiple entries, and each entry is composed of a single length byte, the term text, and then a byte holding the term's index in the code book. We use the length byte of each entry to step from one entry to the next within a slot.
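To make that concrete, take the made-up slot from the excerpt above and break it down:

    /* a hypothetical slot holding two entries: ("of", code 5) and ("and", code 7) */
    char *slot = "\002of\005\003and\007";
    /* entry 1: slot[0] == 2 (term length), slot[1..2] == "of", slot[3] == 5 (code book index) */
    /* entry 2: starts at slot + slot[0] + 2: length 3, "and", code 7                          */
    /* an empty slot is just "", so slot[0] == 0 and the scan stops immediately                */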
Let us see how this works:
    int smaz_compress(char *in, int inlen, char *out, int outlen) {
        unsigned int h1,h2,h3=0;
        int verblen = 0, _outlen = outlen;
        char verb[256], *_out = out;

        while(inlen) {
            int j = 7, needed;
            char *flush = NULL;
            char *slot;

            h1 = h2 = in[0]<<3;
            if (inlen > 1) h2 += in[1];
            if (inlen > 2) h3 = h2^in[2];
            if (j > inlen) j = inlen;
This code just sets up a temporary buffer and hashes the first few bytes of the current input. Then things start to get interesting.
    /* Try to lookup substrings into the hash table, starting from the
     * longer to the shorter substrings */
    for (; j > 0; j--) {
        switch(j) {
        case 1: slot = Smaz_cb[h1%241]; break;
        case 2: slot = Smaz_cb[h2%241]; break;
        default: slot = Smaz_cb[h3%241]; break;
        }
        while(slot[0]) {
            if (slot[0] == j && memcmp(slot+1,in,j) == 0) {
                /* Match found in the hash table,
The value j is the current substring size we are checking, and we use the hash value to pick a particular slot in the hash table. Note that only the first 3 bytes of the input are actually hashed. Once the appropriate slot is found, we check whether the entry's length is a match, and if so, whether the actual strings match. If they don't, we run the following code:
    slot += slot[0]+2;
This was a bit confusing at first, but what is basically going on here is that we move to the next entry in this slot. This is pretty neat, since it covers both empty slots (defined in the code as "") and multi-value slots, and it takes advantage of the fact that C strings end with \0 without having to specify it explicitly.
So what happens if we have a match?
In this case, we check if there are any verbatim bytes (bytes that we haven't been able to compress, which are stored in a temporary buffer). If there are such bytes, we do the following (a sketch of the code follows the list):
- Compute the space needed for the verbatim value.
- Set the next flush point in the buffer for verbatim values to the current output location.
- Move the output pointer after the location where the verbatim string will be written.
- Reduce the length of the remaining output size by the verbatim length.
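In the Smaz source this preparation looks roughly like the following; the variables are the ones from the function shown above:

    if (verblen) {
        needed = (verblen == 1) ? 2 : 2+verblen;   /* marker byte, plus a length byte for longer runs */
        flush = out;                               /* remember where the verbatim data will go        */
        out += needed;                             /* skip past the reserved space                    */
        outlen -= needed;                          /* and account for it in the remaining room        */
    }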
Note that we aren't actually going to write anything yet. First, we need to emit the code book index of the newly found match to the output:
This is done by first checking if there is space (if there isn't, we return an error).
Then we write a byte to the output buffer. But this is a very strange byte.
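The write itself is roughly this (a sketch from the same source):

    if (outlen <= 0) return _outlen+1;    /* no room left in the output buffer */
    out[0] = slot[slot[0]+1];             /* the strange byte */
    out++;
    outlen--;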
What is this "slot[slot[0]+1]"? Well, remember the structure that we talked about for the hash table entries? The first byte is a length, and the last byte is the index into the code book. So what we are actually doing is indexing into the entry to get its last byte, which is the code book index, and that is what we write to the output.
The rest is pretty much just bookkeeping. We move the input pointer past the term we just matched, and so on.
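Roughly, it amounts to this (again a sketch, using the variables from the function above):

    inlen -= j;    /* consume the matched term from the input */
    in += j;
    goto out;      /* jump to the verbatim flush handling discussed below */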
If no match was found, we just add the current byte to the verbatim buffer, and move on to the next one.
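That path is roughly the following; verb and verblen are the verbatim buffer and its length from the function above:

    verb[verblen] = in[0];   /* stash the byte we couldn't compress */
    verblen++;
    inlen--;
    in++;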
Now, let us look at this out label, and what it does:
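Sketching from the Smaz source (the exact bounds checks may differ slightly):

    out:
        /* flush when the verbatim buffer is full, or when the input is exhausted
         * and there is still something pending */
        if (verblen == 256 || (verblen > 0 && inlen == 0)) {
            needed = (verblen == 1) ? 2 : 2+verblen;
            flush = out;
            out += needed;
            outlen -= needed;
            if (outlen < 0) return _outlen+1;
        }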
Basically, it is a repeat of the code we saw earlier when a match is found: if the verbatim string has grown large enough, we need to flush it, so this takes care of that. The actual flushing is interesting:
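Roughly (same caveat, this is a sketch of the source):

    if (flush) {
        if (verblen == 1) {
            flush[0] = (signed char)254;        /* single verbatim byte marker */
            flush[1] = verb[0];
        } else {
            flush[0] = (signed char)255;        /* variable length marker */
            flush[1] = (signed char)verblen;
            memcpy(flush+2,verb,verblen);
        }
        flush = NULL;
        verblen = 0;
    }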
If there is a single byte, we write 254 (the single verbatim byte marker), then the byte itself. If there is more than one byte, we write 255 (the variable length marker), then the length, then the verbatim string.
It is interesting to note that the verbatim data can be written after the output byte for the following match has already been written. It took me a while to understand this. Nothing wrong with it, but I like sequential writes much better.
The decompression is very simple. Read a byte from the compressed input; if it is 254, then the next byte should be copied verbatim to the output. If the byte is 255, then read the next byte, which is the length, and copy that many of the following bytes to the output. Finally, if the byte is neither 254 nor 255, it is an index into the code book table, and the term should be taken from there.
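Here is a sketch of the decompression loop written from that description (it is not the real smaz_decompress, and the bounds checks are simplified; Smaz_rcb is the reverse code book mentioned earlier):

    #include <string.h>

    extern char *Smaz_rcb[254];   /* the reverse code book: code byte -> term */

    int smaz_decompress_sketch(unsigned char *in, int inlen, char *out, int outlen) {
        char *start = out;
        while (inlen > 0) {
            if (in[0] == 254) {                 /* single verbatim byte follows */
                if (outlen < 1) return -1;
                *out++ = (char)in[1]; outlen--;
                in += 2; inlen -= 2;
            } else if (in[0] == 255) {          /* length byte, then that many verbatim bytes */
                int len = in[1];
                if (outlen < len) return -1;
                memcpy(out, in + 2, len);
                out += len; outlen -= len;
                in += 2 + len; inlen -= 2 + len;
            } else {                            /* everything else is a code book index */
                char *term = Smaz_rcb[in[0]];
                int len = (int)strlen(term);
                if (outlen < len) return -1;
                memcpy(out, term, len);
                out += len; outlen -= len;
                in++; inlen--;
            }
        }
        return (int)(out - start);
    }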
I wrote a managed implementation of this, which allows me to play much more easily with the codebook definition. This was because there are a bunch of terms in the Smaz codebook that aren't really relevant for what we need. In particular, it looks like it was trained to find the most common values in a set of HTML documents, so it has entries for "<div>", "><", etc. Instead of going with the 254 values that can be compressed, I pruned the list a bit and ended up with just 245 items.
I was then able to change the compression format to be:
- If it is a value under 245, use the term from the codebook.
- If it is a value higher than 245, it is a verbatim value whose actual length needs to be figured out by subtracting 245 from it.
This allows me to save a byte if we have more than two verbatim bytes.
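In rough C terms, decoding one token in that format would look something like this (a sketch of the scheme above, not my actual managed implementation; codebook stands in for the pruned 245-entry table, and the value 245 itself is left unused here):

    #include <string.h>

    extern char *codebook[245];   /* hypothetical table of the pruned terms */

    /* decode one token starting at in[0]; returns the number of input bytes consumed.
     * Values 0..244 are code book references; values 246..255 mean a verbatim run of
     * (value - 245) bytes follows. */
    int decode_token(unsigned char *in, char **term, int *term_len) {
        if (in[0] < 245) {
            *term = codebook[in[0]];
            *term_len = (int)strlen(*term);
            return 1;
        }
        *term = (char *)(in + 1);
        *term_len = in[0] - 245;
        return 1 + *term_len;
    }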
Smaz is a really simple and clean library, but it isn't doing anything very complex. It is a static shared dictionary approach, without any of the more advanced techniques. I have previously written about more complex compression systems for small strings, such as FemtoZip. You can find the previous post here (there was a whole series of them).
In my next post, I'll try to compare the two options.
Comments
You could leave the first 1000 documents uncompressed, then learn a dictionary from them and use that for future documents.
What's the performance penalty for this compression? Is it possible to compress by "column"? There are some databases that do similar "column compression".
@Uri for that you need columnar storage (which Voron is not). Those types of storage are great for OLAP kinds of workloads, but not so good for other uses. So in the end it is a tradeoff between the use cases and the amount of compression you need.
@Mark you have a point there, but you also have to consider that now you have a single point of failure (the dynamic dictionary). If the storage (for any reason) loses the dictionary, all the records become unrecoverable. With a static dictionary you can still go over the entire database and recover all the data.
This reminds me of Brotli which is rolling out in Firefox and Chrome soon. (https://en.wikipedia.org/wiki/Brotli)
I did a managed port in '11 - code is very different though :)
https://github.com/poulfoged/SentenceCompression/blob/master/Source/Devchamp.SentenceCompressor/SentenceCompressor.cs
Mark, The problem with doing that is that you are basing your dictionary on the first 1000 documents. If they aren't similar to the rest, you are actually going to get pretty bad results. Consider the case of a system that first writes 10,000 users to the system (ported from another system), then starts writing all sorts of other documents. The same, by the way, is also true if you limit it to the same collection only. So the first few thousand records are from an old system, and all the new (and much more numerous) items are actually from the new system and have a radically different structure / values.
This leads to the requirement that you'll need to update the dictionary every now and then, which leads to versioning issues, maintaining the dictionary in the face of concurrent writes, etc. It is really complex.
Uri, On small documents, we are comparable in speed to standard JSON parsing with Newtonsoft.Json. On large documents / many documents, we see roughly a 20 - 30% speed penalty for using compression. But we also see roughly a 40% reduction in the size of the documents.
Poul, Your code is a lot more readable. But it also has a very high computational cost, because the abstractions you are using do a lot of hidden operations.
Did you see this library? https://github.com/garysharp/SmazSharp
@Joseph for general purpose compression we are willing to pay the price of lesser compression in exchange for runtime performance. Voron uses LZ4, which is by default at least twice as fast as deflate at compression and decompression tasks, at the expense of final size of course. Brotli has better space efficiency than deflate at the same performance profile (MB per second). But all three suffer in the small strings world, and that is where Smaz is used and shines in Voron.
Eric, Yes, I saw that. It took Smaz's way of doing all the work in code generation even further. It is pretty much impossible to read it without knowing ahead of time what it is doing.