Static dictionary vs. dynamic dictionary tryouts

In the previous post, I discussed the way the Smaz library works. Smaz uses a static shared dictionary that was produced once from a known training set. FemtoZip also uses a shared dictionary, but with a much more sophisticated system. To start with, it allows you to create the shared dictionary from your own data, rather than being limited to a fixed dictionary as with Smaz. However, that does require you to carefully pick your training set.

And of course, once you have picked a training set, it is very hard, if not impossible, to change it without complex versioning rules, because anything already compressed can only be read back with the dictionary it was written with. I would have expected FemtoZip to compress much better, but the results depend heavily on the training set, and selecting an appropriate one in a general-purpose manner is relatively hard.
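To make the shared dictionary idea concrete, here is a minimal sketch using the preset dictionary support in Python's zlib module. This is not FemtoZip's API and not how FemtoZip builds its dictionary; the sample dictionary and record are made up for illustration, and the helper names are my own. It only shows why a dictionary derived from your own data helps on short strings, and why both sides must agree on exactly the same dictionary.

```python
import zlib

def compress_with_dictionary(data: bytes, dictionary: bytes) -> bytes:
    # zdict primes the compressor with bytes it can refer back to.
    compressor = zlib.compressobj(level=9, zdict=dictionary)
    return compressor.compress(data) + compressor.flush()

def decompress_with_dictionary(blob: bytes, dictionary: bytes) -> bytes:
    # The reader must supply the very same dictionary the writer used.
    decompressor = zlib.decompressobj(zdict=dictionary)
    return decompressor.decompress(blob) + decompressor.flush()

# Hypothetical "training" output: the boilerplate shared by typical records.
shared_dictionary = b'{"name": "", "email": "", "country": ""}'
record = b'{"name": "Ann", "email": "ann@example.com", "country": "NL"}'

plain = zlib.compress(record, 9)
primed = compress_with_dictionary(record, shared_dictionary)

print(len(record), len(plain), len(primed))
```

If the dictionary changes after data has been written, the old blobs can no longer be decompressed with the new one, which is the versioning problem described above.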

In my tests, I wasn't able to find a training set that would do better than Smaz. I tried a bunch, including Smaz's own dictionary, but it seems that Smaz has done a pretty good job of selecting its dictionary entries. Or perhaps the strings I was testing just happened to be optimal for Smaz; I'm not sure.

I also looked at Shoco, but it took very little time to rule it out. It only supports ASCII strings, which is about as far from helpful as you can get in this day and age. Smaz isn't doing too great on non-ASCII characters either, but its output grows by a fixed amount per run of unfamiliar bytes, which is much better than doubling the size, as Shoco does.
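The difference between the two growth patterns is easy to see with a bit of arithmetic. The sketch below is a simplified model, not code from either library: it assumes that a run of unfamiliar bytes costs a fixed escape overhead (the two bytes per run of up to 256 bytes used here are an assumption for illustration), and compares that with a scheme that emits one extra byte for every unfamiliar byte.

```python
import math

def escaped_size(n: int, escape_overhead: int = 2, max_run: int = 256) -> int:
    """Output size for n unfamiliar bytes when each run of up to max_run
    bytes is stored verbatim behind a fixed escape_overhead (assumed values)."""
    if n == 0:
        return 0
    runs = math.ceil(n / max_run)
    return n + runs * escape_overhead

def doubled_size(n: int) -> int:
    """Output size for n unfamiliar bytes when every byte needs a marker byte."""
    return 2 * n

for n in (10, 100, 1000):
    print(n, escaped_size(n), doubled_size(n))
```

Under these assumptions, the fixed overhead becomes negligible as the unfamiliar run gets longer, while the doubling scheme always pays 100%.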

For our needs, I believe that we'll need to have a static version of the dictionary anyway, so there isn't a major benefit to being able to train it.