Static dictionary vs. dynamic dictionary tryouts
In the previous post, I discussed how the Smaz library works. Smaz uses a static shared dictionary that was produced once from a known training set. FemtoZip also uses a shared dictionary, but it takes a much more sophisticated approach. To start with, it lets you build the shared dictionary from your own data, rather than the fixed one that Smaz ships with. However, that does require you to pick your training set carefully.
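To make the contrast concrete, this is roughly what using Smaz looks like: the shared dictionary is compiled into the library, so there is no training or loading step at all. The strings and buffer sizes below are just placeholders for the sketch; FemtoZip, by contrast, needs a model built from your own sample documents before you can compress anything.

```c
#include <stdio.h>
#include <string.h>
#include "smaz.h" /* smaz_compress / smaz_decompress from the Smaz library */

int main(void) {
    const char *text = "the quick brown fox jumped over the lazy dog";
    char compressed[256];
    char restored[256];

    /* The shared dictionary is baked into the library, so there is
       nothing to train or load - we just compress and decompress. */
    int clen = smaz_compress((char *)text, (int)strlen(text),
                             compressed, sizeof(compressed));
    int dlen = smaz_decompress(compressed, clen,
                               restored, sizeof(restored));

    printf("original: %zu bytes, compressed: %d bytes, restored: %d bytes\n",
           strlen(text), clen, dlen);
    return 0;
}
```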
And of course, once you have picked a training set, it is very hard, if not impossible, to change it without complex versioning rules. I would have expected the FemtoZip version to compress much better, but it depends heavily on the training set it was given, and selecting an appropriate one is relatively hard to do in a general purpose manner.
In my tests, I wasn't able to find a training set that would do better than Smaz. I tried a bunch, including using Smaz's own dictionary, but it seems like Smaz has done a pretty good job of selecting the relevant entries. Or perhaps the strings I was testing were simply a good fit for Smaz; I'm not sure.
I also looked at Shoco, but it took very little time to rule it out. It only supports ASCII strings, which is about as far from helpful as you can get in this day and age. Smaz isn't doing too great on non-ASCII characters either, but its output grows by a fixed overhead per run of unfamiliar characters, which is much better than doubling the size, as Shoco does.
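A quick check along these lines is enough to see that behavior on unfamiliar text; the exact overhead depends on how Smaz escapes verbatim runs, so treat the output as illustrative rather than a benchmark.

```c
#include <stdio.h>
#include <string.h>
#include "smaz.h"

/* Compress a string and report the size ratio, to see how the library
   behaves on text its static dictionary was not built for. */
static void report(const char *label, const char *text) {
    char out[1024];
    int inlen = (int)strlen(text);
    int outlen = smaz_compress((char *)text, inlen, out, sizeof(out));
    printf("%-8s in=%4d out=%4d ratio=%.2f\n",
           label, inlen, outlen, (double)outlen / inlen);
}

int main(void) {
    report("english", "this is a fairly ordinary english sentence");
    /* Non-Latin UTF-8 text falls outside the static dictionary, so we
       expect some expansion - the question is how much. */
    report("hebrew", "\xd7\xa9\xd7\x9c\xd7\x95\xd7\x9d \xd7\xa2\xd7\x95\xd7\x9c\xd7\x9d");
    return 0;
}
```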
For our needs, I believe that we'll need to have a static version of the dictionary anyway, so there isn't a major benefit in being able to train one.
Comments
How are different languages handled with these libs? An example would be a database, something like Wikipedia, that has content in multiple languages. Depending on the allowed performance penalty for store operations, several codebooks could be provided, and the database tries some of them on a whole document and takes the codebook that gives the best result; once a sufficient compression ratio is reached, the result is accepted without trying all available codebooks. Dynamic sorting of codebook priority (e.g. most documents perform best, or well enough, using codebook X) might reduce the penalty. The identifier of the codebook could be stored in the metadata - see the sketch below.
This would also allow migration to newer / better / optimized codebooks over time without having to re-process all existing documents. Just modifying an existing codebook is still an issue, though.
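In rough C, the tryout loop described above might look something like this; the codebook registry and the compress_with call are hypothetical placeholders for whatever library is actually in use, and the threshold is arbitrary.

```c
#include <string.h>

/* Hypothetical codebook registry - the fields and compress_with() are
   placeholders, not a real library API. */
typedef struct {
    int id;                 /* identifier stored in document metadata */
    const void *dictionary; /* trained shared dictionary */
} codebook_t;

/* Assumed to compress 'in' using the given codebook; returns the
   compressed length, or <= 0 on failure. */
extern int compress_with(const codebook_t *cb,
                         const char *in, int inlen,
                         char *out, int outcap);

/* Try codebooks in (dynamically sorted) priority order and stop as soon
   as one reaches a "good enough" ratio, recording which one was used. */
int pick_codebook(const codebook_t *books, int count,
                  const char *doc, int doclen,
                  char *out, int outcap, int *used_id) {
    const double good_enough = 0.5; /* arbitrary threshold for the sketch */
    char trial[64 * 1024];          /* scratch buffer, sized arbitrarily */
    int best_len = -1;
    for (int i = 0; i < count; i++) {
        int len = compress_with(&books[i], doc, doclen, trial, sizeof(trial));
        if (len <= 0 || len > outcap)
            continue;
        if (best_len < 0 || len < best_len) {
            best_len = len;
            *used_id = books[i].id;
            memcpy(out, trial, (size_t)len);
        }
        if ((double)len / doclen <= good_enough)
            break; /* acceptable result, skip the remaining codebooks */
    }
    return best_len; /* -1 means: store the document uncompressed */
}
```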
You usually don't, mainly because the runtime cost of trying all the potential codebooks against the data makes it prohibitive. For example, just a single Smaz tryout puts the compressed format (even if you end up keeping the uncompressed value) on par with or worse than the standard parser. Adding an additional pass would push that cost into the unacceptable category. So the best thing you can do is check whether you can do better, and if you can't, just store the data as is and not make whoever reads it waste CPU when it isn't worth it.
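In code, the single tryout plus fallback is roughly the following; the one-byte flag prefix is just an illustration of the bookkeeping, not an actual storage format.

```c
#include <string.h>
#include "smaz.h"

#define FLAG_RAW        0
#define FLAG_COMPRESSED 1

/* Single tryout: compress once, keep the result only if it is actually
   smaller, otherwise store the raw bytes so readers pay no CPU for it. */
int store_string(const char *in, int inlen, char *out, int outcap) {
    if (outcap < inlen + 1)
        return -1; /* output buffer too small even for the raw copy */

    int clen = smaz_compress((char *)in, inlen, out + 1, outcap - 1);
    if (clen > 0 && clen < inlen) {
        out[0] = FLAG_COMPRESSED;
        return clen + 1;
    }

    /* Not worth it - store the value as is. */
    out[0] = FLAG_RAW;
    memcpy(out + 1, in, (size_t)inlen);
    return inlen + 1;
}
```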
Daniel, Smaz is pretty good even for non-English languages that use the Latin alphabet; the sample docs mention Italian. It does poorly for Unicode text (such as Hebrew), but that is fine, because we do one run and then back off if the compression rate isn't good.
The idea of a dynamic codebook is nice, but in practice it is really complex, and it doesn't even take different languages to break it. Consider this blog: I used to blog a lot about testing, mocking, NHibernate, etc. Now I blog a lot about RavenDB, optimizations, I/O, etc.
A codebook that would compress the old posts nicely is no longer relevant, so now you need to re-evaluate the codebook and try to find a better match. In that case, it is usually better to use a dynamic dictionary system, but that requires having enough data to actually generate the dictionary.
It's not only about language. Say, for example, I want to provide some kind of URL shortener, or many of my documents carry a list of references to other documents. Neither data type (both of which may be strings) needs to contain natural or spoken language at all.
Oren, as you mentioned, the way language is used changes over time, and that's difficult to handle.
Maybe the system could use background processing to generate (probably) better dictionaries to be used by upcoming insert operations. That adds no penalty to the normal insert operation and uses available CPU cycles for self-optimization. But the implementation of this is really hard - finding common parts across a large set of data is essentially a deduplication problem.
On the other hand, deciding once and for all on a shared or otherwise static "right" dictionary is not easy either. The case of a long-lived dataset that changes over time is especially problematic.
It feels rigid in some way. I personally would prefer the dynamic approach, as I dislike having to make a decision about an unknown dataset - the dynamic part is harder to build, but it doesn't make me commit to the wrong codebook (Murphy's law :-)).