
Aug 6, 2013

CSS Compression: Block Sorting

This week, I decided to take a stab at a different approach to compression and attempted to create a 'text compression booster', that is, a process to modify a file such that it gzips smaller than the original file. The perfect target? CSS blocks.


To this end, it’s important to know that GZIP relies on a modified LZ77 codec, which employs a sliding-window compression system. This type of algorithm considers a local set of bytes in the stream and searches the bytes that appear before that set in an attempt to find matches. This type of system exploits locality, or rather, it assumes that similar occurrences of a string tend to be clustered in a file. As you can imagine, the more previous bytes you check, the more accurate and longer the matches you can find. In fact, some of the more successful LZ77 encoders increase the window size close to available memory limits in order to find the best matches.
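
As a rough illustration (a small Python sketch; zlib’s DEFLATE engine uses the same 32K LZ77-style window that gzip does), a string repeated while its first copy is still inside the window compresses down to a short back-reference, while the same repetition pushed out of the window by unrelated data has to be coded all over again:

    import os
    import zlib

    block = b"body { color: #333; margin: 0; padding: 0; }\n"
    filler = os.urandom(64 * 1024)   # incompressible bytes, larger than the 32K window

    near = block * 2                 # second copy sits right next to the first
    far = block + filler + block     # second copy lies beyond the sliding window

    print(len(zlib.compress(near)))                # repeat is found in the window: tiny output
    print(len(zlib.compress(far)) - len(filler))   # repeat is out of reach: it must be coded again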


CSS can be described as a block-based format; or rather, there are logical blocks of text (grouped with brackets) whose order can be independent of each other in the file, and whose block contents can be ordered independently as well. Despite some edge cases built around cascading, most CSS can get away with being arbitrarily re-organized and still produce valid results.


Knowing this, it seems like a logical proposal that shuffling our CSS blocks around could yield increased compression, since more correlated symbols would sit together inside the sliding window.


Rather than being at the mercy of the sliding window, we explicitly re-position the blocks so that they should yield the smallest result.


A proposed algorithm

The steps:
  1. parse a CSS file into memory, and tokenize it by each property block
    1. block name and brackets are included in a block
  2. sort the blocks by lowest entropy (or better, by estimated bits required, derived from entropy)
  3. test-string = block[0]; indexlist[0] = 0
  4. while not all blocks have been processed
    1. for each remaining block i
      1. newString = test-string + block[i]
      2. len = length of LZ77.compress(newString)
      3. if len < minLen, mark this block as the smallest so far
    2. indexlist.append(smallestBlockIndex)
    3. test-string += smallestBlock.text
    4. remove the smallest block from the option pool

Effectively, try each remaining block against LZ77 to see which one yields the best result, and encode that as the next block (see the sketch below).
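
Here is a minimal sketch of that greedy ordering in Python, with zlib standing in for my LZ77 encoder, a deliberately naive regex tokenizer (it ignores @media nesting and comments), and a placeholder file name:

    import re
    import zlib

    def tokenize_blocks(css):
        """Split CSS into top-level 'selector { ... }' blocks (naive: no nested braces)."""
        return re.findall(r"[^{}]+\{[^{}]*\}", css)

    def greedy_order(blocks):
        """Repeatedly append whichever remaining block compresses best with the text so far."""
        remaining = list(blocks)
        # Seed with the block that compresses smallest on its own, a rough stand-in
        # for the "sort by lowest entropy" step above.
        remaining.sort(key=lambda b: len(zlib.compress(b.encode())))
        test_string = remaining.pop(0)
        ordered = [test_string]
        while remaining:
            best_i = min(range(len(remaining)),
                         key=lambda i: len(zlib.compress((test_string + remaining[i]).encode())))
            best = remaining.pop(best_i)
            ordered.append(best)
            test_string += best
        return "".join(ordered)

    css = open("bootstrap.css").read()   # placeholder input file
    resorted = greedy_order(tokenize_blocks(css))
    print(len(zlib.compress(css.encode())), len(zlib.compress(resorted.encode())))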

Horrible failure
Testing a few CSS files showed that not only did this approach not produce savings, but in most cases it caused bloat.


file            gzip’d    resorted gzip’d
dyn_1.css        3,126     3,145 (growth of 0.6%)
bootstrap.css   15,543    15,594
bs3.css         15,686    15,801
10.css          61,541    61,948

To be fair, my LZ77 parser was less than optimal: I only used 10 MB of memory for the search space, so there’s a chance that it was not producing the best measurable output compared to what GZIP would be doing.


What’s worse is that calculating the results for the single 6K file took about 53 minutes.


A faster measurement

By re-encoding the string for each candidate block, we’re effectively measuring the number of bytes saved by each proposed ordering. As the string size or the number of blocks increases, our algorithm gets slower. Ideally, we would like a way to test how the compression would respond to a proposed new string without having to re-encode every possible combination.


To this end, I found a statistical measure that seems to fit the bill: KL divergence. From Wikipedia:


“KL measures the expected number of extra bits required to code samples from P when using a code based on Q, rather than using a code based on P.”


Effectively, we can replace the actual LZ77 compression pass with an estimate that runs faster, uses less memory, and still gives measurable results.
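
One plausible way to wire this in, sketched in Python under my own assumptions (byte-level frequency distributions with add-one smoothing, and a hypothetical pick_next_block helper replacing the brute-force inner loop):

    import math
    from collections import Counter

    def byte_distribution(data, alphabet=256):
        """Byte-frequency distribution with add-one smoothing, so no probability is zero."""
        counts = Counter(data)
        total = len(data) + alphabet
        return [(counts.get(b, 0) + 1) / total for b in range(alphabet)]

    def kl_divergence(p, q):
        """D_KL(P || Q): expected extra bits to code samples from P using a code built for Q."""
        return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)

    def pick_next_block(test_string, candidates):
        """Pick the candidate whose byte distribution diverges least from the text so far."""
        q = byte_distribution(test_string.encode())
        return min(candidates, key=lambda b: kl_divergence(byte_distribution(b.encode()), q))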


The good news is that with a small implementation, re-running the tests took only 12 minutes to complete on a 305K file, and a comparison against the LZ77 brute-force approach showed about a 2% difference in block selection between the two methods, meaning the estimate is accurate enough to warrant valid usage. (Or rather, using this metric produces 98% of the same decisions that the brute-force method made.)
Sadly, though, we still saw no savings over the original file size.


Source size (Gmail10.css)    gmail10.gz    sorted_gmail10.gz
312,906                      60,541        61,948 (growth of 2%)


Even with >10K files, this re-ordering yielded no observable savings in the final file.


Why no change?
Recall that GZIP is composed of two parts, the LZ77 and Huffman encoders. LZ77 is a dictionary algorithm; it’s built to reduce repetitions of characters to smaller tokens. Huffman is an entropy encoder; it produces the minimal bit-string based upon the frequency of each token.
We’ve shuffled the CSS blocks around, which did nothing to change the entropy of the stream; the letter A still occurs just as frequently as it did before. Thus we don’t expect to get any savings from the Huffman stage.
Really, this type of reshuffling is targeted at the LZ77 phase, where we’d hope that grouped symbols would find more matches in the sliding-window process. However, it’s worth noting that the default window size for LZ77 is about 32K to 64K (depending on your extensions), meaning we would need a larger file, with a large number of moved blocks, to see any savings (assuming that blocks within the window tend to be heterogeneous).
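
As a quick sanity check of the entropy argument, here is a toy Python example using byte-level Shannon entropy: shuffling whole blocks never changes the symbol frequencies, so the entropy (and therefore the Huffman table) stays exactly the same:

    import math
    import random
    from collections import Counter

    def shannon_entropy(data):
        """Bits per byte, computed from the byte-frequency distribution."""
        counts = Counter(data)
        n = len(data)
        return -sum(c / n * math.log2(c / n) for c in sorted(counts.values()))

    blocks = ["a{color:red}", "b{color:blue}", "p{margin:0}"]   # toy CSS blocks
    original = "".join(blocks)
    shuffled = "".join(random.sample(blocks, len(blocks)))      # random permutation of the blocks

    # Same bytes, same frequencies: identical entropy before and after the shuffle.
    print(Counter(original.encode()) == Counter(shuffled.encode()))                 # True
    print(shannon_entropy(original.encode()), shannon_entropy(shuffled.encode()))   # identical values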


Conclusions

In the end, block-sorting our file did nothing to help GZIP. In fact, most of the time it produced bloat, which is a little scary for me. The Occam’s razor assumption is that my encoder is somehow adding bloat to the file, skewing the results. The other possibility suggests that there is, in fact, an optimal and a sub-optimal layout for CSS; otherwise, why would bloat occur?


Thinking of this a different way, the Burrows-Wheeler Transform (or BWT) is another block-based transform that tends to be a critical component of most cutting-edge compressors out there (not to mention the backbone of BZIP2, which we’ve noted consistently beats GZIP in size and speed). It works by transforming the string into a semi-sorted representation, such that the data is more compressible by statistical encoders like RLE and arithmetic compressors; but as noted by many papers, BWT does little to help LZ77 or Huffman.
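
For the curious, here is a minimal (and deliberately naive) Python sketch of the transform itself; production implementations such as BZIP2’s avoid materializing every rotation:

    def bwt(s, end="\0"):
        """Naive Burrows-Wheeler Transform: sort every rotation, keep the last column."""
        s = s + end                 # unique end marker so rotations sort unambiguously
        rotations = sorted(s[i:] + s[:i] for i in range(len(s)))
        return "".join(r[-1] for r in rotations)

    print(repr(bwt("banana")))      # 'annb\x00aa': like characters end up clustered together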


This matches my tests: both block-based sorting methods failed to produce savings as a compression booster for GZIP.

~Main
