
Aug 13, 2013

JSON Compression : Transpose & Binary

JSON makes up a large share of the content sent around the internet, especially for social networking sites and HTML5 games. Looking at most of this data, it's apparent there's a structural layout we can take advantage of to boost GZIP compression.


Are JSON sizes a problem?

The latest reports from HTTP Archive clearly show that the web is mostly images, JS, and CSS files. But knowing how HTTP Archive's tests work, we have to assume there's a large amount of traffic not accounted for.


The amount of JSON data sent around the web each day remains largely unmeasured. HTTP Archive gets a general sense of it: if a home page receives a block of JSON that was transferred with a MIME type grouped with JavaScript, it'll be reported there. However, HTTP Archive doesn't include non-landing-page URLs, which is typically where the bulk of JSON transfer occurs, nor does it measure the amount of JSON sent to native applications every day.


Tracking it down would be easier if all developers used the right MIME type, but as mentioned, this is not the case: developers can easily send JSON data around as application/javascript or application/text, which makes accurate measurement pretty hard.


The need to minimize JSON

JSON has been adopted as a format for data transfer because, among other reasons, it is usually more compact than alternatives such as XML. It also matches the object-literal syntax in JavaScript, which has helped parsing support find its way into many languages besides JavaScript.


Even with its savings over XML, JSON can still be considered a horribly bloated format for a number of reasons.
  1. All data must be converted to a text-based encoding
  2. Excessive usage of quotes for property names
  3. When multiple objects are serialized in the same message, the key names for each property must be repeated, even though they are the same for each object.


With the movement towards client-side asynchronous programming models (like AJAX), many web applications rely on JSON transport between client and server for client-side display (a minimal sketch of this round trip follows the list):
  1. The client requests some block of data
  2. The server provides data to the client in the form of a JSON blob
  3. Client receives JSON, and uses it to format / display the information on the screen
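
In Python terms, that round trip is just a request, a JSON payload, and a parse. Here's a minimal sketch using only the standard library; the endpoint URL is made up for illustration:

import gzip, json
from urllib.request import Request, urlopen

# Step 1: the client requests some block of data (hypothetical endpoint).
req = Request("https://example.com/api/search?q=ducks",
              headers={"Accept-Encoding": "gzip"})

with urlopen(req) as resp:
    body = resp.read()
    # Step 2: the server answered with a (possibly gzipped) JSON blob.
    if resp.headers.get("Content-Encoding") == "gzip":
        body = gzip.decompress(body)

# Step 3: parse the blob and hand it off for display.
results = json.loads(body)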


You're seeing more and more of the web work this way, including popular sites like Pinterest, which sends the client a JSON blob when you do a search for 'ducks'.
In any case, if your application makes heavy use of JSON, you should consider what it's costing you, and figure out how to minimize that cost.


Uses of JSON

In general, JSON has three primary use cases, each of which has its own compression needs.

Statically built server-owned JSON data

These types of JSON files are typically bundled with the website, often generated by build systems. They may describe a file-system layout, or hold templating data to send to the client. HTML5 games make heavy use of this type of data to describe context information for maps & game stats.
These files are low-hanging fruit for compression: you can spend lots of offline time computing a minimal representation, allowing for the fastest distribution to the client.

Dynamically server-built JSON data

For web applications that resemble storefronts, or social-media information exchanges, a client typically queries a server for the results of some database operation and receives those results in JSON form.
Typically, this type of JSON content is critical-path: while it's being constructed, the user is generally watching a "loading" indicator of some kind. As such, you may need to evaluate whether it's worth compressing the data versus just sending it to the user quickly.

Dynamically client-built JSON data

In many cases, the browser will send JSON information back to some server, and compression can be useful when the JSON consists of complex data structures.
Web games are a great producer of this type of information, sending player stats and player actions to the server, sometimes resulting in multiple megabytes of data. In that direction, automated GZIP compression is tricky: you'd need to ship a JavaScript GZIP compressor to the client, whose size may invalidate the savings you get from the compression you perform.


Each use case has its own specific needs and problems to solve, but in general, some basic costs and reduction techniques apply across all of them.

Transposing JSON

It's apparent that there's lots of redundancy in JSON data, especially in highly structured content (for example, the results of a search for "#perfmatters" on Twitter), where there are large collections of dictionary objects, each one listing the same property names alongside its values.


Effectively, a JSON object is made up of key-value pairs, where the "key" portion is repeated for each instance of the structure, adding bloat to the file. When you have an array that lists a set of dictionary objects, the name of each property must be re-listed for every instance of that dictionary.


[
    {
        "name": "alex",
        "pos": "AUS"
    },
    {
        "name": "Colt",
        "pos": "USA"
    }
]


For large JSON files, which list many elements in this form, the amount of overhead per each instance of “name” and “pos” contributes a great deal to the final byte-size.


If we group together all the values for each instance of the "name" property and list them in an array, we reduce the instances of the "name" key down to one, as shown in the trivial example below:


{
    "name": ["alex", "Colt"],
    "pos": ["AUS", "USA"]
}


Or, as a diagram of the process:

[Figure: an array of dictionaries being transposed into a dictionary of arrays]

How it works

To accomplish this, we walk the JSON object to find any array whose member objects are all dictionaries. Once we find such an array, we can easily transpose it.
Note that this process only works if all dictionary objects in the array have exactly the same set of property names. Otherwise, we end up in the awkward position of needing to track which elements had which properties somewhere in our data file, which would bloat our results and complicate our decoder.
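
Here's a minimal sketch of that search-and-transpose step in Python. This is illustrative only, not the actual XFJSON implementation:

import json

def transpose(node):
    # Recurse through the document looking for transposable arrays.
    if isinstance(node, dict):
        return {k: transpose(v) for k, v in node.items()}
    if isinstance(node, list):
        # Transposable: non-empty, all dicts, all sharing one key set.
        if (node and all(isinstance(el, dict) for el in node)
                and all(el.keys() == node[0].keys() for el in node)):
            # A recoverable encoder would also attach a flag here (see below).
            return {k: [transpose(el[k]) for el in node] for k in node[0]}
        return [transpose(el) for el in node]
    return node

doc = json.loads('[{"name": "alex", "pos": "AUS"}, {"name": "Colt", "pos": "USA"}]')
print(json.dumps(transpose(doc)))  # {"name": ["alex", "Colt"], "pos": ["AUS", "USA"]}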



Recovering a transpose

It's worth noting that you can modify your client-side JavaScript to operate directly on this transposed form of the data. That would be ideal, since recovering from the transpose takes time and extra memory. If, however, updating your source code to work with the format is not an option, you'll need to reverse the transpose once the JSON reaches its destination.
To accomplish this, the encoder can attach an "IS TRANSPOSED" flag to each object that has been transformed, which the client can find quickly and use to reverse the effects.
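
As a sketch, the recovery walk might look like this. The sentinel key "__transposed__" is my invention for illustration; the actual flag XFJSON writes may be spelled differently:

def detranspose(node):
    # Reverse the transform wherever the encoder left its marker.
    if isinstance(node, dict):
        # pop() removes the marker so it doesn't leak into the output.
        if node.pop("__transposed__", False):
            keys = list(node)
            columns = [[detranspose(v) for v in node[k]] for k in keys]
            return [dict(zip(keys, row)) for row in zip(*columns)]
        return {k: detranspose(v) for k, v in node.items()}
    if isinstance(node, list):
        return [detranspose(el) for el in node]
    return node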



Note that keeping your processed JSON data recoverable will decrease the compression savings you get from the XFJSON process. Working directly with the transformed data on the client can be a large win in terms of processing time, since you don't waste cycles decoding it.
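
For example, reading the transposed columns in lockstep, rather than indexing row objects (shapes taken from the earlier example):

cols = {"name": ["alex", "Colt"], "pos": ["AUS", "USA"]}

# Instead of data[i]["name"], walk straight down the column arrays.
for name, pos in zip(cols["name"], cols["pos"]):
    print(name, pos)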


Transpose results




(all sizes in bytes)

file            src       src.gz    transposed    transposed.gz
amazon.json     586,750   54,559    586,742       55,230
simple.json     1,087     402       460           344
twitter.json    100,605   17,756    100,966       17,952
videos.json     206,467   60,762    177,164       58,513
pintrest.json   12,449    1,776     12,327        1,762


The best gains from XFJSON come from larger JSON blocks with lots of repeated dictionary objects. Below that, you can see that the post-gzip'd data is actually inflated.


Going further with binary

There's been work to minimize JSON, or rather, to re-represent it in a separate, less verbose format. Most of these alternatives work well for their intended use cases, but they require some client-side reconstruction of the data from the source format back into valid JSON.
Binary forms make sense once the data inside your file can be represented in fewer bits as a number than as a string. For example, the string "3.141592653589793" takes only 8 bytes of memory when represented as a double-precision number, but in a JSON file, where numeric values are listed as human-readable strings, it is 17 bytes long (or longer, depending on your character encoding).
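
A quick sanity check of those two sizes in Python:

import struct

pi_text = "3.141592653589793"
print(len(pi_text.encode("utf-8")))            # 17 bytes as JSON text
print(len(struct.pack("<d", float(pi_text))))  # 8 bytes as an IEEE 754 double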


For this reason, there exist open binary JSON formats like BSON and MSGPACK, which pack JSON data into a binary format for smaller delivery. These two formats offer more options and encoding types, which may or may not be needed for your application.


In order to be a single-file solution, XFJSON includes a simplistic binary packer, so that you can get an idea of what kind of savings the transform offers (you can explore the savings of the other formats if you find value here). The XFJSON binary transform is quite simplistic, using the type information the Python JSON module exposes, which makes it easy to determine whether 3.141592653 is a floating-point value or a string, but it ignores more advanced types like DateTime and Null.


This is a simplistic format, much like the one I used in my Binary CSS format article, so I’ll spare you the details here ;)


JSON ASCII - 59 bytes:

{
    "name": "html5rocks.com",
    "weight": 1.7321232
}

Binary-hex form - 41 bytes:

1a00 f102 0004 006e 616d 65b0 0e00 6874
6d6c 3572 6f63 6b73 2e63 6f6d 0600 7765
6967 6874 9037 b6dd 3f
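
To make the idea concrete, here's a minimal tag-length-value packer in the same spirit. The type tags and layout are made up for illustration and do not reproduce the exact XFJSON byte stream shown above:

import struct

# Hypothetical one-byte type tags; XFJSON's real tags differ.
T_STR, T_FLOAT, T_INT, T_DICT, T_LIST = range(5)

def pack(node):
    if isinstance(node, str):
        raw = node.encode("utf-8")
        return struct.pack("<BH", T_STR, len(raw)) + raw
    if isinstance(node, float):
        return struct.pack("<Bd", T_FLOAT, node)   # 8-byte IEEE 754 double
    if isinstance(node, int) and not isinstance(node, bool):
        return struct.pack("<Bq", T_INT, node)     # 8-byte signed integer
    if isinstance(node, dict):
        body = b"".join(pack(k) + pack(v) for k, v in node.items())
        return struct.pack("<BH", T_DICT, len(node)) + body
    if isinstance(node, list):
        body = b"".join(pack(el) for el in node)
        return struct.pack("<BH", T_LIST, len(node)) + body
    raise TypeError("type not handled in this sketch: %r" % type(node))

blob = pack({"name": "html5rocks.com", "weight": 1.7321232})
print(len(blob))  # smaller than the 59-byte ASCII form above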



Transform + binary results

Here are some basic results for files in the XFJSON repository. You can see how each one stacks up against the various encoding methods, both with and without GZIP.



(all sizes in bytes)

file            src       src.gz    transposed   transposed.gz   trans & bin   trans & bin.gz
amazon.json     586,750   54,559    586,742      55,230          374,326       54,973
simple.json     1,087     402       460          344             434           384
twitter.json    100,605   17,756    100,966      17,952          94,829        20,000
videos.json     206,467   60,762    177,164      58,513          174,585       60,698
pintrest.json   12,449    1,776     12,327       1,762           11,399        1,836



So, binary didn't help that much. In the few cases where the post-gzip'd binary file was smaller than its non-binary counterpart, the savings still weren't much better than just zipping the source file. This isn't a surprise; we've seen that GZIP does not like being handed binary data. It hates on that.


Approaching the GZIP threshold

By this time, you should all be enabling GZIP to distribute your assets to users, which means you need to be aware that in some cases, transforming a file can strip out the redundancy (the very patterns GZIP feeds on), such that GZIP fails to compress the transformed file as well as the source text.


This is referred to as the gzip threshold: the point at which the compressed version of the modified file is larger than the compressed source file. This demands your attention, as there are even edge cases where crossing the gzip threshold actually inflates your transformed data before it's sent across the wire.


With this in mind, before embracing sweeping adoption of a compression technique, make sure you test whether the compression is helpful on a per-file basis, and as always, optimize the places where you can have the most impact.
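
That per-file test can be as simple as comparing gzipped sizes; a minimal sketch:

import gzip

def worth_shipping_transform(src: bytes, transformed: bytes) -> bool:
    # Stay on the right side of the gzip threshold: only ship the
    # transformed file if its gzipped size beats the gzipped source.
    return len(gzip.compress(transformed)) < len(gzip.compress(src))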





Moving forward



As we've seen, transposing JSON data can yield significant savings for highly structured data, but you have to be careful to stay on the positive side of the GZIP threshold; otherwise you're just bloating your file and wasting your time.


Converting your format to binary can still yield significant wins; you can really consider it a different form of minimization, where we're tokenizing data such that it's recoverable. But once again, be careful not to remove too much, otherwise you'll lose the GZIP game again.


If you're building JSON-heavy applications, then it's worth checking out the source code for this article, which is available on my GitHub account. The XFJSON module will let you transpose and binary-pack your JSON data using Python. (I will happily accept pull requests for a JavaScript version ;)



