Say I have a single-page application that uses a third party API for content. The app’s logic is in-browser only, and there is no backend I can write to.
To allow deep-linking into the state of the app, I use pushState to keep track of a few variables that determine the state of the app (note that Ubersicht’s public version doesn’t do this yet): in this case `repos`, `labels`, `milestones` and `username` (strings), plus `show_open`, `with_comments` and `without_comments` (booleans). The URL format is `?label=label_1,label_2,label_3&repos=repo_1…`. Values are the usual suspects, roughly `[a-zA-Z][a-zA-Z0-9_-]`, or a boolean indicator.
So far, so good. Now, since the query string can be a bit long and unwieldy, I would like to be able to pass around URLs like `http://espy.github.io/ubersicht/?state=SOMEOPAQUETOKENTHATLOSSLESSLYDECOMPRESSESINTOTHEORIGINALVALUES#hoodiehq`, and the shorter the better.
My first attempt was going to be some zlib-like algorithm (https://github.com/imaya/zlib.js), but @flipzagging pointed to antirez/smaz (https://github.com/antirez/smaz), which sounds more suitable for short strings (JavaScript version at https://github.com/personalcomputer/smaz.js).
Since `=` and `&` are not specifically handled in https://github.com/personalcomputer/smaz.js/blob/master/lib/smaz.js#L9, we might be able to tweak things a little there.
Furthermore, there is the option of encoding the values against a fixed table, i.e. the order of arguments is pre-defined and all we need to keep track of is the actual values: e.g. turn `a=hamster&b=cat` into `7hamster3cat` (length + chars) or `hamster|cat` (value + `|`), potentially before the smaz compression.
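A minimal sketch of that fixed-order idea, using my parameter names from above (the helper names are just for illustration):

```js
// Pack values in a pre-agreed parameter order, so only values travel.
// "|" separates string fields; the booleans collapse to one 0/1 char each.
const ORDER = ['repos', 'labels', 'milestones', 'username'];
const FLAGS = ['show_open', 'with_comments', 'without_comments'];

function pack(params) {
  const values = ORDER.map(k => (params[k] || []).join(','));
  const flags = FLAGS.map(k => (params[k] ? '1' : '0')).join('');
  return values.join('|') + '|' + flags;
}

function unpack(str) {
  const parts = str.split('|');
  const flags = parts.pop();
  const params = {};
  ORDER.forEach((k, i) => { params[k] = parts[i] ? parts[i].split(',') : []; });
  FLAGS.forEach((k, i) => { params[k] = flags[i] === '1'; });
  return params;
}

// pack({repos: ['repo_1'], labels: ['label_1', 'label_2'], milestones: [],
//       username: ['espy'], show_open: true})
// -> "repo_1|label_1,label_2||espy|100"
```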
Is there anything else I should be looking for?
Why not use protocol buffers?

ProtoBuf.js converts objects to protocol buffer messages and vice versa.
The following object converts to:
CgFhCgFiCgFjEgFkEgFlEgFmGgFnGgFoGgFpIgNqZ2I=
Example
The following example is built using require.js. Give it a try on this jsfiddle.
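For reference, a minimal sketch of the same round trip with the current protobufjs API; the message definition and field names here are assumptions, not the ones from the fiddle:

```js
const protobuf = require('protobufjs');

// state.proto (an assumed definition for illustration):
//   syntax = "proto3";
//   message State {
//     repeated string repos  = 1;
//     repeated string labels = 2;
//   }
protobuf.load('state.proto').then(root => {
  const State = root.lookupType('State');
  const message = State.create({ repos: ['a', 'b'], labels: ['c'] });
  const bytes = State.encode(message).finish();          // Uint8Array
  const token = Buffer.from(bytes).toString('base64');   // goes into the URL
  const restored = State.decode(Buffer.from(token, 'base64'));
  console.log(token, restored.repos);
});
```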
Update: I released an NPM package with some more optimizations, see https://www.npmjs.com/package/@yaska-eu/jsurl2
Some more tips:

- Base64 output uses `a..zA..Z0..9+/=`, and the un-encoded URI characters are `a..zA..Z0..9-_.~`. So Base64 results only need to swap `+/=` for `-_.` and they won't expand URIs (see the sketch right after this list).
- `{foo:3,bar:{g:'hi'}}` becomes `a3,b{c'hi'}` given the key array `['foo','bar','g']`.
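A tiny sketch of that first tip:

```js
// Make a Base64 string URI-safe by swapping the three offending characters.
function base64ToUri(b64) {
  return b64.replace(/\+/g, '-').replace(/\//g, '_').replace(/=/g, '.');
}
function uriToBase64(uri) {
  return uri.replace(/-/g, '+').replace(/_/g, '/').replace(/\./g, '=');
}
```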
Interesting libraries:

- JSURL: `{"name":"John Doe","age":42,"children":["Mary","Bill"]}` becomes `~(name~'John*20Doe~age~42~children~(~'Mary~'Bill))`, and with a key dictionary `['name','age','children']` that could be `~(0~'John*20Doe~1~42~2~(~'Mary~'Bill))`, thus going from 101 bytes URI-encoded to 38.
- lz-string, which has a `compressToEncodedURIComponent()` function to produce URI-safe output.

So basically I'd recommend picking one of these two libraries and considering the problem solved.
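For instance, a minimal round trip with lz-string (assuming the npm package of that name):

```js
const LZString = require('lz-string');

const json = JSON.stringify({ name: 'John Doe', age: 42,
                              children: ['Mary', 'Bill'] });
// Output uses only URI-safe characters, so no extra encodeURIComponent step.
const packed = LZString.compressToEncodedURIComponent(json);
const restored = JSON.parse(
  LZString.decompressFromEncodedURIComponent(packed));
```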
Perhaps you can find a URL shortener with a JSONP API; that way you could make all the URLs really short automatically.

http://yourls.org/ even has JSONP support.
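A rough sketch of how that could look with a self-hosted YOURLS instance; the endpoint and parameter names follow the YOURLS API docs, but treat the details as assumptions, and a real call also needs an API signature:

```js
// JSONP call to a YOURLS instance (endpoint and params assumed; auth omitted).
function shorten(longUrl, callbackName) {
  const script = document.createElement('script');
  script.src = 'http://sho.rt/yourls-api.php'
    + '?action=shorturl&format=jsonp'
    + '&url=' + encodeURIComponent(longUrl)
    + '&callback=' + callbackName;
  document.head.appendChild(script);
}

window.onShortened = data => console.log(data.shorturl);
shorten(location.href, 'onShortened');
```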
Short
Use a URL packing scheme such as my own, starting only from the params section of your URL.
Longer
As others here have pointed out, typical compression systems don't work for short strings. But it's important to recognise that URLs and params are a serialization format of a data model: a human-readable text format with specific sections. We know the scheme comes first, the host is found directly after, the port is implied but can be overridden, etc.
With the original data model, one can serialize with a more bit-efficient serialization scheme. In fact, I have created such a serialization myself, which achieves around 50% compression: see http://blog.alivate.com.au/packed-url/
Just as you yourself propose, I would first get rid of all the characters that are not carrying any information, because they are part of the "format".
E.g. turn "labels=open,ssl,cypher&repository=275643&username=ryanbrg&milestones=&with_comment=yes" to "open,ssl,cyper|275643|ryanbrg||yes".
Then use a Huffman encoding with a fixed probability vector (resulting in a fixed mapping from characters to variable-length bitstrings, with the most probable characters mapped to shorter bitstrings and less probable characters mapped to longer ones).
You could even use different probability vectors for the different parameters. For example, in the parameter "labels" the alpha characters will have high probability, but in the "repository" parameter the numeric characters will have the highest probability. If you do this, you should consider the separator "|" part of the preceding parameter.
And finally turn the long bitstring (which is the concatenation of all the bitstrings to which the characters were mapped) into something you can put into a URL by base64url encoding it.
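A sketch of that pipeline in JavaScript, with a made-up frequency vector standing in for one measured on real parameter lists:

```js
// Assumed frequency vector; a real one would be measured on actual data.
const FREQ = { '|': 10, ',': 6, '-': 1, '_': 1 };
for (const c of 'abcdefghijklmnopqrstuvwxyz') FREQ[c] = 4;
for (const c of '0123456789') FREQ[c] = 2;

// Standard Huffman construction: repeatedly merge the two lightest nodes.
function buildCodes(freq) {
  let nodes = Object.entries(freq).map(([ch, w]) => ({ ch, w }));
  while (nodes.length > 1) {
    nodes.sort((a, b) => a.w - b.w);
    const [l, r] = nodes.splice(0, 2);
    nodes.push({ w: l.w + r.w, l, r });
  }
  const codes = {};
  (function walk(n, prefix) {
    if (n.ch !== undefined) { codes[n.ch] = prefix; return; }
    walk(n.l, prefix + '0');
    walk(n.r, prefix + '1');
  })(nodes[0], '');
  return codes;
}
const CODES = buildCodes(FREQ);

// "open,ssl,cypher|275643|ryanbrg||yes" -> bit string -> base64url
function encodeState(text) {
  const bits = [...text].map(c => CODES[c]).join('');
  let raw = '';
  for (let i = 0; i < bits.length; i += 8) {
    raw += String.fromCharCode(parseInt(bits.slice(i, i + 8).padEnd(8, '0'), 2));
  }
  return btoa(raw).replace(/\+/g, '-').replace(/\//g, '_').replace(/=+$/, '');
}
// Decoding walks the same tree bit by bit; omitted here for brevity.
```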
If you could send me a set of representative parameter lists, I could run them through a Huffman coder to see how well they compress.
The probability vector (or, equivalently, the mapping from characters to bitstrings) should be encoded as constant arrays in the JavaScript function that is sent to the browser.
Of course you could go even further and, for example, try to get a list of possible labels with their probabilities. Then you could map entire labels to bitstrings with a Huffman encoding. This will give you better compression, but you will have extra work for those labels that are new (e.g. falling back to the single-character encoding), and of course the mapping (which, as mentioned above, is a constant array in the JavaScript function) will be much larger.
A working solution putting various bits of good (or so I think) ideas together
I did this for fun, mainly because it gave me an opportunity to implement a Huffman encoder in PHP, and I could not find a satisfactory existing implementation.
However, this might save you some time if you plan to explore a similar path.
Burrows-Wheeler+move-to-front+Huffman transform
I'm not quite sure BWT would be best suited for your kind of input.
This is no regular text, so recurring patterns would probably not occur as often as in source code or plain English.
Besides, a dynamic Huffman code would have to be passed along with the encoded data which, for very short input strings, would harm the compression gain badly.
I might well be wrong, in which case I would gladly see someone prove me to be.
Anyway, I decided to try another approach.
General principle
1) define a structure for your URL parameters and strip the constant part
for instance, starting from:
extract:
where `,` and `|` act as string and/or field terminators, while boolean values don't need any.

2) define a static repartition of symbols based on the expected average input and derive a static Huffman code
Since transmitting a dynamic table would take more space than your initial string, I think the only way to achieve any compression at all is to have a static Huffman table.
However, you can use the structure of your data to your advantage to compute reasonable probabilities.
You can start with the repartition of letters in English or other languages and throw in a certain percentage of numbers and other punctuation signs.
Testing with a dynamic Huffman coding, I saw compression rates of 30 to 50%.
This means that with a static table you can expect maybe a 2/3 compression factor (reducing the length of your data by about 1/3), not much more.
3) convert this binary Huffman code into something a URI can handle
The 70 regular 7-bit ASCII chars in that list would give you an expansion factor of about 30%, practically no better than a base64 encode.
A 30% expansion would ruin the gain from a static Huffman compression, so this is hardly an option!
However, since you control the encoding client- and server-side, you can use about anything that is not a URI-reserved character.

An interesting possibility would be to complete the above set up to 256 with whatever Unicode glyphs, which would allow encoding your binary data with the same number of URI-compliant characters, thus replacing a painful and slow bunch of long-integer divisions with a lightning-fast table lookup.
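A sketch of that byte-to-glyph lookup; the filler glyph range is an arbitrary assumption:

```js
// 256-entry alphabet: the unreserved URI characters first, then an
// arbitrary run of CJK ideographs as filler.
let ALPHABET = 'ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz'
             + '0123456789-_.~';
for (let i = 0; ALPHABET.length < 256; i++) {
  ALPHABET += String.fromCharCode(0x4E00 + i);
}

const INDEX = {};
[...ALPHABET].forEach((ch, i) => { INDEX[ch] = i; });

// One table lookup per byte instead of long-integer divisions.
const bytesToUri = bytes => [...bytes].map(b => ALPHABET[b]).join('');
const uriToBytes = str  => Uint8Array.from([...str], ch => INDEX[ch]);
```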
Structure description
The codec is meant to be used both client and server side, so it is essential that server and clients share a common data structure definition.
Since the interface is likely to evolve, it seems wise to store a version number for upward compatibility.
The interface definition will use a very minimalistic description language, like so:
- Each supported language will have a frequency table for all its used letters.
- Digits and other computerish symbols like `-`, `.` or `_` will have a global frequency, regardless of language.
- Separator (`,` and `|`) frequencies will be computed according to the number of lists and fields present in the structure.
- All other "foreign" characters will be escaped with a specific code and encoded as plain UTF-8.
Implementation
The bidirectional conversion path is as follows:
list of fields <-> UTF-8 data stream <-> huffman codes <-> URI
Here is the main codec
The underlying Huffman codec
And the Huffman dictionary
Example
output:
In that example, the input got packed into 64 Unicode characters, for an input length of about 100, yielding a 1/3 reduction.
An equivalent string:
Would be compressed by a dynamic Huffman table to 59 characters. Not much of a difference.
No doubt smart data reordering would reduce that, but then you would need to pass the dynamic table along...
Chinese to the rescue?
Drawing on ttepasse's idea, one could take advantage of the huge number of Asian characters and find a range of 0x1000 (12 bits' worth of) contiguous values, to code 3 bytes into 2 CJK characters, like so:
and back:
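In JavaScript, that packing could be sketched as follows (the base code point is an arbitrary pick inside the CJK block):

```js
const BASE = 0x4E00; // arbitrary start of a 0x1000-wide range of CJK ideographs

// Pack 3 bytes (24 bits) into 2 characters carrying 12 bits each.
function bytesToCjk(bytes) {
  let out = '';
  for (let i = 0; i < bytes.length; i += 3) {
    const n = (bytes[i] << 16) | ((bytes[i + 1] || 0) << 8) | (bytes[i + 2] || 0);
    out += String.fromCharCode(BASE + (n >> 12), BASE + (n & 0xFFF));
  }
  return out;
}

// And back (length bookkeeping for inputs not a multiple of 3 is omitted).
function cjkToBytes(str) {
  const bytes = [];
  for (let i = 0; i < str.length; i += 2) {
    const n = ((str.charCodeAt(i) - BASE) << 12) | (str.charCodeAt(i + 1) - BASE);
    bytes.push(n >> 16, (n >> 8) & 0xFF, n & 0xFF);
  }
  return Uint8Array.from(bytes);
}
```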
The previous output of 64 Latin chars would "shrink" to 42 Asian characters:
However, as you can see, the sheer bulk of your average ideogram makes the string actually longer (pixel-wise), so even if the idea was promising, the outcome is rather disappointing.
Picking thinner glyphs
On the other hand, you can try to pick "thin" characters as a base for URI encoding. For instance:
instead of
That will shrink the length by half with proportional fonts, including in a browser address bar.
My best candidate set of 256 "thin" glyphs so far:
Conclusion
This implementation should be ported to JavaScript to allow client-server exchange.
You should also provide a way to share the structure and Huffman codes with the clients.
It is not difficult and rather fun to do, but that means even more work :).
The Huffman gain in terms of characters is around 30%.
Of course these characters are multibyte for the most part, but if you aim for the shortest URI it does not matter.
Except for the booleans that can easily be packed to 1 bit, those pesky strings seem rather reluctant to be compressed.
It might be possible to better tune the frequencies, but I doubt you will get above 50% compression rate.
On the other hand, picking thin glyphs actually does more to shrink the string.
So all in all the combination of both might indeed achieve something, though it's a lot of work for a modest result.