Say I have a single-page application that uses a third party API for content. The app’s logic is in-browser only, and there is no backend I can write to.
To allow deep-linking into the state of the app, I use pushState to keep track of a few variables that determine the state of the app (note that Ubersicht's public version doesn't do this yet): in this case `repos`, `labels`, `milestones` and `username`, plus `show_open`, `with_comments` and `without_comments` (all booleans). The URL format is `?label=label_1,label_2,label_3&repos=repo_1…`. Values are the usual suspects, roughly `[a-zA-Z][a-zA-Z0-9_-]`, or a boolean indicator.
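Roughly, the bookkeeping looks like this (a simplified sketch, not the app's actual code):

```javascript
// Simplified sketch of the pushState bookkeeping: flatten the
// state object into a query string and push it onto the history.
function serialize(state) {
  return '?' + Object.entries(state)
    .map(([key, v]) => key + '=' + (Array.isArray(v) ? v.join(',') : v))
    .join('&');
}

const state = { repos: ['repo_1', 'repo_2'], labels: ['label_1'], show_open: true };
history.pushState(state, '', serialize(state));
// → address bar shows ?repos=repo_1,repo_2&labels=label_1&show_open=true
```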
So far so good. Since the query string can get a bit long and unwieldy, I would like to be able to pass around URLs like http://espy.github.io/ubersicht/?state=SOMOPAQUETOKENTHATLOSSLESSLYDECOMPRESSESINTOTHEORIGINALVALUES#hoodiehq instead, and the shorter the better.
My first attempt was going to be a zlib-like algorithm (https://github.com/imaya/zlib.js), and @flipzagging pointed to antirez/smaz (https://github.com/antirez/smaz), which sounds more suitable for short strings (JavaScript version at https://github.com/personalcomputer/smaz.js).
Since `=` and `&` are not specifically handled in https://github.com/personalcomputer/smaz.js/blob/master/lib/smaz.js#L9, we might be able to tweak things a little there.
Furthermore, there is the option of encoding the values against a fixed table, i.e. the order of arguments is pre-defined and all we need to keep track of is the actual values: e.g. turn `a=hamster&b=cat` into `7hamster3cat` (length + chars) or `hamster|cat` (value + `|`), potentially before the smaz compression.
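A minimal sketch of the length-prefixed variant (single-digit lengths only, to keep it short):

```javascript
// Sketch of the fixed-order, length-prefixed idea: the decoder knows
// the parameter order (here a, b), so only the values are encoded.
const ORDER = ['a', 'b']; // pre-defined argument order

function encodeFixed(params) {
  return ORDER.map(k => params[k].length + params[k]).join('');
}

function decodeFixed(str) {
  const params = {};
  let i = 0;
  for (const key of ORDER) {
    // single-digit length prefix; real code would need a delimiter
    // or multi-digit handling for values longer than 9 chars
    const len = parseInt(str[i], 10);
    params[key] = str.slice(i + 1, i + 1 + len);
    i += 1 + len;
  }
  return params;
}

encodeFixed({ a: 'hamster', b: 'cat' }); // "7hamster3cat"
```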
Is there anything else I should be looking for?
Maybe a simple JS minifier would help you. You'd only need to integrate it at the serialization and deserialization points. I think it'd be the easiest solution.
There are two main aspects to the problem: encoding and compression.
General-purpose compression doesn't seem to work well on small strings. And since browsers don't provide any API to compress strings, you also need to ship the compressor's source, which can be huge.
A lot of characters can be saved by using an efficient encoding. I have written a library named μ to handle the encoding and decoding part. The idea is to specify as much information as is available about the structure and domain of the URL parameters as a specification, which can then drive the encoding and decoding. For example, a boolean can be encoded using just one bit, an integer can be converted to a different base (say 64) to reduce the number of characters required, object keys need not be encoded because they can be inferred from the specification, and enums can be encoded using log2(numberOfAllowedValues) bits.
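To make that concrete, here is a sketch of spec-driven packing (an illustration of the principle, not μ's actual API; the `milestone` field and its values are made up):

```javascript
// Sketch of the spec-driven idea (not μ's real API). Fields are packed
// mixed-radix into one integer, then rendered in base 36; the decoder
// uses the same spec, so no keys need to be transmitted.
const spec = [
  { name: 'show_open',     values: [false, true] },                   // 1 bit
  { name: 'with_comments', values: [false, true] },                   // 1 bit
  { name: 'milestone',     values: ['none', 'v1', 'v2', 'backlog'] }, // made-up example, 2 bits
];

function pack(state) {
  let acc = 0;
  for (const f of spec) acc = acc * f.values.length + f.values.indexOf(state[f.name]);
  return acc.toString(36);
}

function unpack(str) {
  let acc = parseInt(str, 36);
  const state = {};
  for (const f of [...spec].reverse()) {
    state[f.name] = f.values[acc % f.values.length];
    acc = Math.floor(acc / f.values.length);
  }
  return state;
}

pack({ show_open: true, with_comments: false, milestone: 'v2' }); // "a"
```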
Small tip: both `parseInt` and `Number#toString` support radix arguments. Try using a radix of 36 to encode numbers (or indexes into lists) in URLs.
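For instance, round-tripping an example numeric ID through base 36:

```javascript
// Radix-36 round trip: shorter than decimal, still URL-safe.
const id = 4780572;               // e.g. a numeric GitHub repo ID
const short = id.toString(36);    // "2ugpo" -- 5 chars instead of 7
const back = parseInt(short, 36); // 4780572 again
```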
I have a cunning plan! (And a drink of gin and tonic.)

You don't seem to care about the length of the byte stream so much as the length of the resulting glyphs, i.e. the string that is displayed to the user.
Browsers are pretty good at converting an IRI to the underlying URI while still displaying the IRI in the address bar. IRIs have a far larger repertoire of possible characters, while your set of possible chars is rather limited.
That means you can encode bigrams of your chars (aa, ab, ac, …, zz and special chars) into one char of the full Unicode spectrum. Say you've got 80 possible ASCII chars: the number of possible two-char combinations is 6400, which are easily found among Unicode's assigned chars, e.g. in the Han-unified CJK range.
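A minimal sketch of such an arithmetic mapping (the alphabet below is my guess at a workable char set; CJK Unified Ideographs form a contiguous assigned block starting at U+4E00):

```javascript
// Sketch: pack bigrams from a small alphabet into single CJK chars.
// Because the CJK Unified Ideographs block is contiguous and fully
// assigned, the mapping can be purely arithmetic -- no hash tables.
const ALPHABET = 'abcdefghijklmnopqrstuvwxyz' +
                 'ABCDEFGHIJKLMNOPQRSTUVWXYZ' +
                 '0123456789_-,=&';  // 67 chars; 67^2 = 4489 bigrams
const BASE = 0x4E00;                // start of CJK Unified Ideographs

function packBigrams(s) {
  if (s.length % 2) s += ALPHABET[0]; // naive padding; real code would
                                      // need it to be unambiguous
  let out = '';
  for (let i = 0; i < s.length; i += 2) {
    const hi = ALPHABET.indexOf(s[i]);
    const lo = ALPHABET.indexOf(s[i + 1]);
    out += String.fromCharCode(BASE + hi * ALPHABET.length + lo);
  }
  return out;
}

function unpackBigrams(s) {
  let out = '';
  for (let i = 0; i < s.length; i++) {
    const n = s.charCodeAt(i) - BASE;
    out += ALPHABET[Math.floor(n / ALPHABET.length)] + ALPHABET[n % ALPHABET.length];
  }
  return out;
}

// packBigrams('labels=bug,ui') → 7 CJK chars instead of 13 ASCII ones
```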
I picked CJK because this is only (slightly) reasonable if the target chars are assigned in Unicode and have glyphs on the major browsers and operating systems. For that reason the Private Use Area is out, and the more efficient version using trigrams (whose possible combinations could use all of Unicode's 1,114,112 possible code points) is out too.
To recap: the underlying bytes are still there and, given UTF-8 encoding, possibly even more numerous, but the string of displayed characters the user sees and copies is 50% shorter.
OK, OK, reasons why this solution is insane:

- IRIs are not perfect. A lot of tools lesser than modern browsers have problems with them.
- The algorithm obviously needs a lot more work. You'll need a function that maps the bigrams to the target chars and back, and it should preferably work arithmetically to avoid big hash tables in memory.
- The target chars should be checked: are they assigned, and are they simple chars rather than fancy Unicodian things like combining chars or stuff that gets lost somewhere in Unicode normalization? Also, is the target area a continuous span of assigned chars with glyphs?
- Browsers are sometimes wary of IRIs, for good reason, given the IDN homograph attacks. Are they OK with all these non-ASCII chars in their address bar?
- And the biggest: people are notoriously bad at remembering characters in scripts they don't know, and even worse at trying to (re)type them. And copy'n'paste can go wrong in many different ways. There is a reason URL shorteners use Base64 and even smaller alphabets.
… speaking of which: that would be my solution. Offload the work of shortening links either to the user, or to goo.gl or bit.ly via their APIs.
It looks like the GitHub APIs have numeric IDs for many things under the covers (repos and users have them, but labels don't). It might be possible to use those numbers instead of names wherever advantageous. You then have to figure out how best to encode them in something that'll survive in a query string, e.g. URL-safe base64.
For example, your hoodie.js repository has ID `4780572`. Packing that into a big-endian unsigned int (as many bytes as we need) gets us `\x00H\xf2\x1c`. We'll just toss the leading zero byte, since we can always restore it later; now we have `H\xf2\x1c`. Encode that as URL-safe base64 and you have `SPIc` (toss any padding you might get). Going from `hoodiehq/hoodie.js` to `SPIc` seems like a good-sized win!

More generally, if you're willing to invest the time, you can try to exploit a bunch of redundancies in your query strings. Other ideas are along the lines of packing the two boolean params into a single character, possibly along with other state (like which fields are included). If you use base64 encoding (which seems the best option here thanks to its URL-safe variant; I looked at base85, but it has a bunch of characters that won't survive in a URL), you get 6 bits of entropy per character... there's a lot you can do with that.
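A sketch of that round trip in browser JavaScript (the answer doesn't prescribe an implementation, so treat this btoa/atob version as one possible way):

```javascript
// Minimal sketch: pack a numeric ID into URL-safe base64 and back.
// Assumes IDs fit in a few bytes (< 2^32).
function encodeId(id) {
  const bytes = [];
  while (id > 0) {             // big-endian, leading zero bytes dropped
    bytes.unshift(id & 0xff);
    id = Math.floor(id / 256);
  }
  return btoa(String.fromCharCode(...bytes))
    .replace(/\+/g, '-').replace(/\//g, '_') // URL-safe alphabet
    .replace(/=+$/, '');                     // toss padding
}

function decodeId(s) {
  const bin = atob(s.replace(/-/g, '+').replace(/_/g, '/'));
  let id = 0;
  for (const ch of bin) id = id * 256 + ch.charCodeAt(0);
  return id;
}

encodeId(4780572); // "SPIc"
decodeId('SPIc');  // 4780572
```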
To add to Thomas Fuchs' note: yes, if there's some kind of inherent, immutable ordering in some of the things you're encoding, then that would obviously also help. However, that seems hard for both the labels and the milestones.
Why not use a third party link shortener?
(I am assuming you don't have a problem with URI length limits since you mentioned this is an existing application.)
It looks like you're writing a Greasemonkey script or thereabouts, so perhaps you have access to GM_xmlhttpRequest(), which would allow use of a third party link shortener.
Otherwise, you'd need to use XMLHttpRequest() and host your own link-shortening service on the same server to avoid crossing the same-origin policy boundary. A quick online search for hosting your own shortener turned up a list of 7 free/open-source PHP link shortener scripts and one more on GitHub, though the question likely rules out this kind of approach since "The app's logic is in-browser only, and there is no backend I can write to."
You can see example code implementing this kind of thing in the URL Shortener UserScript (for Greasemonkey), which pops up a shortened version of the current page's URL when you press SHIFT+T.
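For instance, a sketch along those lines (the shortener endpoint and response format below are hypothetical; GM_xmlhttpRequest is the real Greasemonkey API):

```javascript
// Sketch of shortening via GM_xmlhttpRequest. Adapt the endpoint and
// response handling to whichever shortener API you actually use.
function shorten(longUrl, callback) {
  GM_xmlhttpRequest({
    method: 'POST',
    url: 'https://shortener.example/api/shorten', // hypothetical endpoint
    headers: { 'Content-Type': 'application/x-www-form-urlencoded' },
    data: 'url=' + encodeURIComponent(longUrl),
    onload: function (response) {
      callback(response.responseText); // assumed: body is the short URL
    }
  });
}

shorten(location.href, function (shortUrl) {
  console.log('short link:', shortUrl);
});
```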
Of course, shorteners will redirect users to the long-form URL, but that would be a problem with any non-server-side solution. At least a shortener can theoretically proxy (like Apache's RewriteRule with [P]) or use a <frame> tag.