I'm looking for a way to shorten an already short string as much as possible.
The string is a hostname:port combo and could look like "my-domain.se:2121" or "123.211.80.4:2122".
I know regular compression is pretty much out of the question on strings this short due to the overhead needed and the lack of repetition but I have an idea of how to do it.
Because the alphabet is limited to 39 characters ([a-z][0-9]-:.) every character could fit in 6 bits. This reduce the length with up to 25% compared to ASCII. So my suggestion is somthing along these lines:
- Encode the string to a byte array using some kind of custom encoding
- Decode the byte array to a UTF-8 or ASCII string (this string will obviously not make any sense).
And then reverse the process to get the original string.
So to my questions:
- Could this work?
- Is there a better way?
- How?
First of all, IP addresses are designed to fit into 4 bytes and port numbers into 2. The ascii representation is only for humans to read, so it doesn't make sense to do compression on that.
Your idea for compressing domain name strings is doable.
Well in your case, I would use a specialized algo for your usecase. Recognize that you can store something other than strings. So for a IPv4 address : port, you would have a class that captured 6 bytes -- 4 for the address and 2 for the port. Another for type for apha-numeric hostnames. The port would always be stored in two bytes. The hostname part itself could also have specialized support for
.com
, for example. So a sample hierarchy may be:In this case, the DotComHostnamePort allows you to drop
.com
from the host name and save 4 chars/bytes, depending on whether you store hostnames in punyform or in UTF16 form.What you are suggesting is similar to base 64 encoding/decoding and there might be some mileage in looking at some of those implementations (base 64 encoding uses 6 bits).
As a starter if you use Apaches base 64 library
It will shorten your string by a few chars. This obviously does not work as what you end up with is not what you started with.
You could encode the string as base 40 which is more compact than base 64. This will give you 12 such tokens into a 64 bit long. The 40th token could be the end of string marker to give you the length (as it will not be a whole number of bytes any more)
If you use arithmetic encoding, it could be much smaller but you would need a table of frequencies for each token. (using a long list of possible examples)
prints
I leave decoding as an exercise. ;)
The first two bytes could contain the port number. If you always start with this fixed length port number, you don't need to include the separator
:
. Instead use a bit that indicates whether an IP address follows (see Karl Bielefeldt's solution) or a host name.You could encode them using the CDC Display code. This encoding was used back in the old days when bits were scarce and programmers were nervous.