Shorten an already short string in Java

2019-05-06 18:57发布

I'm looking for a way to shorten an already short string as much as possible.

The string is a hostname:port combo and could look like "my-domain.se:2121" or "123.211.80.4:2122".

I know regular compression is pretty much out of the question on strings this short due to the overhead needed and the lack of repetition but I have an idea of how to do it.

Because the alphabet is limited to 39 characters ([a-z][0-9]-:.) every character could fit in 6 bits. This reduce the length with up to 25% compared to ASCII. So my suggestion is somthing along these lines:

  1. Encode the string to a byte array using some kind of custom encoding
  2. Decode the byte array to a UTF-8 or ASCII string (this string will obviously not make any sense).

And then reverse the process to get the original string.

So to my questions:

  1. Could this work?
  2. Is there a better way?
  3. How?

6条回答
萌系小妹纸
2楼-- · 2019-05-06 19:17

First of all, IP addresses are designed to fit into 4 bytes and port numbers into 2. The ascii representation is only for humans to read, so it doesn't make sense to do compression on that.

Your idea for compressing domain name strings is doable.

查看更多
一夜七次
3楼-- · 2019-05-06 19:28

Well in your case, I would use a specialized algo for your usecase. Recognize that you can store something other than strings. So for a IPv4 address : port, you would have a class that captured 6 bytes -- 4 for the address and 2 for the port. Another for type for apha-numeric hostnames. The port would always be stored in two bytes. The hostname part itself could also have specialized support for .com, for example. So a sample hierarchy may be:

    HostPort
       |
  +----+--------+
  |             |
IPv4        HostnamePort
                |
           DotComHostnamePort


public interface HostPort extends CharSequence { }


public HostPorts {
  public static HostPort parse(String hostPort) {
    ...
  }
}

In this case, the DotComHostnamePort allows you to drop .com from the host name and save 4 chars/bytes, depending on whether you store hostnames in punyform or in UTF16 form.

查看更多
祖国的老花朵
4楼-- · 2019-05-06 19:31

What you are suggesting is similar to base 64 encoding/decoding and there might be some mileage in looking at some of those implementations (base 64 encoding uses 6 bits).

As a starter if you use Apaches base 64 library

String x = new String(Base64.decodeBase64("my-domain.se:2121".getBytes()));
String y = new String(Base64.encodeBase64(x.getBytes()));
System.out.println("x = " + x);
System.out.println("y = " + y);

It will shorten your string by a few chars. This obviously does not work as what you end up with is not what you started with.

查看更多
趁早两清
5楼-- · 2019-05-06 19:32

You could encode the string as base 40 which is more compact than base 64. This will give you 12 such tokens into a 64 bit long. The 40th token could be the end of string marker to give you the length (as it will not be a whole number of bytes any more)

If you use arithmetic encoding, it could be much smaller but you would need a table of frequencies for each token. (using a long list of possible examples)

class Encoder {
  public static final int BASE = 40;
  StringBuilder chars = new StringBuilder(BASE);
  byte[] index = new byte[256];

  {
    chars.append('\0');
    for (char ch = 'a'; ch <= 'z'; ch++) chars.append(ch);
    for (char ch = '0'; ch <= '9'; ch++) chars.append(ch);
    chars.append("-:.");
    Arrays.fill(index, (byte) -1);
    for (byte i = 0; i < chars.length(); i++)
      index[chars.charAt(i)] = i;
  }

  public byte[] encode(String address) {
    try {
      ByteArrayOutputStream baos = new ByteArrayOutputStream();
      DataOutputStream dos = new DataOutputStream(baos);
      for (int i = 0; i < address.length(); i += 3) {
        switch (Math.min(3, address.length() - i)) {
          case 1: // last one.
            byte b = index[address.charAt(i)];
            dos.writeByte(b);
            break;

          case 2:
            char ch = (char) ((index[address.charAt(i+1)]) * 40 + index[address.charAt(i)]);
            dos.writeChar(ch);
            break;

          case 3:
            char ch2 = (char) ((index[address.charAt(i+2)] * 40 + index[address.charAt(i + 1)]) * 40 + index[address.charAt(i)]);
            dos.writeChar(ch2);
            break;
        }
      }
      return baos.toByteArray();
    } catch (IOException e) {
      throw new AssertionError(e);
    }
  }

  public static void main(String[] args) {
    Encoder encoder = new Encoder();
    for (String s : "twitter.com:2122,123.211.80.4:2122,my-domain.se:2121,www.stackoverflow.com:80".split(",")) {
      System.out.println(s + " (" + s.length() + " chars) encoded is " + encoder.encode(s).length + " bytes.");
    }
  }
}

prints

twitter.com:2122 (16 chars) encoded is 11 bytes.
123.211.80.4:2122 (17 chars) encoded is 12 bytes.
my-domain.se:2121 (17 chars) encoded is 12 bytes.
www.stackoverflow.com:80 (24 chars) encoded is 16 bytes.

I leave decoding as an exercise. ;)

查看更多
Evening l夕情丶
6楼-- · 2019-05-06 19:33

The first two bytes could contain the port number. If you always start with this fixed length port number, you don't need to include the separator :. Instead use a bit that indicates whether an IP address follows (see Karl Bielefeldt's solution) or a host name.

查看更多
Evening l夕情丶
7楼-- · 2019-05-06 19:39

You could encode them using the CDC Display code. This encoding was used back in the old days when bits were scarce and programmers were nervous.

查看更多
登录 后发表回答