Lossless hierarchical run length encoding

Posted 2019-02-04 18:29


I want to summarize, rather than compress, data in a manner similar to run-length encoding, but in a nested sense.

For instance, I want ABCBCABCBCDEEF to become (2A(2BC))D(2E)F.

I am not concerned about which option is picked when two possible nestings are equally good, e.g.:

ABBABBABBABA could be (3ABB)ABA or A(3BBA)BA, which have the same compressed length despite having different structures.

However, I do want the choice to be the MOST greedy. For instance:

ABCDABCDCDCDCD would pick (2ABCD)(3CD), which has length six in original symbols, rather than ABCDAB(4CD), which has length eight in original symbols.
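
To make "length in original symbols" concrete, here is a minimal Python sketch that parses the bracket notation used above, expands it back to the original string (which checks that the summary is lossless), and counts the symbols it keeps. The function names `parse`, `expand` and `symbol_length` are made up for this illustration, and the parser assumes every symbol is a single non-digit character.

```python
def parse(text):
    """Parse nested notation such as '(2A(2BC))D(2E)F' into a tree.

    A tree is a list whose items are either a one-character symbol or a
    (count, subtree) tuple. Assumes symbols are single non-digit characters.
    """
    def parse_items(pos):
        items = []
        while pos < len(text) and text[pos] != ")":
            if text[pos] == "(":
                pos += 1
                start = pos
                while pos < len(text) and text[pos].isdigit():
                    pos += 1
                count = int(text[start:pos])   # repeat count after '('
                body, pos = parse_items(pos)   # nested group contents
                pos += 1                       # skip the closing ')'
                items.append((count, body))
            else:
                items.append(text[pos])
                pos += 1
        return items, pos

    tree, end = parse_items(0)
    if end != len(text):
        raise ValueError("unbalanced parentheses in %r" % text)
    return tree


def expand(tree):
    """Rebuild the original string from the tree."""
    return "".join(
        item if isinstance(item, str) else expand(item[1]) * item[0]
        for item in tree
    )


def symbol_length(tree):
    """Count the original symbols kept in the summary (repeat counts are free)."""
    return sum(
        1 if isinstance(item, str) else symbol_length(item[1])
        for item in tree
    )


for summary in ["(2ABCD)(3CD)", "ABCDAB(4CD)"]:
    tree = parse(summary)
    print(summary, expand(tree), symbol_length(tree))
```

Both summaries expand to ABCDABCDCDCDCD, with symbol lengths of 6 and 8 respectively, which is the comparison made above.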

As background, I have some repeating patterns that I want to summarize so that the data is more digestible. I don't want to disrupt the logical order of the data, as it is important, but I do want to summarize it by saying, for example, symbol A for 3 occurrences, followed by symbols XYZ for 20 occurrences, and so on. This can then be displayed visually in a nested form.

Ideas welcome.


I'm pretty sure this isn't the best approach, and depending on the length of the patterns, its running time and memory usage might make it impractical, but here's some code.

You can paste the following code into LINQPad and run it, and it should produce the following output:

    ABCBCABCBCDEEF = (2A(2BC))D(2E)F
    ABBABBABBABA = (3A(2B))ABA
    ABCDABCDCDCDCD = (2ABCD)(3CD)

As you can see, the middle example encoded ABB as A(2B) instead of ABB. You would have to decide for yourself whether single-symbol sequences like that should be encoded as a repeated symbol, or whether a specific threshold (like 3 or more) should be used.
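
One way to apply such a threshold is as a post-processing step on the encoded result. The following is only a sketch, written in Python rather than C#, and it assumes a hypothetical representation where an encoding is a list of one-character symbols and `(count, sub_encoding)` tuples; `unroll_short_runs`, `render` and `min_count` are names invented for this example.

```python
def unroll_short_runs(items, min_count=3):
    """Write single-symbol repeats out in full when they repeat fewer than
    min_count times, e.g. (2B) becomes BB for min_count=3."""
    out = []
    for item in items:
        if isinstance(item, str):
            out.append(item)
            continue
        count, body = item
        body = unroll_short_runs(body, min_count)  # handle nested groups first
        if len(body) == 1 and isinstance(body[0], str) and count < min_count:
            out.extend(body * count)               # too short: spell it out
        else:
            out.append((count, body))
    return out


def render(items):
    """Format an encoding in the (count symbols) notation."""
    return "".join(
        item if isinstance(item, str) else "(%d%s)" % (item[0], render(item[1]))
        for item in items
    )


# The middle example: (3A(2B))ABA
encoded = [(3, ["A", (2, ["B"])]), "A", "B", "A"]
print(render(unroll_short_runs(encoded, min_count=3)))  # (3ABB)ABA
```

Note that with `min_count=3` a group like (2E) would also be written out as EE, so the right threshold depends on how you want the summary to read.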

Basically, the code runs like this:

  1. For each position in the sequence, try to find the longest match (actually, it doesn't; it takes the first match with 2 or more repeats that it finds. I left the rest as an exercise for you, since I have to leave my computer for a few hours now)
  2. It then tries to encode that sequence, the one that repeats, recursively, and spits out an X*seq type of object
  3. If it can't find a repeating sequence, it spits out the single symbol at that location
  4. It then skips what it encoded, and continues from #1
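
The four steps above can be sketched compactly in Python. This is a loose port for illustration, not the C# code itself: `find_repeat`, `encode` and `render` are names chosen here, the encoding is represented as a list of symbols and `(count, sub_encoding)` tuples, and like the description it takes the first (shortest) repeating unit rather than searching for the best one.

```python
def find_repeat(seq, start):
    """Step 1: return (unit_length, count) for the first repeating unit that
    begins at start, trying the shortest unit first, or None if none repeats."""
    unit = 1
    while start + 2 * unit <= len(seq):
        pattern = seq[start:start + unit]
        count = 1
        # Count how many back-to-back copies of the unit follow.
        while seq[start + count * unit:start + (count + 1) * unit] == pattern:
            count += 1
        if count >= 2:
            return unit, count
        unit += 1
    return None


def encode(seq):
    """Encode seq as a list of symbols and (count, sub_encoding) tuples."""
    out = []
    i = 0
    while i < len(seq):
        found = find_repeat(seq, i)
        if found is None:
            out.append(seq[i])                           # step 3: single symbol
            i += 1
        else:
            unit, count = found
            out.append((count, encode(seq[i:i + unit])))  # step 2: recurse
            i += unit * count                            # step 4: skip ahead
    return out


def render(items):
    """Format an encoding in the (count symbols) notation."""
    return "".join(
        item if isinstance(item, str) else "(%d%s)" % (item[0], render(item[1]))
        for item in items
    )


for example in ["ABCBCABCBCDEEF", "ABBABBABBABA", "ABCDABCDCDCDCD"]:
    print(example, "=", render(encode(example)))
```

Because it stops at the first repeating unit it finds, this sketch makes no attempt to guarantee the "most greedy" result in general; it only happens to give (2ABCD)(3CD) for the third string.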

Anyway, here's the code:

void Main()
{
    // Assumed to be the three strings from the question;
    // the original list was lost from this listing.
    string[] examples = new[]
    {
        "ABCBCABCBCDEEF",
        "ABBABBABBABA",
        "ABCDABCDCDCDCD",
    };

    foreach (string example in examples)
    {
        StringBuilder sb = new StringBuilder();
        foreach (var r in Encode(example))
            sb.Append(r.ToString());
        Debug.WriteLine(example + " = " + sb.ToString());
    }
}

public static IEnumerable<Repeat<T>> Encode<T>(IEnumerable<T> values)
{
    return Encode<T>(values, EqualityComparer<T>.Default);
}

public static IEnumerable<Repeat<T>> Encode<T>(IEnumerable<T> values, IEqualityComparer<T> comparer)
{
    List<T> sequence = new List<T>(values);

    int index = 0;
    while (index < sequence.Count)
    {
        var bestSequence = FindBestSequence<T>(sequence, index, comparer);
        if (bestSequence == null || bestSequence.Length < 1)
            throw new InvalidOperationException("Unable to find sequence at position " + index);

        yield return bestSequence;
        index += bestSequence.Length; // skip everything that was just encoded
    }
}

private static Repeat<T> FindBestSequence<T>(IList<T> sequence, int startIndex, IEqualityComparer<T> comparer)
{
    // Try unit lengths 1, 2, 3, ... and stop at the first one that repeats
    int sequenceLength = 1;
    while (startIndex + sequenceLength * 2 <= sequence.Count)
    {
        if (comparer.Equals(sequence[startIndex], sequence[startIndex + sequenceLength]))
        {
            bool atLeast2Repeats = true;
            for (int index = 0; index < sequenceLength; index++)
            {
                if (!comparer.Equals(sequence[startIndex + index], sequence[startIndex + sequenceLength + index]))
                {
                    atLeast2Repeats = false;
                    break;
                }
            }
            if (atLeast2Repeats)
            {
                // Count how many more back-to-back copies follow
                int count = 2;
                while (startIndex + sequenceLength * (count + 1) <= sequence.Count)
                {
                    bool anotherRepeat = true;
                    for (int index = 0; index < sequenceLength; index++)
                    {
                        if (!comparer.Equals(sequence[startIndex + index], sequence[startIndex + sequenceLength * count + index]))
                        {
                            anotherRepeat = false;
                            break;
                        }
                    }
                    if (anotherRepeat)
                        count++;
                    else
                        break;
                }

                // Encode one copy of the repeating unit recursively
                List<T> oneSequence = Enumerable.Range(0, sequenceLength).Select(i => sequence[startIndex + i]).ToList();
                var repeatedSequence = Encode<T>(oneSequence, comparer).ToArray();
                return new SequenceRepeat<T>(count, repeatedSequence);
            }
        }

        sequenceLength++;
    }

    // fall back, we could not find anything that repeated at all
    return new SingleSymbol<T>(sequence[startIndex]);
}

public abstract class Repeat<T>
{
    public int Count { get; private set; }

    protected Repeat(int count)
    {
        Count = count;
    }

    // Number of original symbols this node covers
    public abstract int Length { get; }
}

public class SingleSymbol<T> : Repeat<T>
{
    public T Value { get; private set; }

    public SingleSymbol(T value)
        : base(1)
    {
        Value = value;
    }

    public override string ToString()
    {
        return string.Format("{0}", Value);
    }

    public override int Length
    {
        get { return Count; }
    }
}

public class SequenceRepeat<T> : Repeat<T>
{
    public Repeat<T>[] Values { get; private set; }

    public SequenceRepeat(int count, Repeat<T>[] values)
        : base(count)
    {
        Values = values;
    }

    public override string ToString()
    {
        return string.Format("({0}{1})", Count, string.Join("", Values.Select(v => v.ToString())));
    }

    public override int Length
    {
        get
        {
            int oneLength = 0;
            foreach (var value in Values)
                oneLength += value.Length;
            return Count * oneLength;
        }
    }
}

public class GroupRepeat<T> : Repeat<T>
{
    public Repeat<T> Group { get; private set; }

    public GroupRepeat(int count, Repeat<T> group)
        : base(count)
    {
        Group = group;
    }

    public override string ToString()
    {
        return string.Format("({0}{1})", Count, Group);
    }

    public override int Length
    {
        get { return Count * Group.Length; }
    }
}

Looking at the problem theoretically, it seems similar to the problem of finding the smallest context-free grammar that generates (only) the string, except that in this case the non-terminals can only be used in direct sequence after each other, so e.g.



Of course, this depends on how you define "smallest", but if you count terminals on the right side of rules, it should be the same as the "length in original symbols" after doing the nested run-length encoding.
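
To illustrate that correspondence, here is a small Python sketch that turns a nested run-length encoding into grammar rules and counts the terminals on the right-hand sides. It is my own illustration under stated assumptions: the encoding is a list of one-character symbols and `(count, sub_encoding)` tuples, and `to_grammar`, `terminal_count` and the non-terminal names `S`, `N1`, `N2`, ... are invented for the example.

```python
import itertools


def to_grammar(items):
    """Turn a nested encoding into rules {non_terminal: right_hand_side}.

    Each repeated group becomes a fresh non-terminal that appears `count`
    times in a row in its parent's rule.
    """
    rules = {"S": None}          # reserve the start symbol first
    numbers = itertools.count(1)

    def build(body, name):
        rhs = []
        for item in body:
            if isinstance(item, str):
                rhs.append(item)             # terminal symbol
            else:
                count, sub = item
                child = "N%d" % next(numbers)
                rules[child] = None          # reserve the name before recursing
                build(sub, child)
                rhs.extend([child] * count)  # used in direct sequence
        rules[name] = rhs

    build(items, "S")
    return rules


def terminal_count(rules):
    """Count terminals over all right-hand sides."""
    return sum(1 for rhs in rules.values() for sym in rhs if sym not in rules)


# (2A(2BC))D(2E)F
encoding = [(2, ["A", (2, ["B", "C"])]), "D", (2, ["E"]), "F"]
rules = to_grammar(encoding)
for name, rhs in rules.items():
    print(name, "->", " ".join(rhs))
print(terminal_count(rules))
```

For this encoding the rules are S -> N1 N1 D N3 N3 F, N1 -> A N2 N2, N2 -> B C and N3 -> E, with six terminals in total, the same as the six original symbols kept in (2A(2BC))D(2E)F.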

The smallest grammar problem is well studied and is known to be NP-hard. I don't know how much the "direct sequence" part adds to or subtracts from the complexity.