Efficient algorithm for converting a character set

Posted 2019-04-06 07:17

Question:

I'm currently working on a scanner generator. The generator already works fine, but when using character classes the algorithm gets very slow.

The scanner generator produces a scanner for UTF-8 encoded files. The full range of Unicode code points (0x000000 to 0x10FFFF) should be supported.

If I use large character sets, like the any operator '.' or the Unicode property {L}, the NFA (and also the DFA) contains a lot of states (> 10000), so converting the NFA to a DFA and creating the minimal DFA takes a long time (even if the resulting minimal DFA contains only a few states).

Here's my current implementation for creating the character-set part of the NFA.

void CreateNfaPart(int startStateIndex, int endStateIndex, Set<int> characters)
{
    transitions[startStateIndex] = CreateEmptyTransitionsArray();
    foreach (int character in characters) {
        // Get the UTF-8 encoded bytes for the character.
        byte[] encoded = EncodingHelper.EncodeCharacter(character);
        int tStartStateIndex = startStateIndex;
        // Walk (or create) the chain of states for all bytes except the
        // last one, reusing existing states so that common prefixes are shared.
        for (int i = 0; i < encoded.Length - 1; i++) {
            int tEndStateIndex = transitions[tStartStateIndex][encoded[i]];
            if (tEndStateIndex == -1) {
                tEndStateIndex = CreateState();
                transitions[tEndStateIndex] = CreateEmptyTransitionsArray();
            }
            transitions[tStartStateIndex][encoded[i]] = tEndStateIndex;
            tStartStateIndex = tEndStateIndex;
        }
        // The last byte leads directly to the common end state.
        transitions[tStartStateIndex][encoded[encoded.Length - 1]] = endStateIndex;
    }
}

Does anyone know how to implement this function much more efficiently, so that only the necessary states are created?

EDIT:

To be more specific, I need a function like:

List<Set<byte>[]> Convert(Set<int> characters)
{
     ???????
}

A helper function to convert a character (int) to its UTF-8 encoding (byte[]) is already defined as:

byte[] EncodeCharacter(int character)
{ ... }

Answer 1:

There are a number of ways to handle it. They all boil down to treating sets of characters at a time in the data structures, instead of ever enumerating the entire alphabet. It's also how you make scanners for Unicode in a reasonable amount of memory.

You have many choices about how to represent and process sets of characters. I'm presently working with a solution that keeps an ordered list of boundary conditions and corresponding target states. You can process operations on these lists much faster than you could if you had to scan the entire alphabet at each juncture. In fact, it's fast enough that it runs in Python with acceptable speed.
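
As an illustration only (the names BoundaryMap, Split, SetRange and Lookup are mine, not the answerer's, whose own implementation is in Python), here is a minimal sketch of such a boundary list in the question's C# style:

using System.Collections.Generic;

class BoundaryMap
{
    // boundaries[i] is the first code point of segment i; the segment runs
    // up to (but not including) boundaries[i + 1]. targets[i] is the state
    // reached on any code point in that segment (-1 = no transition).
    private readonly List<int> boundaries = new List<int> { 0 };
    private readonly List<int> targets = new List<int> { -1 };

    // Ensure a segment boundary exists at 'point' and return its index.
    private int Split(int point)
    {
        int i = boundaries.BinarySearch(point);
        if (i >= 0) return i;
        i = ~i;
        boundaries.Insert(i, point);
        targets.Insert(i, targets[i - 1]); // new segment inherits old target
        return i;
    }

    // Map all code points in [first, last] to 'target'.
    public void SetRange(int first, int last, int target)
    {
        Split(first);
        int hi = last < 0x10FFFF ? Split(last + 1) : boundaries.Count;
        int lo = boundaries.BinarySearch(first);
        for (int i = lo; i < hi; i++) targets[i] = target;
    }

    // Find the target state for a single code point by binary search.
    public int Lookup(int codePoint)
    {
        int i = boundaries.BinarySearch(codePoint);
        if (i < 0) i = ~i - 1;
        return targets[i];
    }
}

With this representation a class like '.' costs just two boundaries, no matter how many code points it covers, and set operations such as union and intersection become linear merges of two boundary lists.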



Answer 2:

Look at what regular expression libraries like Google RE2 and TRE are doing.



Answer 3:

I had the same problem with my scanner generator, so I came up with the idea of replacing intervals with ids, determined using an interval tree. For instance, the range a..z in a DFA can be represented as: 97, 98, 99, ..., 122. Instead, I represent ranges as [97, 122], then build an interval tree structure out of them, so in the end they are represented as ids referring into the interval tree. Given the RE a..z+, we end up with such a DFA:

0 -> a -> 1
0 -> b -> 1
0 -> c -> 1
0 -> ... -> 1
0 -> z -> 1

1 -> a -> 1
1 -> b -> 1
1 -> c -> 1
1 -> ... -> 1
1 -> z -> 1
1 -> E -> ACCEPT

Now compress intervals:

0 -> a..z -> 1

1 -> a..z -> 1
1 -> E -> ACCEPT

Extract all intervals from your DFA and build an interval tree out of them:

{
    "left": null,
    "middle": {
        id: 0,
        interval: [a, z],
    },
    "right": null
}

Replace the actual intervals with their ids:

0 -> 0 -> 1
1 -> 0 -> 1
1 -> E -> ACCEPT
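
A minimal sketch of the id lookup, assuming non-overlapping intervals (IntervalIndex and IdOf are hypothetical names, and a sorted array with binary search stands in for the interval tree, serving the same purpose):

using System.Collections.Generic;
using System.Linq;

class IntervalIndex
{
    private readonly List<(int First, int Last, int Id)> sorted;

    // Assign an id to each (non-overlapping) interval, ordered by start.
    public IntervalIndex(IEnumerable<(int First, int Last)> intervals)
    {
        sorted = intervals.OrderBy(iv => iv.First)
                          .Select((iv, id) => (iv.First, iv.Last, Id: id))
                          .ToList();
    }

    // Return the id of the interval containing codePoint, or -1 if none.
    public int IdOf(int codePoint)
    {
        int lo = 0, hi = sorted.Count - 1;
        while (lo <= hi)
        {
            int mid = (lo + hi) / 2;
            if (codePoint < sorted[mid].First) hi = mid - 1;
            else if (codePoint > sorted[mid].Last) lo = mid + 1;
            else return sorted[mid].Id;
        }
        return -1;
    }
}

The DFA then keys its transitions on interval ids, so the transition on [a, z] above occupies a single entry (id 0) instead of 26.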


Answer 4:

In this library (http://mtimmerm.github.io/dfalex/) I do it by putting a range of consecutive characters on each transition, instead of single characters. This is carried through all the steps of NFA construction, NFA->DFA conversion, DFA minimization, and optimization.

It's quite compact, but it adds code complexity to every step.
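
A hedged sketch of the idea in the question's C# (RangeTransition is a hypothetical name, not dfalex's actual API):

// Each transition carries an inclusive code-point range instead of a
// single character, so '.' becomes one transition covering the whole
// alphabet rather than 0x110000 separate ones.
struct RangeTransition
{
    public int First;   // first code point of the range (inclusive)
    public int Last;    // last code point of the range (inclusive)
    public int Target;  // destination state index

    public bool Matches(int codePoint) =>
        First <= codePoint && codePoint <= Last;
}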