Regex to match top level delimiters in a multi dim

Published 2020-02-16 04:18

Question:

I have a file containing a large multidimensional structure, similar to JSON but not close enough for me to use a JSON library.

The data looks something like this:

alpha {
    beta {
        charlie;
    }
    delta;
}

echo;
foxtrot {
    golf;
    hotel;
}

The regex I am trying to build (for a preg_match_all) should match each top-level parent (delimited by {} braces) so that I can recurse through the matches, building up a multidimensional PHP array that represents the data.

The first regex I tried is /(?<=\{).*(?=\})/s, which greedily matches the content inside braces. This isn't quite right: when the top level has more than one sibling, the match is too greedy and runs from the first opening brace to the last closing brace. Example below:

Using the regex /(?<=\{).*(?=\})/s, the match is given as:

Match 1:

    beta {
        charlie;
    }
    delta;
}

echo;
foxtrot {
    golf;
    hotel;

Instead the result should be:

Match 1:

    beta {
        charlie;
    }
    delta;

Match 2:

    golf;
    hotel;

So, regex wizards, what function am I missing here, or do I need to solve this with PHP somehow? Any tips are very welcome :)

Answer 1:

You can't[1] do this with regular expressions.

Alternatively, if you want to match deep-to-shallow blocks, you can use \{[^\{\}]*?\} and preg_replace_callback() to store the value, and return null to erase it from the string. The callback will need to take care of nesting the value accordingly.

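$hierarchicalStorage = []; // filled in by the callback
do {
    // Each pass strips the innermost {...} blocks; $count is the number replaced.
    $string = \preg_replace_callback('#\{[^\{\}]*?\}#', function($block)
    use(&$hierarchicalStorage) {
        // do your magic with $hierarchicalStorage
        // in here ($block[0] is the matched block, braces included)
        return null;
    }, $string, -1, $count);
} while ($count > 0);

Incomplete, not tested, and no warranty.

This approach requires that the string be wrapped in {} as well, otherwise the top-level content is never handed to the callback. The loop ends once a pass finds no more blocks, so leftover text or an unbalanced brace cannot make it loop forever.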

This is an awful lot of (inefficient) work for something that can just as easily be solved with a well known exchange/storage format such as JSON.
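For what it's worth, here is one way the deep-to-shallow idea could be fleshed out. It is only a sketch under assumptions that are not in the answer above: instead of erasing a block, the callback leaves a placeholder of the form =#N; behind so the parent can find its child again, the function name parseInsideOut() is made up for illustration, leaf entries are stored with a null value, and the data itself is assumed never to contain the sequence =#.

```php
<?php
// Sketch only: parseInsideOut() and the "=#N;" placeholder are assumptions.
function parseInsideOut(string $text): array
{
    $blocks = [];                 // parsed blocks, indexed by placeholder number
    $string = '{' . $text . '}';  // wrap so the top level is matched last

    // Turn the body of one brace-free block into an array.
    $parse = function (string $body) use (&$blocks): array {
        $tree = [];
        foreach (explode(';', $body) as $entry) {
            $entry = trim($entry);
            if ($entry === '') {
                continue;
            }
            if (preg_match('/^(.*?)\s*=#(\d+)$/s', $entry, $m)) {
                // "name =#N": a child block that was parsed on an earlier pass.
                $tree[$m[1]] = $blocks[(int) $m[2]];
            } else {
                $tree[$entry] = null; // plain "name;" entry
            }
        }
        return $tree;
    };

    do {
        // Replace every innermost {...} with a placeholder pointing at its parsed form.
        $string = preg_replace_callback('#\{([^{}]*)\}#', function ($m) use (&$blocks, $parse) {
            $blocks[] = $parse($m[1]);
            return '=#' . (count($blocks) - 1) . ';';
        }, $string, -1, $count);
    } while ($count > 0);

    // The block parsed last is the wrapper, i.e. the top level.
    return $blocks ? end($blocks) : [];
}

print_r(parseInsideOut("alpha { beta { charlie; } delta; } echo; foxtrot { golf; hotel; }"));
```

Each pass only ever sees blocks without nested braces, which is why the simple [^{}]* pattern is enough; the price is one full scan of the string per nesting level.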

[1] I was going to put "you can, but...", however I'll just say once again, "You can't".[2]

[2] Don't



Answer 2:

Sure you can do this with regular expressions.

preg_match_all(
    '/([^\s]+)\s*{((?:[^{}]*|(?R))*)}/',
    $yourStuff,
    $matches,
    PREG_SET_ORDER
);

This gives me the following in $matches:

[1]=>
string(5) "alpha"
[2]=>
string(46) "
beta {
    charlie;
}
delta;
"

and

[1]=>
string(7) "foxtrot"
[2]=>
string(22) "
golf;
hotel;
"

Breaking it down a little bit.

([^\s]+)                # non-whitespace (block name)
\s*                     # whitespace (between name and block)
{                       # literal brace
    (                   # begin capture
        (?:             # don't create another capture set
            [^{}]*      # everything not a brace
            |(?R)       # OR recurse
        )*              # none or more times
    )                   # end capture
}                       # literal brace

Just for your information, this works fine on n-deep levels of braces.
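To get from those matches to the nested PHP array the question asks for, the pattern can be applied again to each captured body. The sketch below is not the exact pattern from above but a slightly stricter variant, and those tweaks are assumptions: names may not contain braces or semicolons, the recursion targets the body group with (?2) instead of the whole pattern, and possessive quantifiers are used to limit backtracking. The function name parseBlocks() and the use of null for leaf entries are also just illustrative choices.

```php
<?php
// Sketch only: parseBlocks() is a hypothetical helper, not part of the answer above.
function parseBlocks(string $text): array
{
    // name { body }, where body is any run of non-brace text or balanced {...} groups.
    $pattern = '/([^\s{};]+)\s*\{((?:[^{}]++|\{(?2)\})*+)\}/';
    $tree = [];

    // Named blocks at this level become nested arrays.
    preg_match_all($pattern, $text, $matches, PREG_SET_ORDER);
    foreach ($matches as $m) {
        $tree[$m[1]] = parseBlocks($m[2]);
    }

    // Whatever remains once the blocks are removed are the plain "name;" entries.
    $rest = preg_replace($pattern, '', $text);
    foreach (explode(';', $rest) as $leaf) {
        $leaf = trim($leaf);
        if ($leaf !== '') {
            $tree[$leaf] = null;
        }
    }
    return $tree;
}

$data = <<<'TXT'
alpha {
    beta {
        charlie;
    }
    delta;
}

echo;
foxtrot {
    golf;
    hotel;
}
TXT;

echo json_encode(parseBlocks($data)), "\n";
// {"alpha":{"beta":{"charlie":null},"delta":null},"foxtrot":{"golf":null,"hotel":null},"echo":null}
```

Note that blocks are collected before leaf entries, so the original ordering between the two is not preserved, and leaves must be tested with array_key_exists() rather than isset() because their value is null.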



Answer 3:

I think you might get somewhere using preg_split, matching [a-zA-Z0-9]+[[:blank:]]*{ and }. You'll be able to construct your array by going through the result. Use a recursive function that goes one level deeper when it meets an opening brace and returns to the parent level on a closing brace.
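A rough sketch of that idea follows. It makes a few assumptions beyond the suggestion above: rather than splitting on "name {" as a unit, it splits on each of the three delimiter characters ({, } and ;) and keeps them as tokens, the helper names tokenize() and buildTree() are made up for illustration, and leaf entries are stored with a null value.

```php
<?php
// Sketch only: tokenize() and buildTree() are hypothetical helper names.
function tokenize(string $text): array
{
    // Split on {, } and ; (swallowing surrounding whitespace) and keep the delimiters.
    return preg_split(
        '/\s*([{};])\s*/',
        trim($text),
        -1,
        PREG_SPLIT_DELIM_CAPTURE | PREG_SPLIT_NO_EMPTY
    );
}

function buildTree(array $tokens, int &$pos = 0): array
{
    $tree = [];
    $name = null;               // most recent name not yet followed by { or ;
    $count = count($tokens);
    while ($pos < $count) {
        $token = $tokens[$pos++];
        if ($token === '{') {
            // Opening brace: go one level deeper for the pending name
            // (an unnamed block ends up under the empty-string key).
            $tree[$name ?? ''] = buildTree($tokens, $pos);
            $name = null;
        } elseif ($token === '}') {
            // Closing brace: hand the finished level back to the parent.
            return $tree;
        } elseif ($token === ';') {
            if ($name !== null) {
                $tree[$name] = null; // plain "name;" entry
                $name = null;
            }
        } else {
            $name = $token;
        }
    }
    return $tree;
}

$tree = buildTree(tokenize("alpha { beta { charlie; } delta; } echo; foxtrot { golf; hotel; }"));
echo json_encode($tree), "\n";
// {"alpha":{"beta":{"charlie":null},"delta":null},"echo":null,"foxtrot":{"golf":null,"hotel":null}}
```

Because the tokens are consumed in a single left-to-right pass, the entries keep their original order, and the regex itself never has to deal with nesting.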

Otherwise, the cleanest solution would be to write an ANTLR grammar!