recursively parse custom markup

2019-08-31 01:28发布

I must handle an already existing custom markup language (which is ugly, but unfortunately can not be altered because I'm handling legacy data and it needs to stay compatible with a legacy app).

I need to parse command "ranges", and depending on the action taken by the user either replace these "ranges" in the data with something else (HTML or LaTeX code) or entirely remove these "ranges" from the input.

My current solution solution is using preg_replace_callback() in a loop until there are no matches left, but it is utterly slow for huge documents. (i.e. ~7 seconds for 394 replacements in a 57 KB document)

Recursive regular expressions don't seem to be flexible enough for this task, as i need to access all matches, even in recursion.

Question: How could i improve the performance of my parsing?

Regular expressions may be completely removed - they are not a requirement but the only thing i could come up with.

Note: The code example below is heavily reduced. (SSCCE) Actually there are many different "types" of ranges and the closure function does different things depending on the mode of operation. (insert values from DB, remove entire ranges, convert to another format, etc..) Please keep this in mind!

Example of what I'm currently doing:

<?php
$data = <<<EOF
some text 1
begin-command
    some text 2
    begin-command
        some text 3
    command-end
    some text 4
    begin-command-if "%VAR%" == "value"
        some text 5
        begin-command
            some text 6
        command-end
    command-end
command-end

EOF;

$regex = '~
    # opening tag
    begin-(?P<type>command(?:-if)?)
    # must not contain a nested "command" or "command-if" command!
    (?!.*begin-command(?:-if)?.*command(?:-if)?-end)
    # the parameters for "command-if" are optional
    (?:
        [\s\n]*?
        (?:")[\s\n]*(?P<leftvalue>[^\\\\]*?)[\s\n]*(?:")
        [\s\n]*
        # the operator is optional
        (?P<operator>[=<>!]*)
        [\s\n]*
        (?:")[\s\n]*(?P<rightvalue>[^\\\\]*?)[\s\n]*(?:")
        [\s\n]*?
    )?
    # the real content
    (?P<content>.*?)
    # closing tag
    command(?:-if)?-end
 ~smx';

$counter = 0;
$loop_replace = true;
while ($loop_replace) {
    $data = preg_replace_callback($regex, function ($matches) use ($counter) {
        global $counter;
        $counter++;
        return "<command id='{$counter}'>{$matches['content']}</command>";
    }, $data, -1, $loop_replace);
}
echo $data;

2条回答
ら.Afraid
2楼-- · 2019-08-31 02:10

I've completely removed the regular expressions for parsing now. I realized that actually the raw input can be seen as a XML markup tree in some kind of weird representation.

Instead of using regular expressions, i now do the following:

  1. Replace everything which could be interpreted as XML with a text representation (using XML entities)
  2. Replace all begin-command ... command-end blocks with corresponding XML tags
    (Note there are actually several different commands)
  3. Let a real parser (XML DOM) handle the markup tree
  4. Iterate over the DOM recursively
  5. For each Node, execute the appropriate action, depending on the mode of operation

This seems ugly, but i really didn't want to write my own parser - that seemed a bit "overkill" in the limited time i have for improving the speed. And oh boy, that is still blazing fast - much faster than the RegExp solution. Impressive, when you consider the overhead converting the raw input to valid XML and back.

With "blazing fast" i mean it now takes a mere ~200ms for a document which previously needed 5-7 seconds to parse with several Regular Expressions.

Here is the code I'm using now:

// convert raw input to valid XML representation
$data = str_replace(
    array('<', '>', '&'), 
    array('&lt;', '&gt;', '&amp;'), 
    $data
);
$data = preg_replace(
    '!begin-(command|othercommand|morecommand)(?:-(?P<options>\S+))?!', 
    '<\1 options="\2">', 
    $data
);
$data = preg_replace(
    '!(command|othercommand|morecommand)-end!', 
    '</\1>', 
    $data
);

// use DOM to parse XML representation
$dom = new \DOMDocument();  
$dom->loadXML("<?xml version='1.0' ?>\n<document>".$data.'</document>');
$xpath = new \DOMXPath($dom);

// iterate over DOM, recursively replace commands with conversion results
foreach($xpath->query('./*') as $node) {
    if ($node->nodeType == XML_ELEMENT_NODE)
        convertNode($node, 'form', $dom, $xpath);
}

// convert XML DOM back to raw format
$data = $dom->saveXML();
$data = substr($data, strpos($data, "<document>")+10, -12);
$data = str_replace(
    array('&amp;', '&lt;', '&gt;'), 
    array('&', '<', '>'), 
    $data
);

// output the stuff
echo $data;

function convertNode (\DomNode $node, $output_mode, $dom, $xpath) {
    $type = $node->tagName;
    $children = $xpath->query('./*', $node);

    // recurse over child nodes
    foreach ($children as $childNode) {
        if ($childNode->nodeType == XML_ELEMENT_NODE) {
            convertNode($childNode, $output_mode, $dom, $xpath);
        }
    }

    // in production code, here is actual logic
    // to process the several command types
    $newNode = $dom->createTextNode(
        "<$type>" 
        . $node->textContent
        . "</$type>"
    );

    // replace node with command result
    if ($node->parentNode) {
        $node->parentNode->replaceChild($newNode, $node);
        // just to be sure - normalize parent node
        $newNode->parentNode->normalize();
    } 
}
查看更多
对你真心纯属浪费
3楼-- · 2019-08-31 02:14

Your look ahead on the 4th line of your regex:

(?!.*begin-command(?:-if)?.*command(?:-if)?-end)

this will have to read to the end of your file every time it is encountered (with the modifiers that are being used)

making your .*'s lazy may earn you a bit of a performance boost on those large files:

(?!.*?begin-command(?:-if)?.*?command(?:-if)?-end)

also if the (?:-if)? is always going to come after begin-command you can just get rid of it there, would make it something like:

(?!.*?begin-command.*?command(?:-if)?-end)  
查看更多
登录 后发表回答