I must handle an already existing custom markup language (which is ugly, but unfortunately cannot be altered, because I'm handling legacy data that needs to stay compatible with a legacy app).
I need to parse command "ranges", and depending on the action taken by the user either replace these "ranges" in the data with something else (HTML or LaTeX code) or entirely remove these "ranges" from the input.
My current solution uses preg_replace_callback()
in a loop until no matches are left, but it is utterly slow for large documents (about 7 seconds for 394 replacements in a 57 KB document).
Recursive regular expressions don't seem to be flexible enough for this task, as I need access to all matches, even the nested ones.
Question: How can I improve the performance of my parsing?
Regular expressions are not a requirement and may be dropped entirely; they are just the only approach I could come up with.
Note: The code example below is heavily reduced (an SSCCE). In reality there are many different "types" of ranges, and the callback does different things depending on the mode of operation (insert values from the DB, remove entire ranges, convert to another format, etc.). Please keep this in mind!
Example of what I'm currently doing:
<?php
$data = <<<EOF
some text 1
begin-command
some text 2
begin-command
some text 3
command-end
some text 4
begin-command-if "%VAR%" == "value"
some text 5
begin-command
some text 6
command-end
command-end
command-end
EOF;
$regex = '~
# opening tag
begin-(?P<type>command(?:-if)?)
# must not contain a nested "command" or "command-if" command!
(?!.*begin-command(?:-if)?.*command(?:-if)?-end)
# the parameters for "command-if" are optional
(?:
[\s\n]*?
(?:")[\s\n]*(?P<leftvalue>[^\\\\]*?)[\s\n]*(?:")
[\s\n]*
# the operator is optional
(?P<operator>[=<>!]*)
[\s\n]*
(?:")[\s\n]*(?P<rightvalue>[^\\\\]*?)[\s\n]*(?:")
[\s\n]*?
)?
# the real content
(?P<content>.*?)
# closing tag
command(?:-if)?-end
~smx';
$counter = 0;
$loop_replace = true;
while ($loop_replace) {
    // Capture $counter by reference; mixing use() with global was a bug.
    $data = preg_replace_callback($regex, function ($matches) use (&$counter) {
        $counter++;
        return "<command id='{$counter}'>{$matches['content']}</command>";
    }, $data, -1, $loop_replace);
}
echo $data;
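If the regex approach is dropped entirely, the nested ranges above can also be handled in a single pass with a small stack-based scanner. This is only an illustrative sketch under simplifying assumptions (one command type, each marker on its own line, no begin-command-if parameters), not code from the question:

```php
<?php
// Single-pass, stack-based alternative to looping preg_replace_callback().
// Assumes each marker sits on its own line; "-if" parameters are ignored.
function convert_ranges(string $data): string
{
    $counter = 0;
    $stack = [''];                       // one output buffer per nesting level
    foreach (explode("\n", $data) as $line) {
        $trim = trim($line);
        if (strpos($trim, 'begin-command') === 0) {
            $stack[] = '';               // open a new range
        } elseif ($trim === 'command-end') {
            $counter++;
            $inner = array_pop($stack);  // close the innermost open range
            $stack[count($stack) - 1] .=
                "<command id='{$counter}'>{$inner}</command>\n";
        } else {
            $stack[count($stack) - 1] .= $line . "\n";
        }
    }
    return rtrim($stack[0], "\n");
}
```

Because every line is touched exactly once, the run time grows linearly with the document size instead of re-scanning the whole input on each replacement pass.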
I've completely removed the regular expressions for parsing now. I realized that the raw input can actually be seen as an XML markup tree in a somewhat odd representation.
Instead of using regular expressions, I now replace the
begin-command ... command-end
blocks with corresponding XML tags and parse the result as an XML tree. (Note there are actually several different commands.)
This seems ugly, but I really didn't want to write my own parser; that seemed like overkill in the limited time I have for improving the speed. And it is still blazing fast, much faster than the regex solution, which is impressive considering the overhead of converting the raw input to valid XML and back.
By "blazing fast" I mean it now takes a mere ~200 ms for a document that previously needed 5-7 seconds to parse with several regular expressions.
Here is the code I'm using now:
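(The original code is not reproduced here; below is a minimal sketch of the XML round-trip idea described above. All identifiers are illustrative, and only the plain begin-command marker is handled; real code would map begin-command-if and its parameters as well.)

```php
<?php
// Sketch: map the custom markers onto XML, parse once, transform, serialize.
$data = <<<EOF
some text 1
begin-command
some text 2
begin-command
some text 3
command-end
command-end
some text 4
EOF;

// 1. Turn the markers into well-formed XML tags and wrap everything
//    in a root element so the document is valid XML.
$xml = '<root>' . str_replace(
    ['begin-command', 'command-end'],
    ['<command>', '</command>'],
    $data
) . '</root>';

// 2. Parse once instead of re-running a regex over the whole document.
$doc = new DOMDocument();
$doc->loadXML($xml);

// 3. Transform every <command> node, e.g. number them like the
//    closure in the question did.
$counter = 0;
foreach ($doc->getElementsByTagName('command') as $node) {
    $counter++;
    $node->setAttribute('id', (string) $counter);
}

// 4. Serialize the children of the wrapper element back to text.
$out = '';
foreach ($doc->documentElement->childNodes as $child) {
    $out .= $doc->saveXML($child);
}
echo $out;
```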
Your lookahead on the 4th line of your regex,
(?!.*begin-command(?:-if)?.*command(?:-if)?-end)
will have to read to the end of your file every time it is encountered (with the modifiers that are being used).
Making your .*'s lazy may earn you a bit of a performance boost on those large files:
(?!.*?begin-command(?:-if)?.*?command(?:-if)?-end)
Also, since the (?:-if)? suffix always comes right after begin-command, the shorter begin-command already matches both variants inside the lookahead, so you can drop it there, which would make it something like:
(?!.*?begin-command.*?command(?:-if)?-end)
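To verify that a lookahead tightened along these lines still behaves as intended, here is a quick self-contained check (my own sketch with a simplified pattern, not part of the original answer): a block only matches when it contains no nested command, so the innermost block is always found first.

```php
<?php
// Simplified pattern with lazy quantifiers in the negative lookahead.
$regex = '~begin-command(?:-if)?'
       . '(?!.*?begin-command.*?command(?:-if)?-end)'
       . '(?P<content>.*?)command(?:-if)?-end~s';

$flat   = "begin-command\nplain text\ncommand-end";
$nested = "begin-command\nouter\nbegin-command\ninner\ncommand-end\ncommand-end";

var_dump((bool) preg_match($regex, $flat));  // a flat block matches
preg_match($regex, $nested, $m);
var_dump($m['content']);                     // only the innermost block matched
```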