I need a REGEX that can find blocks of PHP code in a file. For example:
<? print '<?xml version="1.0" encoding="UTF-8"?>';?>
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">
<html xmlns="http://www.w3.org/1999/xhtml" xml:lang="en">
<head>
<?php echo "stuff"; ?>
</head>
<html>
When parsed by the regex, this would return:
array(
    "<? print '<?xml version=\"1.0\" encoding=\"UTF-8\"?>';?>",
    "<?php echo \"stuff\"; ?>"
);
You can assume the PHP is valid.
With token_get_all you get a list of the PHP language tokens in a given piece of PHP code. Then you just need to iterate over the list, looking for the open-tag tokens and the corresponding close tags.
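$blocks = array();
$buffer = '';
$opened = false;
foreach (token_get_all($code) as $token) {
    if (!$opened) {
        // Outside a block: wait for an opening tag.
        if (is_array($token) && ($token[0] === T_OPEN_TAG || $token[0] === T_OPEN_TAG_WITH_ECHO)) {
            $opened = true;
            $buffer = $token[1];
        }
    } else {
        if (is_array($token)) {
            // Array tokens carry their source text at index 1.
            $buffer .= $token[1];
            if ($token[0] === T_CLOSE_TAG) {
                $opened = false;
                $blocks[] = $buffer;
            }
        } else {
            // Single-character tokens such as ";" are plain strings.
            $buffer .= $token;
        }
    }
}
// A file may end without a closing tag; keep that trailing block as well.
if ($opened) {
    $blocks[] = $buffer;
}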
This is the type of task that is much better suited to a custom parser. You could construct one relatively easily using a stack, and I can guarantee you will be done much quicker, and pull out less hair, than you would trying to debug your regex.
Regular expressions are great tools when used appropriately but not all text parsing tasks are equal.
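To make the custom-parser idea concrete, here is a minimal sketch of a hand-written scanner. It is only an illustration under stated assumptions: the function name extractPhpBlocks is hypothetical, it tracks quoted strings with a single state variable rather than a full stack, and it does not handle comments or heredocs, so treat it as a starting point rather than a complete parser.

```php
<?php
/* Hypothetical helper: return every PHP block found in $source, from the
   opening tag through the closing tag. Only quoted strings are tracked;
   comments and heredocs are not handled in this sketch. */
function extractPhpBlocks(string $source): array
{
    $blocks = [];
    $length = strlen($source);
    $pos = 0;

    while (($start = strpos($source, '<?', $pos)) !== false) {
        $quote = null;   // the quote character of the string we are inside, if any
        $end = null;     // offset just past the closing tag, once found
        $i = $start + 2;

        while ($i < $length) {
            $char = $source[$i];
            if ($quote !== null) {
                if ($char === '\\') {
                    $i += 2; // skip the escaped character
                    continue;
                }
                if ($char === $quote) {
                    $quote = null; // string literal ends here
                }
            } elseif ($char === '"' || $char === "'") {
                $quote = $char; // string literal starts here
            } elseif ($char === '?' && $i + 1 < $length && $source[$i + 1] === '>') {
                $end = $i + 2; // closing tag outside of any string
                break;
            }
            $i++;
        }

        if ($end === null) {
            // No closing tag: the block runs to the end of the input.
            $blocks[] = substr($source, $start);
            break;
        }
        $blocks[] = substr($source, $start, $end - $start);
        $pos = $end;
    }

    return $blocks;
}

// Usage with input modelled on the question.
$sample = "<? print '<?xml version=\"1.0\" encoding=\"UTF-8\"?>';?>\n"
        . "<head>\n"
        . "<?php echo \"stuff\"; ?>\n"
        . "</head>\n";
$blocks = extractPhpBlocks($sample);
print_r($blocks);
```

Because the scanner knows when it is inside a quoted string, the closing-tag characters inside the XML declaration do not end the first block early.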
Try the following regex using preg_match_all() (preg_match() on its own stops after the first match):
/<\?(?:php)?\s+(.*?)\?>/
That's untested, but it's a start. It assumes each block has a closing PHP tag (arguably the well-formed case).
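As a rough illustration of how this pattern behaves, the sketch below runs it over input modelled on the question. The s modifier and the variable names are additions for this example, not part of the pattern as posted.

```php
<?php
// Sample input modelled on the question, built as a plain string.
$html = "<? print '<?xml version=\"1.0\" encoding=\"UTF-8\"?>';?>\n"
      . "<head>\n"
      . "<?php echo \"stuff\"; ?>\n"
      . "</head>\n";

// The s modifier is added here so that a block may span several lines.
preg_match_all('/<\?(?:php)?\s+(.*?)\?>/s', $html, $matches);

// $matches[0] holds the full blocks, $matches[1] the code between the tags.
print_r($matches[0]);
```

Note that the lazy quantifier stops at the first closing-tag sequence it meets, so the first match ends inside the quoted XML declaration instead of at the real end of the block. That is the kind of case where a tokenizer or parser is more reliable than a regex.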
Try this regex (untested):
preg_match_all('@<\?.*?\?>@si',$html,$m);
print_r($m[0]);
<\?(?:php)?\s+.*?\?>$
with the following modifiers:
s: dot matches newlines
m: ^ and $ match at line breaks
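In PHP those two modifiers are written as s and m after the closing delimiter. A minimal sketch, assuming input with "\n" line endings and variable names chosen only for this example:

```php
<?php
// Sample input modelled on the question, built as a plain string.
$html = "<? print '<?xml version=\"1.0\" encoding=\"UTF-8\"?>';?>\n"
      . "<head>\n"
      . "<?php echo \"stuff\"; ?>\n"
      . "</head>\n";

// s: dot matches newlines; m: the end anchor also matches before a line break.
preg_match_all('/<\?(?:php)?\s+.*?\?>$/sm', $html, $matches);

print_r($matches[0]);
```

Because the closing tag must sit at the end of a line, the closing-tag characters inside the quoted XML declaration are skipped and both blocks from the sample come back whole. The trade-off is that a block followed by other content on the same line would not be matched correctly.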