Text replacement: PHP/regex

Posted 2020-04-19 07:21

I am presented with an HTML document similar to this in view source mode (the below is simplified for brevity):

<html>
    <head>
        <title>System version: {{variable:system_version}}</title>
    </head>
    <body>
        <p>You are using system version {{variable:system_version}}</p>
        {{block:welcome}}
        <form>
            <input value="System version: {{variable:system_version}}">
            <textarea>
                You are using system version {{variable:system_version}}.
            </textarea>
        </form>
    </body>
</html>

I have written some functions that can replace these {{...}} type strings, but they need to be replaced selectively.

In the example above, I want it replaced in <title> and <p>, but not in <input> and <textarea>: that content is user-provided input, inserted via a WYSIWYG editor or form, and must be saved exactly as received from the user. The {{block:welcome}} placeholder must also be replaced with whatever content that block contains.

When rendering my output, I will sanitize it; the result should then look something like this:

<html>
    <head>
        <title>System version: 6.0</title>
    </head>
    <body>
        <p>You are using system version 6.0</p>
        <div>
            This was the content of the welcome block.
        </div>
        <form>
            <input value="System version: {{variable:system_version}}">
            <textarea>
                You are using system version {{variable:system_version}}.
            </textarea>
        </form>
    </body>
</html>

Here is what I have tried. In the code below, $var's value is '6.0', $val's value is '{{variable:system_version}}', and $data is the entire string to be searched:

if (!preg_match('/<textarea|<input|<select(.+?)' . $val . '(.+?)<\/textarea|<\/input|<\/select\>/s', $data)) {
    $data = str_replace($val, $var, $data);
}    

Please advise what is wrong with my regex, as it currently replaces nothing whatsoever, so the if condition is never matched. If I do the str_replace without the if, the replacements are made, in all cases.

EDIT 1

After some assistance by @Emma, the replacement still does not work. The below is the code that does the replacement as it stands:

    function replace_variable($matches, $data)
    {
        $ci =& get_instance();
        if (!empty($matches['variable_matches'])) {
            foreach ($matches['variable_matches'][0] as $key => $val) {
                $vals = explode(':', $val);
                $ci->load->module('core');
                $var = $ci->core->get_variable(rtrim($vals[1], '}}'));
                $re1 = '/<(?:textarea|select)[\s\S]*?>[\s\S]*?(' . $val . ')[\s\S]*?<\/(?:textarea|select)>/';
                $re2 = '/<(?:input)[\s\S]*?(' . $val . ')[\s\S]*?>/';
                if (!preg_match($re1, $data) && !preg_match($re2, $data)) {
                    $data = str_replace($val, $var, $data);
                }
            }
        }
        return $data;
    }

Here are the matches found via preg_match. I then try to replace each one via str_replace wherever it is NOT inside a form tag (select/textarea/input).

Array
(
    [0] => Array
        (
            [0] => {{variable:system_version}}
            [1] => {{variable:system_version}}
            [2] => {{variable:system_version}}
            [3] => {{variable:system_version}}
        )

    [1] => Array
        (
            [0] => system_version
            [1] => system_version
            [2] => system_version
            [3] => system_version
        )

)

So - there are four matches on the page where I try to replace, two of them inside form tags, the other two not. The check is done on the entire output that is buffered, and contains all four elements, but somehow, the preg_match triggers for all of them, despite the regex. Any ideas what I am doing wrong?

2 answers
你好瞎i
#2 · 2020-04-19 07:25

My guess is that you are likely designing an expression similar to:

<(?:textarea|select)[\s\S]*?({{variable:system_version}})[\s\S]*?<\/(?:textarea|select)>|<(?:input)[\s\S]*?({{variable:system_version}})[\s\S]*?>

You will probably want to modify it, and then substitute whatever replacement you need.

The expression is explained in the top right panel of regex101.com if you wish to explore, simplify, or modify it; there you can also watch how it matches against some sample inputs.
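
As a side note, a likely reason the pattern in the question never lets the replacement run is alternation precedence: "|" binds more loosely than everything around it, so the pattern is really five separate alternatives, and the first one is just "<textarea". The sketch below illustrates this; the sample $data and the $grouped pattern are illustrative assumptions, not code from the question.

```php
<?php
// Hypothetical sample input: the placeholder appears only OUTSIDE form controls.
$val  = '{{variable:system_version}}';
$data = '<p>Version {{variable:system_version}}</p><textarea>plain text</textarea>';

// Pattern from the question. Because "|" has the lowest precedence, the
// alternatives are: "<textarea", "<input", "<select(.+?)...<\/textarea",
// "<\/input" and "<\/select>".
$original = '/<textarea|<input|<select(.+?)' . $val . '(.+?)<\/textarea|<\/input|<\/select\>/s';

// Grouped sketch: the alternation is confined to the tag names, the scan stops
// at the closing tag, and preg_quote() makes the placeholder strictly literal.
$grouped = '/<(?:textarea|select)\b[^>]*>(?:(?!<\/(?:textarea|select)>).)*?'
         . preg_quote($val, '/') . '/s';

$originalHit = preg_match($original, $data); // matches the bare "<textarea"
$groupedHit  = preg_match($grouped, $data);  // no placeholder inside a control
```

With any textarea, input, or closing select tag on the page, the original pattern matches, so the negated if condition is false and str_replace is skipped.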

Test

$re = '/<(?:textarea|select)[\s\S]*?({{variable:system_version}})[\s\S]*?<\/(?:textarea|select)>|<(?:input)[\s\S]*?({{variable:system_version}})[\s\S]*?>/m';
$str = '<html>
    <head>
        <title>System version: 6.0</title>
    </head>
    <body>
        <p>You are using system version 6.0</p>
        <div>
            This was the content of the welcome block.
        </div>
        <form>
            <input value="System version: {{variable:system_version}}">
            <textarea>
                You are using system version {{variable:system_version}}.
            </textarea>
        </form>
    </body>
</html>';

preg_match_all($re, $str, $matches, PREG_SET_ORDER, 0);

var_dump($matches);

Edit for two steps:

<(?:textarea|select)[\s\S]*?>[\s\S]*?<\/(?:textarea|select)>|<(?:input)[\s\S]*?>

Demo 1

^<(?:input)[\s\S]*?({{variable:system_version}})[\s\S]*?>$

Demo 2

^<(?:input).*?({{variable:system_version}}).*?>$

Demo 3
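
One possible reading of the two-step idea, as a minimal sketch: first isolate the form controls, then replace only in the text between them. Checking the whole buffered page with preg_match cannot work, because a single placeholder inside any form control makes the check succeed for every occurrence. The function name replace_outside_form_controls and the sample markup are hypothetical, and the pattern assumes that attribute values do not contain ">".

```php
<?php
// Hypothetical helper: replace $search with $replace everywhere except
// inside <textarea>, <select> and <input> elements.
function replace_outside_form_controls(string $search, string $replace, string $html): string
{
    // Step 1: capture whole form controls so preg_split() keeps them as
    // separate pieces (PREG_SPLIT_DELIM_CAPTURE).
    $controls = '~(<(?:textarea|select)\b[^>]*>.*?</(?:textarea|select)>|<input\b[^>]*>)~is';
    $parts = preg_split($controls, $html, -1, PREG_SPLIT_DELIM_CAPTURE);
    if ($parts === false) {
        return $html; // regex error: leave the markup untouched
    }

    // Step 2: even indexes hold the text between controls; odd indexes hold
    // the captured controls, which are left exactly as received.
    foreach ($parts as $i => $part) {
        if ($i % 2 === 0) {
            $parts[$i] = str_replace($search, $replace, $part);
        }
    }
    return implode('', $parts);
}

$html = '<p>v {{variable:system_version}}</p>'
      . '<input value="{{variable:system_version}}">'
      . '<textarea>{{variable:system_version}}</textarea>';
$result = replace_outside_form_controls('{{variable:system_version}}', '6.0', $html);
```

Only the occurrence inside the <p> is replaced here; the input value and the textarea content keep their placeholders.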

走好不送
#3 · 2020-04-19 07:35

I was about to post an answer on your next question, but Casimir closed it before I got the chance. I am coming back here to post a proper HTML parse-then-replace technique for the benefit of researchers and you.

Code: (Demo)

define('LOOKUP', [
    'block' => [
        'welcome-intro'         => 'custom intro'
    ],
    'variable' => [
        'contact-email-address' => 'mmu@mmu.com',
        'system_version'        => 'sys ver',
        'system_name'           => 'sys name',
        'system_login'          => 'sys login',
        'activate_url'          => 'some url'
    ],

]);

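// $html must already hold the page markup; it is not defined in this snippet.
libxml_use_internal_errors(true); // collect libxml warnings instead of emitting them
$dom = new DOMDocument();
$dom->loadHTML($html);
$xpath = new DOMXPath($dom);

// Text nodes whose parent is not a textarea/select/input and whose parent contains "{{{".
foreach ($xpath->query("//*[not(self::textarea or self::select or self::input) and contains(., '{{{')]/text()") as $node) {
    // Swap each {{{type:name}}} placeholder for its LOOKUP value.
    $node->nodeValue = preg_replace_callback(
        '~{{{([^:]+):([^}]+)}}}~',
        function ($m) {
            return LOOKUP[$m[1]][$m[2]] ?? '**unknown variable**';
        },
        $node->nodeValue
    );
}
echo $dom->saveHTML();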

Output:

<!DOCTYPE html>
<html lang="en"><head><meta charset="utf-8"><title>Test</title></head><body>
    <section id="about"><div class="container about-container">
            <div class="row">
                <div class="col-md-12">
                    custom intro
                </div>
            </div>
        </div>
    </section><section id="services"><div class="container">
            <div class="row">
                <div class="col-md-12">
                                        <p>You are using system version: sys ver</p>
                    <p>Your address: mmu@mmu.com</p>
                    <form action="http://k.loc/content/view/welcome" class="default-form" enctype="multipart/form-data" method="post" accept-charset="utf-8">
                                                                                    <input type="hidden" name="csrfkcmstoken" value="94ee71ada809b9a79d1b723c81020c78"><div class="row">
                            <div class="col-sm-12 form-error"></div>
                        </div>
                    <div class="row"><div class="col-sm-12"><fieldset id="personalinfo"><legend>Personal information</legend><div class="row"><div class="col-sm-12">
                    <div class="control-label">
                        <label for="testinput">Name<span class="form-validation-required"> * </span></label>

                    </div>
                <div class="hint-text">Enter at least 2 characters and a maximum of 12 characters.</div><input id="testinput" name="testinput" placeholder="Enter your name here." class="input-group width-50" type="text" value="{{{variable:system_name}}}  {{{variable:system_login}}}"><div class="row"><div class="col-sm-12"><div class="form-error"></div></div></div></div></div><div class="row"><div class="col-sm-12">
                    <div class="control-label">
                        <label for="testpassword">Password</label>

                    </div>
                <div class="hint-text">Your password must be at least 12 characters long, contain 1 special character, 1 nunber, 1 lower case character and 1 upper case character.</div><input id="testpassword" name="testpassword" placeholder="Enter your password here." class="input-group width-50" type="password"><div class="row"><div class="col-sm-12"><div class="form-error"></div></div></div></div></div></fieldset></div></div><div class="row"><div class="col-sm-12"><fieldset id="bioinfo"><legend>Biographical information</legend><div class="row"><div class="col-sm-12">
                    <div class="control-label">
                        <label for="testtextarea">Biography</label>
                <span class="hint-text">A minimum of 40 characters and a maximum of 255 is allowed. This hint is displayed inline.</span>
                    </div>
                <textarea id="testtextarea" name="testtextarea" placeholder="Please enter your biography here." class="input-group-wide width-100" rows="5" cols="80">{{{variable:system_name}}}

{{{variable:system_login}}}</textarea><div class="row"><div class="col-sm-12"><div class="form-error"></div></div></div></div></div><div class="row"><div class="col-sm-12">
                    <div class="control-label">
                        <label for="testsummernote">Interests</label>
                <span class="hint-text">A minimum of 40 characters is required. This hint is displayed inline.</span>
                    </div>
                <textarea id="testsummernote" name="testsummernote" class="wysiwyg-editor" placeholder="Please enter your interests here."><p>sys name<br></p><p>sys login</p><p>some url<br></p></textarea></div></div></fieldset></div></div><div class="row"><div class="col-sm-12"><button name="testsubmit" id="testsubmit" type="submit" class="btn primary">Submit<i class="zmdi zmdi-arrow-forward"></i></button></div></div>
        </form>                </div>
            </div>
        </div>
    </section></body></html>

There aren't too many tricks involved.

  1. Parse the HTML with DOMDocument and write an XPath query that selects the text nodes of elements that are not textarea, select, or input tags and that contain {{{ in their text. There are several other ways to filter the DOM; this is just one that feels efficient and direct to me.

  2. I use preg_replace_callback() to perform replacements based on a lookup array.

  3. To avoid use() in the callback syntax, I make the lookup available inside the callback's scope by declaring it as a constant (I can't imagine you need it to be a variable anyhow).

  4. I found during testing that DOMDocument didn't like the <section> tags, so I silenced the complaints with libxml_use_internal_errors(true);.
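
For comparison, here is a small self-contained variation on the steps above, written as a sketch under a few assumptions: it uses the {{type:name}} syntax from the question, a hypothetical $lookup array passed to the callback with use() instead of a constant, and an ancestor test instead of a test on the parent element. The ancestor test also skips markup nested inside a textarea; in the output above, the placeholders inside the testsummernote textarea appear to have been replaced because their direct parent is a <p>.

```php
<?php
// Hypothetical lookup table and input, using the {{type:name}} syntax.
$lookup = ['variable' => ['system_version' => '6.0']];
$html = '<html><head><title>Version {{variable:system_version}}</title></head><body>'
      . '<p>Using {{variable:system_version}}</p>'
      . '<textarea>Keep {{variable:system_version}}</textarea>'
      . '</body></html>';

libxml_use_internal_errors(true); // collect parser warnings instead of printing them
$dom = new DOMDocument();
$dom->loadHTML($html);
libxml_clear_errors();
$xpath = new DOMXPath($dom);

// Text nodes that contain "{{" and have no textarea/select ancestor.
// Attribute values (such as an input's value) are not text nodes, so they
// are never selected.
$query = "//text()[contains(., '{{') and not(ancestor::textarea or ancestor::select)]";
foreach ($xpath->query($query) as $node) {
    $node->nodeValue = preg_replace_callback(
        '~\{\{([^:}]+):([^}]+)\}\}~',
        function (array $m) use ($lookup) {
            // Unknown placeholders are left as they are.
            return $lookup[$m[1]][$m[2]] ?? $m[0];
        },
        $node->nodeValue
    );
}
$result = $dom->saveHTML();
```

Whether a constant or use() is preferable is a matter of taste; use() keeps the lookup a normal variable, which may help if it is loaded at runtime.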
