How to search urls that are not in any html tag an

Posted 2019-08-09 16:33

My content contains iframes, image tags, and so on. Each of those already has a regex that converts it into the correct format.

The last thing left is plain URLs. I need a regex that finds all links that are simply links and not inside an iframe, img, or any other tag. The tags in this case are regular HTML tags, not BBCode.

Currently I have this code as the last pass of the content rendering. But it also reacts to everything produced by the earlier passes (the iframe and img renderings), so it swaps out the URLs there as well.

$output = preg_replace(array(
    '%\b(([\w-]+://?|www[.])[^\s()<>]+(?:\([\w\d]+\)|([^[:punct:]\s]|/)))%s'
), array(
    'test'
), $output);

And my content looks something like this:

# dont want these to be touched
<iframe width="640" height="360" src="http://somedomain.com/but-still-its-a-link-to-somewhere/" frameborder="0"></iframe>
<img src="http://someotherdomain.com/here-is-a-img-url.jpg" border="0" />

# and only these converted
http://google.com
http://www.google.com
https://www2.google.com<br />
www.google.com

As you can see, there might also be something at the end of the link. After a full day of trying to get regexes to work, that last <br /> has been a nightmare for me.

Tags: php regex url
1 answer
Deceive
#2 · 2019-08-09 17:02

Description

This solution matches the URLs that are not inside a tag (for example, in an attribute value) and replaces them with something new.

The regular expression matches both the text you want to skip (whole HTML tags) and the text you want to replace (bare URLs). preg_replace_callback then runs an anonymous function that tests whether capture group 1 is populated (this is the desired URL). If it is, the function returns the replacement; otherwise it returns the matched tag unchanged.
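To make the "match what you skip, capture what you replace" idea concrete, here is a minimal sketch on a deliberately simplified pattern. The pattern, the sample input, and the bracket replacement are assumptions for illustration only; the simplified tag matcher `<[^>]*>` would break on an attribute value containing `>`, which the full expression in this answer handles.

```php
<?php
// Hypothetical, simplified pattern:
//   alternative 1 swallows a whole tag (nothing captured),
//   alternative 2 captures a bare "www." link in group 1.
$pattern = '/<[^>]*>|(www\.[^\s<>]+)/i';

$input = '<img src="www.example.com/a.png"> www.example.com';

$result = preg_replace_callback(
    $pattern,
    function (array $m): string {
        // Group 1 is populated only when the second alternative matched.
        if (isset($m[1]) && $m[1] !== '') {
            return '[' . $m[1] . ']';
        }
        // Otherwise a tag was matched: hand it back untouched.
        return $m[0];
    },
    $input
);

echo $result, "\n";
// prints: <img src="www.example.com/a.png"> [www.example.com]
```

Because the tag alternative consumes the whole tag, the engine never gets a chance to start a URL match inside an attribute value.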

I used your URL-matching regex with some minor modifications, such as converting the unused capture groups (...) to non-capture groups (?:...). This can make the regex engine run slightly faster, and it makes the expression easier to modify because the whole URL stays in capture group 1.
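As a small illustration of what that conversion changes, the sketch below runs the same match with a capturing and a non-capturing prefix group. The sample text and both patterns are made up for this example and are not part of the expression in this answer.

```php
<?php
$text = 'www.example.com';

// Capturing group: the "www." prefix gets its own slot in the result.
preg_match('/(www[.])(\S+)/', $text, $m);
// $m is ['www.example.com', 'www.', 'example.com']

// Non-capturing group: same overall match, but no slot for the prefix,
// so the part we care about moves up to index 1.
preg_match('/(?:www[.])(\S+)/', $text, $n);
// $n is ['www.example.com', 'example.com']

var_dump($m, $n);
```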

The raw expression: <(?:[^'">=]*|='[^']*'|="[^"]*"|=[^'"][^\s>]*)*>|((?:[\w-]+:\/\/?|www[.])[^\s()<>]+(?:\([\w\d]+\)|(?:[^[:punct:]\s]|\/)))
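For readability, the same expression can be laid out with the x (extended) modifier so that each part carries a comment. This is only a sketch that is intended to be equivalent to the raw expression above; the `~` delimiter, the variable names, and the sample string are choices made for this example.

```php
<?php
// Extended layout: whitespace outside character classes is ignored,
// and "#" starts a comment that runs to the end of the line.
$regex = '~
    <                           # start of a tag
    (?:
        [^\'">=]*               # tag name, attribute names, spaces
      | =\'[^\']*\'             # single-quoted attribute value
      | ="[^"]*"                # double-quoted attribute value
      | =[^\'"][^\s>]*          # unquoted attribute value
    )*
    >                           # end of the tag
  |
    (                           # group 1: a bare URL
        (?:[\w-]+://?|www[.])   # scheme or leading www.
        [^\s()<>]+              # body of the URL
        (?:\([\w\d]+\)|(?:[^[:punct:]\s]|/))  # last char is not punctuation
    )
~imsx';

// Hypothetical sample: one URL inside a tag, one bare URL before a <br />.
$sample = '<img src="http://example.com/a.jpg" /> http://example.org<br />';

preg_match_all($regex, $sample, $sets, PREG_SET_ORDER);

$urls = [];
foreach ($sets as $set) {
    // Tag matches leave group 1 empty or absent; keep only real URL hits.
    if (!empty($set[1])) {
        $urls[] = $set[1];
    }
}

print_r($urls);
```

Only the bare URL ends up in `$urls`; the `src` value is consumed as part of the `<img>` tag match and never reaches group 1.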


Example

Code

<?php

$string = '# dont want these to be touched
<iframe width="640" height="360" src="http://somedomain.com/but-still-its-a-link-to-somewhere/" frameborder="0"></iframe>
<img src="http://someotherdomain.com/here-is-a-img-url.jpg" border="0" />

# and only these converted
http://google.com
http://www.google.com
https://www2.google.com<br />
www.google.com';


$regex = '/<(?:[^\'">=]*|=\'[^\']*\'|="[^"]*"|=[^\'"][^\s>]*)*>|((?:[\w-]+:\/\/?|www[.])[^\s()<>]+(?:\([\w\d]+\)|(?:[^[:punct:]\s]|\/)))/ims';

$output = preg_replace_callback(
    $regex,
    function ($matches) {
        // Group 1 is populated only when a bare URL was matched.
        if (!empty($matches[1])) {
            $url = $matches[1];
            // A link starting with "www." needs a scheme, or the browser treats it as relative.
            $href = (stripos($url, 'www.') === 0) ? 'http://' . $url : $url;
            return '<a href="' . $href . '">' . $url . '</a>';
        }
        // Otherwise a whole tag was matched; return it unchanged.
        return $matches[0];
    },
    $string
);
echo $output;

Output

# dont want these to be touched
<iframe width="640" height="360" src="http://somedomain.com/but-still-its-a-link-to-somewhere/" frameborder="0"></iframe>
<img src="http://someotherdomain.com/here-is-a-img-url.jpg" border="0" />

# and only these converted
<a href="http://google.com">http://google.com</a>
<a href="http://www.google.com">http://www.google.com</a>
<a href="https://www2.google.com">https://www2.google.com</a><br />
<a href="http://www.google.com">www.google.com</a>