I am trying to convert all URLs in a textarea input ($_POST['content']) to links.
$content = preg_replace('!(\s|^)((https?://)+[a-z0-9_./?=&-]+)!i', ' <a href="$2" target="_blank">$2</a> ', nl2br($_POST['content'])." ");
$content = preg_replace('!(\s|^)((www\.)+[a-z0-9_./?=&-]+)!i', '<a target="_blank" href="http://$2" target="_blank">$2</a> ', $content." ");
Target link formats: www.hello.com or http(s)://(www).hello.com
But this seems to break any iframe, image or similar.
What is the right regex (or regexes) that will ignore URLs inside HTML tags?
Note: I know I need two expressions: one to detect links without a protocol (like www.hello.com, so I need to prepend it) and another to detect URLs with a protocol (so there is no need to prepend).
Your code as it is should not cause much trouble inside iframes and the like, because there the URL is usually preceded by a " and not by whitespace, which your pattern requires.
However, here is a different solution. It might not work 100% if you have a single < or > within HTML comments or something similar (and I do not know whether this is a problem for you or not). In any other case, it should serve you well. It uses a negative lookahead to make sure that the URL is not followed by a closing > before the next opening < (because that would mean you are inside a tag).
$content = preg_replace('$(\s|^)(https?://[a-z0-9_./?=&-]+)(?![^<>]*>)$i', ' <a href="$2" target="_blank">$2</a> ', $content." ");
$content = preg_replace('$(\s|^)(www\.[a-z0-9_./?=&-]+)(?![^<>]*>)$i', ' <a href="http://$2" target="_blank">$2</a> ', $content." ");
In case you are not familiar with this technique, here is a bit more elaboration.
(?! # starts the lookahead assertion; now your pattern will only match, if this subpattern does not match
[^<>] # any character that is neither < nor >; the > is not strictly necessary but might help for optimization
* # arbitrarily many of those characters (in a row, so with no < or > in between)
> # the closing >
) # ends the lookahead subpattern
Note that I changed the regex delimiters, because I am now using ! within the regex.
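To make the effect of the lookahead concrete, here is a minimal sketch. The sample string is a made-up input (not taken from the question), and only the protocol pattern is applied, without the leading (\s|^) group.

```php
<?php
// Hypothetical input: one URL inside a tag attribute, one in plain text.
$content = '<img src="http://example.com/a.png"> see http://example.com/page';

// (?![^<>]*>) rejects a match when a closing > follows before any other < or >,
// which is the case for the URL inside the src attribute.
$linked = preg_replace(
    '$(https?://[a-z0-9_./?=&-]+)(?![^<>]*>)$i',
    '<a href="$1" target="_blank">$1</a>',
    $content
);

echo $linked, "\n";
// <img src="http://example.com/a.png"> see <a href="http://example.com/page" target="_blank">http://example.com/page</a>
```

The URL in the src attribute is left alone because every candidate match there is followed by "> with no < in between, while the trailing URL has no > after it at all.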
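Unless you need the first subpattern (\s|^) for the URLs outside of tags as well, you can now remove that, too (and decrease the capture variables in the replacement). Be aware that without it, the second pattern can also match the www. part inside the link text that the first pattern has just produced (for example for http://www.hello.com), so keep the subpattern if your input mixes both forms.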
$content = preg_replace('$(https?://[a-z0-9_./?=&-]+)(?![^<>]*>)$i', ' <a href="$1" target="_blank">$1</a> ', $content." ");
$content = preg_replace('$(www\.[a-z0-9_./?=&-]+)(?![^<>]*>)$i', '<a href="http://$1" target="_blank">$1</a> ', $content." ");
And lastly... do you intend not to replace URLs that contain anchors at the end, e.g. www.hello.com/index.html#section1? If you missed this by accident, add the # to your allowed URL characters:
$content = preg_replace('$(https?://[a-z0-9_./?=&#-]+)(?![^<>]*>)$i', ' <a href="$1" target="_blank">$1</a> ', $content." ");
$content = preg_replace('$(www\.[a-z0-9_./?=&#-]+)(?![^<>]*>)$i', '<a href="http://$1" target="_blank">$1</a> ', $content." ");
EDIT: Also, what about + and %? There are also a few other characters that are allowed to appear in a URL without being encoded. See this. END OF EDIT
I think this should do the trick for you. However, if you could provide an example that shows working and broken URLs (with the code you have), we could actually provide solutions that are tested to work for all of your cases.
One final thought. The proper solution would be to use a DOM parser. Then you could simply apply the regex you already have only to text nodes. However, your concern for the HTML structure is very restricted, and that makes your problem regular again (as long as you do not have unmatched '<' or '>' in HTML comments or JavaScript or CSS on the page). If you do have those special cases, you should really look into a DOM parser. None of the solutions presented here (so far) will be safe in that case.
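A rough sketch of the DOM-based approach could look like the following. It assumes PHP's bundled DOM extension (DOMDocument and DOMXPath); the function name linkifyTextNodes, the wrapper id linkify-root, and the sample input are made up for illustration, and only the protocol pattern is handled. The XML encoding prefix is a commonly used hint to make loadHTML() read the input as UTF-8, not something the question requires.

```php
<?php
// Sketch only: linkify URLs that appear in text nodes, never inside tags.
function linkifyTextNodes(string $html): string
{
    $doc = new DOMDocument();
    libxml_use_internal_errors(true); // user input is rarely valid HTML
    // Wrap the fragment in a known container so it can be extracted again.
    $doc->loadHTML('<?xml encoding="utf-8"?><div id="linkify-root">' . $html . '</div>');
    libxml_clear_errors();

    $xpath = new DOMXPath($doc);
    // Text nodes under the wrapper, skipping existing links, scripts and styles.
    $nodes = $xpath->query(
        '//div[@id="linkify-root"]//text()'
        . '[not(ancestor::a) and not(ancestor::script) and not(ancestor::style)]'
    );

    // Copy the node list first, because the loop modifies the tree.
    foreach (iterator_to_array($nodes) as $node) {
        // With PREG_SPLIT_DELIM_CAPTURE, odd indexes hold the matched URLs.
        $parts = preg_split('~(https?://[a-z0-9_./?=&#-]+)~i', $node->nodeValue, -1, PREG_SPLIT_DELIM_CAPTURE);
        if (count($parts) < 2) {
            continue; // no URL in this text node
        }
        $fragment = $doc->createDocumentFragment();
        foreach ($parts as $i => $part) {
            if ($i % 2 === 1) {
                $a = $doc->createElement('a');
                $a->setAttribute('href', $part);
                $a->setAttribute('target', '_blank');
                $a->appendChild($doc->createTextNode($part));
                $fragment->appendChild($a);
            } elseif ($part !== '') {
                $fragment->appendChild($doc->createTextNode($part));
            }
        }
        $node->parentNode->replaceChild($fragment, $node);
    }

    // Serialize only the children of the wrapper, not the whole document.
    $root = $xpath->query('//div[@id="linkify-root"]')->item(0);
    $out = '';
    foreach ($root->childNodes as $child) {
        $out .= $doc->saveHTML($child);
    }
    return $out;
}

echo linkifyTextNodes('<img src="http://example.com/a.png"> see http://example.com/page'), "\n";
```

Because the regex only ever sees the contents of text nodes, attribute values, comments, scripts and styles are never touched. The trade-off is that the parser may normalize the markup it is given.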
- In my opinion, a URL is everything that starts with https?:// and ends with a space or the end of the line (vertical whitespace, the so-called newline).
- Because of the first point, images, links etc. will not be replaced, because their URLs are preceded by " or > (unless the link starts with a space, as in <a href=" http...">, but that is invalid HTML).
- The m modifier makes ^ and $ match at the start and end of every line (so that the matching described in the first point works).
- The nl2br() function should be called after the replacement (because of links that start at the beginning of a line).
- The spaces before and after are added only if they originally exist in $content (see $1 and $3 in the second parameter of the preg_replace() function).
- This solution supports domain names with special characters, like www.moški.si.
Input:
Code:
<?php
$content = preg_replace(
    '~(\s|^)(https?://.+?)(\s|$)~im',
    '$1<a href="$2" target="_blank">$2</a>$3',
    $content
);
$content = preg_replace(
    '~(\s|^)(www\..+?)(\s|$)~im',
    '$1<a href="http://$2" target="_blank">$2</a>$3',
    $content
);
$content = nl2br($content);
Output:
Edit:
An example that produces link text without the https?:// prefix, using a single preg_replace() call (patterns and replacements are arrays):
$content = preg_replace(
    array(
        '~(\s|^)(www\..+?)(\s|$)~im',
        '~(\s|^)(https?://)(.+?)(\s|$)~im',
    ),
    array(
        '$1http://$2$3',
        '$1<a href="$2$3" target="_blank">$3</a>$4',
    ),
    $content
);
$content = nl2br($content);
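As an illustration of how the two patterns chain, here is a small sketch that wraps the array-based call in a function. The name linkifyLines and the sample strings are hypothetical; the patterns and replacements are the ones from the answer. preg_replace() applies the patterns in order, so the first one only prepends http:// and the second one builds the link.

```php
<?php
// Hypothetical wrapper around the array-based call, for illustration only.
function linkifyLines(string $content): string
{
    $content = preg_replace(
        array(
            '~(\s|^)(www\..+?)(\s|$)~im',       // step 1: add the missing protocol
            '~(\s|^)(https?://)(.+?)(\s|$)~im', // step 2: wrap the URL in a link
        ),
        array(
            '$1http://$2$3',
            '$1<a href="$2$3" target="_blank">$3</a>$4',
        ),
        $content
    );
    return nl2br($content);
}

echo linkifyLines('visit www.hello.com now'), "\n";
// visit <a href="http://www.hello.com" target="_blank">www.hello.com</a> now

echo linkifyLines('<img src="http://example.com/a.png">'), "\n";
// <img src="http://example.com/a.png">
```

The image tag is left alone because its URL is preceded by " rather than by whitespace or the start of a line.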
Let me suggest something less straightforward: split the input text into HTML and non-HTML parts, process the non-HTML parts with your regex, then combine the text back into one piece. Something like:
<?php
$chunks = preg_split('/(<.*>)/Ums', $_POST['content'], -1, PREG_SPLIT_DELIM_CAPTURE | PREG_SPLIT_NO_EMPTY);
$result = '';
foreach ($chunks as $chunk) {
    if (substr($chunk, 0, 1) != '<') {
        /* do your processing on $chunk */
    }
    $result .= $chunk;
}
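For completeness, here is one way the placeholder could be filled in. This is a sketch under assumptions: the function name linkifyOutsideTags and the sample input are made up, and the URL pattern is the simplified protocol pattern from the question with # added.

```php
<?php
// Sketch: split into tag and non-tag chunks, linkify only the non-tag chunks.
function linkifyOutsideTags(string $html): string
{
    // The U modifier makes .* lazy, so every <...> becomes its own chunk.
    $chunks = preg_split('/(<.*>)/Ums', $html, -1, PREG_SPLIT_DELIM_CAPTURE | PREG_SPLIT_NO_EMPTY);
    $result = '';
    foreach ($chunks as $chunk) {
        if (substr($chunk, 0, 1) != '<') {
            // Plain text chunk: safe to apply the URL regex here.
            $chunk = preg_replace(
                '~(https?://[a-z0-9_./?=&#-]+)~i',
                '<a href="$1" target="_blank">$1</a>',
                $chunk
            );
        }
        $result .= $chunk;
    }
    return $result;
}

echo linkifyOutsideTags('<img src="http://example.com/a.png"> see http://example.com/page'), "\n";
// <img src="http://example.com/a.png"> see <a href="http://example.com/page" target="_blank">http://example.com/page</a>
```

One limitation of this sketch: text between an existing <a ...> and </a> is still a non-tag chunk, so a URL used as the link text of an existing link would be wrapped a second time.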
Some additional advice:
- try to save the source text and do the transformation when displaying it. This will allow you to improve or fix your rendering code if you find a new problem or idea in the future.
- (https?://)+ shouldn't be in brackets and you don't need the +, because it also matches "https://https://some.com"; just use https?://[a-z0-9_./?=&-]+
- the same goes for (www.)+ :)
This has been done hundreds of times before. On this page, both m-buettner's and glavić's answers work fine, although I like glavić's shorter expression.
Here's a good php resource to do it:
http://code.iamcal.com/php/lib_autolink/
Duplicates on Stack Overflow:
- How do I linkify urls in a string with php?
- PHP Linkify Links In Content
Decent in-depth article:
- http://buildinternet.com/2010/05/how-to-automatically-linkify-text-with-php-regular-expressions/