I wrote this function to convert URLs on one specific domain (mywebsite.com) into links and to replace all other URLs with @@@spam@@@.
function get_global_convert_all_urls($content) {
    $content = strtolower($content);
    $replace = "/(?:http|https)?(?:\:\/\/)?(?:www.)?(([A-Za-z0-9-]+\.)*[A-Za-z0-9-]+\.[A-Za-z]+)(?:\/.*)?/im";
    preg_match_all($replace, $content, $search);
    $total = count($search[0]);
    for($i=0; $i < $total; $i++) {
        $url = $search[0][$i];
        if(preg_match('/mywebsite.com/i', $url)) {
            $content = str_replace($url, '<a href="'.$url.'">'.$url.'</a>', $content);
        } else {
            $content = str_replace($url, '@@@spam@@@', $content);
        }
    }
    return $content;
}
The only problem I can't solve is that the regex does not stop at a space when there are two URLs on one line.
$content = "http://www.mywebsite.com/index.html http://www.others.com/index.html";
Result:
<a href="http://www.mywebsite.com/index.html http://www.others.com/index.html">http://www.mywebsite.com/index.html http://www.others.com/index.html</a>
How can I get the result below?
<a href="http://www.mywebsite.com/index.html">http://www.mywebsite.com/index.html</a> @@@spam@@@
I tried adding (\s|$) to the end of the regex, but no luck:
/(?:http|https)?(?:\:\/\/)?(?:www.)?(([A-Za-z0-9-]+\.)*[A-Za-z0-9-]+\.[A-Za-z]+)(?:\/.*)?(\s|$)/im
Edited based on the change in your question.
The problem is the .* at the end of your regex: it is greedy and also matches spaces, so it runs on to the end of the line. My suggestion is to replace it with a more precise expression. I cooked this up real quick, so you'll want to run some tests to verify your cases. =)
$matches = null;
$returnValue = preg_match_all('!(?:http|https)?(?:\\:\\/\\/)?(?:www.)?(([A-Za-z0-9-]+\\.)*[A-Za-z0-9-]+\\.[A-Za-z]+)(:[a-zA-Z0-9]*)?/?([a-zA-Z0-9\\-\\._\\?\\,\\\'/\\\\\\+&%\\$#\\=~])*[^\\.\\,\\)\\(]!', 'mywebsite.com/index.html others.com/index.html', $matches);
Results in:
array (
  0 =>
  array (
    0 => 'mywebsite.com/index.html ',
    1 => 'others.com/index.html',
  ),
  1 =>
  array (
    0 => 'mywebsite.com',
    1 => 'others.com',
  ),
  2 =>
  array (
    0 => '',
    1 => '',
  ),
  3 =>
  array (
    0 => '',
    1 => '',
  ),
  4 =>
  array (
    0 => 'l',
    1 => 'm',
  ),
)
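Note that the first full match in the dump above ends with a space: the final character class [^\.\,\)\(] accepts any character other than those four, whitespace included. A minimal sketch of one way to deal with that, assuming you only need the full matches, is to trim them afterwards. The names $pattern, $input and $urls are made up for this illustration; the pattern itself is copied verbatim from above.

```php
<?php
// Pattern copied verbatim from the answer above.
$pattern = '!(?:http|https)?(?:\\:\\/\\/)?(?:www.)?(([A-Za-z0-9-]+\\.)*[A-Za-z0-9-]+\\.[A-Za-z]+)(:[a-zA-Z0-9]*)?/?([a-zA-Z0-9\\-\\._\\?\\,\\\'/\\\\\\+&%\\$#\\=~])*[^\\.\\,\\)\\(]!';
$input = 'mywebsite.com/index.html others.com/index.html';

preg_match_all($pattern, $input, $matches);

// The trailing character class can consume a space, so strip
// surrounding whitespace from each full match.
$urls = array_map('trim', $matches[0]);

var_export($urls);
```

With the sample input this should leave 'mywebsite.com/index.html' and 'others.com/index.html' without the trailing space. Trimming only hides the symptom, so tightening the last character class may be the cleaner fix.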
Change the last element of the regex, (?:\/.*)?, into \S*.
Your regex matches every character up to the end of the line, including spaces, whereas \S* matches only characters that are not whitespace.
You could also simplify the whole regex to:
$replace = "~(?:https?://)?(?:www\.)?(([A-Z0-9-]+\.)*[A-Z0-9-]+\.[A-Z]+)\S*~im";
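To see the difference between the two tails, here is a small sketch that runs both variants against the sample input from the question. The variable names ($greedy, $bounded and so on) are only for illustration, and the greedy pattern is the simplified regex with the original (?:\/.*)? tail put back, written as (?:/.*)? because the delimiter is ~.

```php
<?php
// Sample input from the question: two URLs on one line.
$content = 'http://www.mywebsite.com/index.html http://www.others.com/index.html';

// Original tail: .* is greedy and matches spaces, so it runs to the end of the line.
$greedy = '~(?:https?://)?(?:www\.)?(([A-Z0-9-]+\.)*[A-Z0-9-]+\.[A-Z]+)(?:/.*)?~im';

// Suggested tail: \S* stops at the first whitespace character.
$bounded = '~(?:https?://)?(?:www\.)?(([A-Z0-9-]+\.)*[A-Z0-9-]+\.[A-Z]+)\S*~im';

preg_match_all($greedy, $content, $greedyMatches);
preg_match_all($bounded, $content, $boundedMatches);

print_r($greedyMatches[0]);  // a single match covering the whole line
print_r($boundedMatches[0]); // two matches, one per URL
```

The greedy variant returns the whole line as one match, which is exactly the behaviour described in the question; the \S* variant returns each URL separately.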
Change the regexp pattern so that it also captures the last URL section (/index.html, /index.php) and stops there:
/(?:http|https)?(?:\:\/\/)?(?:www\.)?(([A-Za-z0-9-]+?\.)?[A-Za-z0-9-]+?\.?[A-Za-z]*?(\/\w+?\.\w+?)?)\b/im
Change your function content as shown below:
$content = "http://www.mywebsite.com/index.html http://www.others.com/index.html";
function get_global_convert_all_urls($content) {
    // Note: this lowercases the entire input, not only the URLs.
    $content = strtolower($content);
    // (\/\w+?\.\w+?)? matches an optional trailing path such as /index.html;
    // \b ends the match at a word boundary. The dot after www is escaped.
    $replace = "/(?:http|https)?(?:\:\/\/)?(?:www\.)?(([A-Za-z0-9-]+?\.)?[A-Za-z0-9-]+?\.?[A-Za-z]*?(\/\w+?\.\w+?)?)\b/im";
    preg_match_all($replace, $content, $search);
    foreach ($search[0] as $url) {
        // Escaped dot, so only a literal "mywebsite.com" qualifies.
        if (preg_match('/mywebsite\.com/i', $url)) {
            $content = str_replace($url, '<a href="'.$url.'">'.$url.'</a>', $content);
        } else {
            $content = str_replace($url, '@@@spam@@@', $content);
        }
    }
    return $content;
}
var_dump(get_global_convert_all_urls($content));
The output:
string '<a href="http://www.mywebsite.com/index.html">http://www.mywebsite.com/index.html</a> @@@spam@@@'
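As an alternative to the match-then-str_replace loop, here is a hedged sketch that does the classification in a single pass with preg_replace_callback. It is not taken from any of the answers: the function name convert_urls_single_pass is made up, the pattern is the simplified \S* regex suggested earlier in the thread, and the host check (the domain itself or one of its subdomains) is an assumption about what should count as "your" site.

```php
<?php
// Sketch: replace every URL exactly once, deciding per match inside the callback.
function convert_urls_single_pass(string $content): string
{
    // Simplified pattern from earlier in the thread; \S* ends the match at whitespace.
    $pattern = '~(?:https?://)?(?:www\.)?(([A-Z0-9-]+\.)*[A-Z0-9-]+\.[A-Z]+)\S*~i';

    $result = preg_replace_callback($pattern, function (array $m): string {
        // $m[1] is the host without scheme or "www."; anchor the check so that
        // hosts like mywebsite.com.example.org do not pass.
        if (preg_match('/(?:^|\.)mywebsite\.com$/i', $m[1])) {
            $url = htmlspecialchars($m[0], ENT_QUOTES);
            return '<a href="' . $url . '">' . $url . '</a>';
        }
        return '@@@spam@@@';
    }, $content);

    // preg_replace_callback returns null on a regex error; fall back to the input.
    return $result ?? $content;
}

echo convert_urls_single_pass(
    'http://www.mywebsite.com/index.html http://www.others.com/index.html'
);
```

For the sample input this should print the linked mywebsite.com URL followed by @@@spam@@@. Because each match is replaced where it was found, a URL that happens to be a substring of another URL cannot be rewritten twice, and the rest of the text keeps its original case since there is no strtolower call.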