PHP regex match specific URL and strip others

Posted 2019-09-10 01:29

I wrote this function to convert all URLs on one specific domain (mywebsite.com) into links and to replace every other URL with @@@spam@@@.

function get_global_convert_all_urls($content) {
  $content = strtolower($content);
  $replace = "/(?:http|https)?(?:\:\/\/)?(?:www.)?(([A-Za-z0-9-]+\.)*[A-Za-z0-9-]+\.[A-Za-z]+)(?:\/.*)?/im";
  preg_match_all($replace, $content, $search);
  $total = count($search[0]);
  for($i=0; $i < $total; $i++) {
    $url = $search[0][$i];
    if(preg_match('/mywebsite.com/i', $url)) {
      $content = str_replace($url, '<a href="'.$url.'">'.$url.'</a>', $content);
    } else {
      $content = str_replace($url, '@@@spam@@@', $content);
    }
  }

  return $content;
}

The only problem I can't solve is that the regex does not stop at a space when two URLs appear on one line.

$content = "http://www.mywebsite.com/index.html http://www.others.com/index.html";

Result:

<a href="http://www.mywebsite.com/index.html http://www.others.com/index.html">http://www.mywebsite.com/index.html http://www.others.com/index.html</a>

How can I get the result below?

<a href="http://www.mywebsite.com/index.html">http://www.mywebsite.com/index.html</a> @@@spam@@@   

I tried adding (\s|$) at the end of the regex, but no luck:

/(?:http|https)?(?:\:\/\/)?(?:www.)?(([A-Za-z0-9-]+\.)*[A-Za-z0-9-]+\.[A-Za-z]+)(?:\/.*)?(\s|$)/im

3 Answers
时光不老,我们不散
#2 · 2019-09-10 01:59

Change the regex pattern so that it captures only the last URL segment (/index.html, /index.php).

/(?:http|https)?(?:\:\/\/)?(?:www.)?(([A-Za-z0-9-]+?\.)?[A-Za-z0-9-]+?\.?[A-Za-z]*?(\/\w+?\.\w+?)?)\b/im

Change your function content as shown below:

$content = "http://www.mywebsite.com/index.html http://www.others.com/index.html";

function get_global_convert_all_urls($content) {
  $content = strtolower($content);
  $replace = "/(?:http|https)?(?:\:\/\/)?(?:www.)?(([A-Za-z0-9-]+?\.)?[A-Za-z0-9-]+?\.?[A-Za-z]*?(\/\w+?\.\w+?)?)\b/im";
  preg_match_all($replace, $content, $search);

  foreach ($search[0] as $url) {
    if(preg_match('/mywebsite.com/i', $url)) {
      $content = str_replace($url, '<a href="'.$url.'">'.$url.'</a>', $content);         
    } else {
      $content = str_replace($url, '@@@spam@@@', $content); 
    }
  } 

  return $content;
}

var_dump(get_global_convert_all_urls($content)); 

The output:

string '<a href="http://www.mywebsite.com/index.html">http://www.mywebsite.com/index.html</a> @@@spam@@@'
成全新的幸福
#3 · 2019-09-10 02:11

Edited based on change in your question.

The problem is the .* at the end of your regex, so my suggestion is to replace it with a more precise expression. I cooked this up real quick, so you'll want to run some tests to verify your cases. =)

$matches = null;
$returnValue = preg_match_all('!(?:http|https)?(?:\\:\\/\\/)?(?:www.)?(([A-Za-z0-9-]+\\.)*[A-Za-z0-9-]+\\.[A-Za-z]+)(:[a-zA-Z0-9]*)?/?([a-zA-Z0-9\\-\\._\\?\\,\\\'/\\\\\\+&%\\$#\\=~])*[^\\.\\,\\)\\(]!', 'mywebsite.com/index.html others.com/index.html', $matches);

Results in:

array (
  0 => 
  array (
    0 => 'mywebsite.com/index.html ',
    1 => 'others.com/index.html',
  ),
  1 => 
  array (
    0 => 'mywebsite.com',
    1 => 'others.com',
  ),
  2 => 
  array (
    0 => '',
    1 => '',
  ),
  3 => 
  array (
    0 => '',
    1 => '',
  ),
  4 => 
  array (
    0 => 'l',
    1 => 'm',
  ),
)
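Note that the first full match above carries a trailing space, because the final character class [^\.\,\)\(] accepts any character other than those four. When running the tests the answer recommends, a small harness along these lines might help. It is only a sketch: extract_urls is a hypothetical helper name, the pattern is copied unchanged from the answer, and trim() only removes surrounding whitespace, not other stray trailing characters.

```php
<?php
// Pattern copied verbatim from the answer above (single-quoted PHP string).
$pattern = '!(?:http|https)?(?:\\:\\/\\/)?(?:www.)?(([A-Za-z0-9-]+\\.)*[A-Za-z0-9-]+\\.[A-Za-z]+)(:[a-zA-Z0-9]*)?/?([a-zA-Z0-9\\-\\._\\?\\,\\\'/\\\\\\+&%\\$#\\=~])*[^\\.\\,\\)\\(]!';

// Hypothetical helper: return every full match with surrounding whitespace removed.
function extract_urls(string $pattern, string $text): array
{
    preg_match_all($pattern, $text, $matches);
    return array_map('trim', $matches[0]);
}

// Print the cleaned matches for the sample input used in the answer.
print_r(extract_urls($pattern, 'mywebsite.com/index.html others.com/index.html'));
```

Adding more sample strings to the call is an easy way to check how the pattern behaves on your own cases.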
叛逆
#4 · 2019-09-10 02:15

Change the last element of the regex (?:\/.*)? into \S*.

Your regex matches every character up to the end of the line, including spaces, whereas \S* matches only non-whitespace characters and therefore stops at the first space.
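To see the difference, here is a minimal sketch that runs the question's original pattern and a copy with only the tail changed to \S* against the sample input. The variable names are illustrative, not part of the original code.

```php
<?php
$content = "http://www.mywebsite.com/index.html http://www.others.com/index.html";

// Pattern from the question: the (?:\/.*)? tail runs across the space.
$original = '/(?:http|https)?(?:\:\/\/)?(?:www.)?(([A-Za-z0-9-]+\.)*[A-Za-z0-9-]+\.[A-Za-z]+)(?:\/.*)?/im';
// Same pattern with only the tail replaced by \S*, which stops at whitespace.
$fixed = '/(?:http|https)?(?:\:\/\/)?(?:www.)?(([A-Za-z0-9-]+\.)*[A-Za-z0-9-]+\.[A-Za-z]+)\S*/im';

preg_match_all($original, $content, $originalMatches);
preg_match_all($fixed, $content, $fixedMatches);

echo count($originalMatches[0]), "\n"; // 1: both URLs swallowed as one match
echo count($fixedMatches[0]), "\n";    // 2: one match per URL
```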

You could also simplify the whole regex to:

$replace = "~(?:https?://)?(?:www\.)?(([A-Z0-9-]+\.)*[A-Z0-9-]+\.[A-Z]+)\S*~im";
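For completeness, here is one way the whole function might look with the simplified pattern. This is a sketch under assumptions, not the original author's code: convert_urls and $allowedDomain are hypothetical names, it uses preg_replace_callback instead of the preg_match_all plus str_replace loop so each match is replaced exactly once, it compares the captured host against the allowed domain rather than searching the whole URL, and it does not lowercase the rest of the text.

```php
<?php
// Hypothetical rewrite of the question's function; names are illustrative.
function convert_urls(string $content, string $allowedDomain = 'mywebsite.com'): string
{
    // Simplified pattern from the answer; group 1 is the host without "www.".
    $pattern = '~(?:https?://)?(?:www\.)?((?:[A-Z0-9-]+\.)*[A-Z0-9-]+\.[A-Z]+)\S*~i';

    $result = preg_replace_callback($pattern, function (array $m) use ($allowedDomain): string {
        $host = strtolower($m[1]);
        $suffix = '.' . $allowedDomain;
        // Allow the domain itself and any of its subdomains.
        $allowed = $host === $allowedDomain
            || substr($host, -strlen($suffix)) === $suffix;
        if (!$allowed) {
            return '@@@spam@@@';
        }
        // Escape the URL before placing it in HTML.
        $url = htmlspecialchars($m[0], ENT_QUOTES);
        return '<a href="' . $url . '">' . $url . '</a>';
    }, $content);

    // preg_replace_callback returns null on a regex error; fall back to the input.
    return $result ?? $content;
}

$content = "http://www.mywebsite.com/index.html http://www.others.com/index.html";
echo convert_urls($content), "\n";
// <a href="http://www.mywebsite.com/index.html">http://www.mywebsite.com/index.html</a> @@@spam@@@
```

Checking the host rather than the full URL means something like evil.com/mywebsite.com is treated as spam. As in the original function, a URL written without a scheme still ends up with a scheme-less href.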