I am trying to write a RegEx rule to find all a href HTML links on my webpage and add a 'rel="nofollow"' to them.
However, I have a list of URLs that must be excluded. For example, any internal link (e.g. pokerdiy.com) should be treated as a wildcard, so that any link containing my domain name is left alone. I also want to be able to specify exact URLs in the exclude list, for example http://www.example.com/link.aspx.
Here is what I have so far, which is not working:
(<a[^>]+)(href="http://.*?(?!(pokerdiy))[^>]+>)
If you need more background/info you can see the full thread and requirements here (skip the top part to get to the meat): http://www.snapsis.com/Support/tabid/601/aff/9/aft/13117/afv/topic/afpgj/1/Default.aspx#14737
I've developed a slightly more robust version that can detect whether the anchor tag already has "rel=" in it, so that attributes are not duplicated.
Matches
But doesn't match
Replace using
Hope this helps someone!
James
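As a rough illustration of the exclusion technique discussed in the next answer, here is a minimal Python sketch. It is a reconstruction from that answer's description, not the poster's original expression. The names EXCLUDED, LINK_RE and add_nofollow are hypothetical, and the sketch assumes double-quoted href values.

```python
import re

# Hypothetical exclude list: one wildcard domain and one exact URL.
EXCLUDED = ["pokerdiy.com", "www.example.com/link.aspx"]
excluded = "|".join(re.escape(item) for item in EXCLUDED)

# Group 1: everything from "<a" up to and including href="
# Group 2: an http(s) URL; each character is consumed only if no
#          excluded string can be matched starting at that position.
LINK_RE = re.compile(
    r'(<a\s[^>]*href=")'
    r'(https?://(?:(?!\b(?:' + excluded + r')\b)[^"])+)"',
    re.IGNORECASE,
)


def add_nofollow(html: str) -> str:
    """Append rel="nofollow" after the href of every non-excluded link."""
    return LINK_RE.sub(r'\1\2" rel="nofollow"', html)


if __name__ == "__main__":
    print(add_nofollow('<a href="http://other.com/page">external</a>'))
    print(add_nofollow('<a href="http://www.pokerdiy.com/forum">internal</a>'))
```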
would match the first part of any link that starts with http:// or https:// and doesn't contain pokerdiy.com or www.example.com/link.aspx anywhere in the href attribute. Replace that by

If a rel="nofollow" is already present, you'll end up with two of these. And of course, relative links or other protocols like ftp:// etc. won't be matched at all.

Explanation:
(?!\b(foo|bar)\b)[^"] matches any non-" character unless it is possible to match foo or bar at the current location. The \bs are there to make sure we don't accidentally trigger on rebar or foonly.

This whole construct is repeated ((?: ... )+), and whatever is matched is preserved in backreference \2.

Since the next token to be matched is a ", the entire regex fails if the attribute contains foo or bar anywhere.

An improvement to James' regex:
This regex will match links NOT in the string array $follow_list. The strings don't need a leading 'www'. :) The advantage is that this regex will preserve other attributes in the tag (like target, style, title...). If a rel attribute already exists in the tag, the regex will NOT match, so you can force follows on URLs not in $follow_list.

Replace with:
Full example (PHP):
If you want to overwrite rel no matter what, I would use a preg_replace_callback approach where in the callback the rel attribute is replaced separately:
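A minimal sketch of that callback idea, written in Python, where re.sub with a function replacement plays the role of PHP's preg_replace_callback. It is an illustration under assumptions rather than the poster's code. FOLLOW_LIST, overwrite_rel and the helper patterns are hypothetical names, the follow-list check is a plain substring test, and href values are assumed to be double-quoted.

```python
import re

# Hypothetical list of URL fragments whose links are left untouched.
FOLLOW_LIST = ["pokerdiy.com", "example.com/link.aspx"]

A_TAG_RE = re.compile(r'<a\s[^>]*>', re.IGNORECASE)
HREF_RE = re.compile(r'\shref\s*=\s*"(https?://[^"]*)"', re.IGNORECASE)
REL_RE = re.compile(r'\s+rel\s*=\s*("[^"]*"|\'[^\']*\')', re.IGNORECASE)


def _force_nofollow(match: re.Match) -> str:
    """Callback: rewrite one opening <a ...> tag."""
    tag = match.group(0)
    href = HREF_RE.search(tag)
    if href is None:
        return tag  # no absolute http(s) href, e.g. a relative link
    url = href.group(1).lower()
    if any(item in url for item in FOLLOW_LIST):
        return tag  # URL is on the follow list, leave the tag alone
    tag = REL_RE.sub("", tag)  # drop any existing rel attribute
    return tag[:-1] + ' rel="nofollow">'  # re-add it before the closing ">"


def overwrite_rel(html: str) -> str:
    """Force rel="nofollow" on external links, replacing any existing rel."""
    return A_TAG_RE.sub(_force_nofollow, html)


if __name__ == "__main__":
    sample = '<a rel="external" href="http://other.com/" target="_blank">x</a>'
    print(overwrite_rel(sample))
```

Handling rel in the callback keeps the matching pattern simple: the tag is matched once, and the existing rel attribute is removed and re-added in ordinary code, so other attributes such as target or title are preserved.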