I need your help here.
I want to turn this:
sometext sometext http://www.somedomain.com/index.html sometext sometext
into:
sometext sometext <a href="http://www.somedomain.com/index.html">www.somedomain.com/index.html</a> sometext sometext
I have managed it by using this regex:
preg_replace("#((http|https|ftp)://(\S*?\.\S*?))(\s|\;|\)|\]|\[|\{|\}|,|\"|'|:|\<|$|\.\s)#ie", "'<a href=\"$1\" target=\"_blank\">$1</a>$4'", $text);
The problem is that it also replaces the URL inside an img tag, for example:
sometext sometext <img src="http://domain.com/image.jpg"> sometext sometext
is turned into:
sometext sometext <img src="<a href="http://domain.com/image.jpg">domain.com/image.jpg</a>"> sometext sometext
Please help.
If you'd like to keep using a regex (and in this case, a regex is quite appropriate), you can have the regex match only URLs that "stand alone". A word boundary escape sequence (`\b`) is not quite enough for this, because a word boundary also exists between a quote character and `http`. A negative lookbehind such as `(?<!\S)` makes the regex match only where `http` is immediately preceded by whitespace or the beginning of the text.

Thus, `"http://..."` won't match, but `http://` as its own word will.

You can try my code from this question:
If you wanna turn some other tags - that's easy enough:
DomDocument is more mature and runs much faster, so it's just an alternative if someone wants to use PHP Simple HTML DOM Parser:
You shouldn’t do that with regular expressions, at least not with regular expressions only. Use a proper HTML DOM parser like the one in PHP’s DOM library instead. You can then iterate the nodes, check whether each one is a text node, run the regular expression search on it, and replace the text node appropriately.
Something like this should do it:
Ok, since the DOMNodeLists of `getElementsByTagName` and `childNodes` are live, every change in the DOM is reflected in those lists, and thus you cannot use `foreach`, as that would also iterate the newly added nodes. Instead, you need to use `for` loops, keep track of the elements added so that the index pointers advance accordingly, and ideally adjust the pre-calculated array boundaries as well.

But since that is quite difficult in such a somewhat complex algorithm (you would need one index pointer and array boundary for each of the three `for` loops), using a recursive algorithm is more convenient:

Here `mapOntoTextNodes` is used to map a given callback function onto every DOMText node in a DOM document. You can either pass the whole DOMDocument node or just a specific DOMNode (in this case just the `BODY` node).

The function `foo` is then used to find and replace the plain URLs in the DOMText node’s content by splitting the content string into non-URL/URL parts using `preg_split` while capturing the used delimiter, resulting in an array of 1+2·n items. Then the non-URL parts are replaced by new DOMText nodes and the URL parts are replaced by new `A` elements, which are inserted before the original DOMText node, which is then removed at the end. Since `mapOntoTextNodes` walks recursively, it suffices to just call that function on a specific DOMNode.

Streamlined version of Gumbo's above:
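A minimal sketch of such a recursive walk, not the posters' original listing. `linkifyTextNode` is a hypothetical stand-in for `foo`, the URL pattern is a simplified assumption, and skipping existing `a` elements is an added assumption. Instead of index bookkeeping, the live `childNodes` list is copied to an array before the tree is changed.

```php
<?php
// Apply $callback to every text node below $node.
function mapOntoTextNodes(DOMNode $node, callable $callback): void
{
    if ($node instanceof DOMText) {
        $callback($node);
        return;
    }
    // Assumption: text inside existing links should stay untouched.
    if ($node instanceof DOMElement && strtolower($node->nodeName) === 'a') {
        return;
    }
    // Snapshot the live list so nodes added by the callback are not revisited.
    foreach (iterator_to_array($node->childNodes) as $child) {
        mapOntoTextNodes($child, $callback);
    }
}

// Hypothetical callback: split the text into non-URL/URL parts and rebuild it.
function linkifyTextNode(DOMText $textNode): void
{
    $parts = preg_split(
        '~((?:https?|ftp)://[^\s<>"\']+)~i',
        $textNode->nodeValue,
        -1,
        PREG_SPLIT_DELIM_CAPTURE
    );
    if (count($parts) < 3) {
        return; // no URL in this text node
    }
    $doc = $textNode->ownerDocument;
    $parent = $textNode->parentNode;
    foreach ($parts as $i => $part) {
        if ($i % 2 === 1) {
            // Odd indexes hold the captured URLs.
            $new = $doc->createElement('a');
            $new->setAttribute('href', $part);
            $new->appendChild($doc->createTextNode($part));
        } elseif ($part !== '') {
            $new = $doc->createTextNode($part);
        } else {
            continue;
        }
        $parent->insertBefore($new, $textNode);
    }
    $parent->removeChild($textNode);
}

$dom = new DOMDocument();
$dom->loadHTML(
    '<html><body><p>sometext http://www.somedomain.com/index.html sometext '
    . '<img src="http://domain.com/image.jpg"></p></body></html>'
);
mapOntoTextNodes($dom->getElementsByTagName('body')->item(0), 'linkifyTextNode');
$result = $dom->saveHTML($dom->getElementsByTagName('p')->item(0));
```

Because only text nodes are rewritten, the `src` attribute of the `img` element is never seen by the regex.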
Let's use an XPath that only fetches text nodes containing http://, https:// or ftp:// that are not themselves text nodes of anchor elements.
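A minimal sketch of such a query, assuming made-up sample markup. The three `contains` calls cover the three schemes, and `not(ancestor::a)` excludes text that already sits inside a link.

```php
<?php
// Sample markup is an assumption for illustration only.
$html = '<html><body><p>see http://example.com/page and '
      . '<a href="http://example.com/linked">http://example.com/linked</a> '
      . '<img src="http://example.com/image.jpg"></p></body></html>';

$dom = new DOMDocument();
$dom->loadHTML($html);
$xpath = new DOMXPath($dom);

// Text nodes that contain a URL scheme and have no <a> ancestor.
$query = '//text()[not(ancestor::a) and ('
       . 'contains(., "http://") or contains(., "https://") or contains(., "ftp://"))]';

$urlTextNodes = [];
foreach ($xpath->query($query) as $textNode) {
    $urlTextNodes[] = $textNode->nodeValue;
}
```

Attribute values such as the `src` of the `img` element are not text nodes, so the query never returns them.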
The XPath above will give us a TextNode with the following data:
Since PHP 5.3, we could also use PHP functions inside the XPath and select our nodes with the regex pattern instead of the three calls to contains.
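A sketch of that variant, assuming the same kind of sample markup and a simplified pattern. The result of `preg_match` is compared with 0 explicitly rather than relying on how its return value is converted to an XPath boolean.

```php
<?php
$dom = new DOMDocument();
$dom->loadHTML(
    '<html><body><p>see http://example.com/page or '
    . '<a href="http://example.com/x">http://example.com/x</a></p></body></html>'
);

$xpath = new DOMXPath($dom);
// Expose PHP functions to XPath under the php: prefix; only preg_match is allowed here.
$xpath->registerNamespace('php', 'http://php.net/xpath');
$xpath->registerPhpFunctions('preg_match');

// functionString passes the context node (.) to preg_match as a string.
$query = '//text()[not(ancestor::a) and '
       . 'php:functionString("preg_match", "~(?:https?|ftp)://~i", .) > 0]';

$matched = [];
foreach ($xpath->query($query) as $textNode) {
    $matched[] = $textNode->nodeValue;
}
```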
Instead of splitting the text nodes apart in the standards-compliant way, we will use a document fragment and just replace the entire text node with the fragment. Non-standard in this case only means that the method we will be using for this is not part of the W3C specification of the DOM API.
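A minimal sketch of the fragment approach, assuming sample markup and a simplified URL pattern. The non-standard method is `DOMDocumentFragment::appendXML`. The text is escaped first so that the string handed to `appendXML` is well-formed XML.

```php
<?php
$dom = new DOMDocument();
$dom->loadHTML('<html><body><p>see http://example.com/page now</p></body></html>');
$xpath = new DOMXPath($dom);

$textNodes = $xpath->query('//text()[not(ancestor::a) and contains(., "http://")]');

foreach (iterator_to_array($textNodes) as $textNode) {
    // Escape &, < and > so the text stays valid once markup is added around the URLs.
    $escaped = htmlspecialchars($textNode->nodeValue, ENT_NOQUOTES | ENT_XML1, 'UTF-8');
    $markup = preg_replace(
        '~((?:https?|ftp)://[^\s<>"\']+)~i',
        '<a href="$1">$1</a>',
        $escaped
    );

    // Build a fragment from the markup and swap it in for the whole text node.
    $fragment = $dom->createDocumentFragment();
    $fragment->appendXML($markup);
    $textNode->parentNode->replaceChild($fragment, $textNode);
}

$result = $dom->saveHTML($dom->getElementsByTagName('p')->item(0));
```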
and this will then output:
Match a whitespace character (`\s`) at the start and end of the URL string. This ensures that a URL enclosed in quotes, such as an attribute value, is not matched, while a URL that stands on its own is matched.
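A minimal sketch of that idea, using lookarounds so that the surrounding whitespace is checked but not consumed. The function name `linkStandaloneUrls` is hypothetical, and the `e` modifier from the question is dropped because it was removed in PHP 7.

```php
<?php
// Hypothetical helper: link only URLs with whitespace (or the string
// boundary) on both sides, so quoted attribute values are left alone.
function linkStandaloneUrls(string $text): string
{
    return preg_replace(
        '#(?<!\S)((?:https?|ftp)://\S+)(?!\S)#i',
        '<a href="$1" target="_blank">$1</a>',
        $text
    );
}

$input = 'sometext http://www.somedomain.com/index.html sometext '
       . '<img src="http://domain.com/image.jpg"> sometext';
$output = linkStandaloneUrls($input);
```

One limitation of this sketch: punctuation or a tag directly attached to a URL, with no whitespace in between, becomes part of the match.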