I have built this regex:
((https?|ftps?):\/\/[^"<\s]+)(?![^<>]*?>|[^<>]*?<\/)
The first group captures all links in the HTML. The negative lookahead that follows it excludes any link that sits inside a tag as an attribute value, and any link that sits inside an element as its content.
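To make the behaviour concrete, here is a minimal Python sketch that runs this pattern with the standard re module. The sample markup and the example.com URLs are made up for illustration and are not taken from the regex101 example.

```python
import re

# The pattern from the question, unchanged.
pattern = re.compile(r'((https?|ftps?):\/\/[^"<\s]+)(?![^<>]*?>|[^<>]*?<\/)')

# Hypothetical sample markup (an assumption, for illustration only).
html = (
    'Visit http://example.com/a <br> '
    '<a href="http://example.com/b">http://example.com/c</a> '
    '<p>http://example.com/d</p>'
)

# Group 1 holds the whole URL; group 2 is only the scheme.
matches = [m.group(1) for m in pattern.finditer(html)]
print(matches)  # ['http://example.com/a']
```

Only the bare URL survives: the href value is rejected by the first alternative of the lookahead, and both the <a> content and the <p> content are rejected by the second one.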
I would like only <a> tags to be excluded, so the solution could be to modify only the last term to:
[^<>]*?<\/a>
But now there is a problem with nested tags, for example <b></b> inside <a>.
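The nested-tag problem can be reproduced with a short Python sketch. The markup below is a made-up sample; the pattern is the original one with only the last term changed to end in <\/a>.

```python
import re

# Same pattern as before, but the last alternative now requires </a>.
pattern = re.compile(r'((https?|ftps?):\/\/[^"<\s]+)(?![^<>]*?>|[^<>]*?<\/a>)')

# Hypothetical sample: one plain link and one link whose text is wrapped in <b>.
html = (
    '<a href="http://example.com/b">http://example.com/c</a> '
    '<a href="http://example.com/e"><b>http://example.com/f</b></a>'
)

matches = [m.group(1) for m in pattern.finditer(html)]
print(matches)  # ['http://example.com/f']
```

The URL directly inside <a> is excluded, but the one wrapped in <b> is followed by </b> rather than </a>, and [^<>]*? cannot step over the < to reach the closing </a>, so it slips through.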
Here is the example I am working on: https://regex101.com/r/lM3hC5/6 (should be 10 matches).
Negative lookahead is still tricky for me. I thought that the following should work, but it doesn't:
(?!<a.+?<\/a>)
https://regex101.com/r/hT1cG5/1
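A likely reason, sketched in Python below: a lookahead is tested at the current position only, so (?!<a.+?<\/a>) asks whether the text immediately after the URL starts with <a. It does not ask whether the URL is enclosed in an <a> element. The sketch assumes the lookahead is appended right after the URL group; the sample markup is made up.

```python
import re

# URL group followed by the lookahead from the question.
pattern = re.compile(r'((https?|ftps?):\/\/[^"<\s]+)(?!<a.+?<\/a>)')

# Hypothetical sample markup.
html = '<a href="http://example.com/b">http://example.com/c</a>'

matches = [m.group(1) for m in pattern.finditer(html)]
print(matches)  # ['http://example.com/b', 'http://example.com/c']
```

Right after the first URL comes a double quote, and right after the second comes </a>. Neither position starts with <a, so the negative lookahead succeeds both times and nothing is excluded.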
These are the last discussions that helped me:
It turned out that probably the best solution is the following:
((https?|ftps?):\/\/[^"<\s]+)(?![^<>]*>|[^"]*?<\/a)
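It looks like the negative lookahead works properly here only if it starts with a quantified character class and not with a literal string. The lookahead is tested at the position right after the URL, so it has to scan forward from there to the thing we want to detect. In practice this means we can only check what follows the URL, not what precedes it.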
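Again, the first alternative just makes sure that nothing inside HTML tags (attribute values) is touched. The second alternative scans forward from the URL to </a and gives up at the first " symbol. The " symbol is used as the stop character because it is not a valid URL symbol, whereas the < and > symbols cannot be used because they are present whenever tags are nested.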
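Here is a minimal Python check of that pattern against a made-up sample containing a bare URL, an href, a URL nested in <b> inside <a>, and a URL inside <p>. The markup is an assumption for illustration only.

```python
import re

# The pattern given above as the probable best solution.
pattern = re.compile(r'((https?|ftps?):\/\/[^"<\s]+)(?![^<>]*>|[^"]*?<\/a)')

# Hypothetical sample markup.
html = (
    'Visit http://example.com/a <br> '
    '<a href="http://example.com/b"><b>http://example.com/c</b></a> '
    '<p>http://example.com/d</p>'
)

matches = [m.group(1) for m in pattern.finditer(html)]
print(matches)  # ['http://example.com/a', 'http://example.com/d']
```

The href value and the nested URL are excluded, while the bare URL and the one inside <p> are kept. The bare URL survives only because the scan toward </a is stopped by the double quote that opens the href attribute.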
Now URLs in tags nested inside <a> tags are also handled properly. Of course, the regex is not perfect, but it should work with almost any simple HTML markup. You just need to be a bit careful with the following:
- quotes placed within <a> tags;
- <a> tags without any attribute (placeholders): do not use this regex on them;
- multiple nested tags or lines: avoid them unless the URL inside the <a> tag comes after every double quote in that element.
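Two of these caveats can be reproduced with a short Python sketch. Both inputs are made-up samples: the first has an <a> placeholder without attributes, the second has a quoted attribute between the URL and the closing </a>.

```python
import re

pattern = re.compile(r'((https?|ftps?):\/\/[^"<\s]+)(?![^<>]*>|[^"]*?<\/a)')

def find_urls(html):
    """Return the URLs (group 1) that the pattern accepts."""
    return [m.group(1) for m in pattern.finditer(html)]

# Caveat: <a> without attributes. No double quote stops the scan toward
# </a, so a bare URL placed before the placeholder is wrongly excluded.
missed = find_urls('http://example.com/x and <a>placeholder</a>')
print(missed)  # []

# Caveat: a quote inside <a>. The scan stops at the quote in alt="icon"
# before reaching </a, so the URL inside the link is wrongly matched.
leaked = find_urls(
    '<a href="http://example.com/b">http://example.com/c <img alt="icon"></a>'
)
print(leaked)  # ['http://example.com/c']
```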
Here is a very good and messy example (the last match should not be found but it is):
https://regex101.com/r/pC0jR7/2
It is a pity that this lookahead does not work: (?!<a.*?<\/a>)
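For comparison, here is a sketch of a different technique that avoids lookahead altogether. It is not the approach used above. The alternation first consumes whole <a>...</a> elements and any other tag, and captures a URL only when neither of those matched. It assumes that <a> elements are not nested and that attribute values contain no > character; the sample markup is made up.

```python
import re

# Alternatives are tried left to right at each position:
#   1. a whole <a ...>...</a> element (consumed, nothing captured)
#   2. any other tag, including its attributes (consumed, nothing captured)
#   3. a URL in plain text (captured in group 1)
pattern = re.compile(
    r'<a\b[^>]*>.*?</a>|<[^>]*>|((?:https?|ftps?)://[^"<\s]+)',
    re.IGNORECASE | re.DOTALL,
)

# Hypothetical sample markup.
html = (
    'Visit http://example.com/a <br> '
    '<a href="http://example.com/b"><b>http://example.com/c</b></a> '
    '<p>http://example.com/d</p>'
)

# Keep only the matches where the URL alternative was the one that fired.
matches = [m.group(1) for m in pattern.finditer(html) if m.group(1)]
print(matches)  # ['http://example.com/a', 'http://example.com/d']
```

Because the <a> element is swallowed as a unit, nesting depth and quotes inside it no longer matter. For arbitrary real-world HTML, a proper HTML parser would still be the more robust choice.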