JavaScript regex: Find all URLs outside <a> tags

Posted 2020-02-12 17:15

Question:

I have built this regex:

((https?|ftps?):\/\/[^"<\s]+)(?![^<>]*?>|[^<>]*?<\/)

The first group captures every link in the HTML. The negative lookahead that follows rejects a link when it sits inside a tag as an attribute value (first alternative) or when it is element content directly followed by a closing tag (second alternative).
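As a rough illustration of what this pattern accepts and rejects, here is a small JavaScript sketch. The sample markup and the *.example hosts are made up for this illustration and are not taken from the linked regex101 test.

```javascript
// The pattern from the question, with the g flag so match() returns every hit.
const urlOutsideTags = /((https?|ftps?):\/\/[^"<\s]+)(?![^<>]*?>|[^<>]*?<\/)/g;

// Hypothetical sample markup (placeholder hosts).
const sampleHtml =
  'See http://a.example/x and ' +
  '<a href="http://b.example/y">http://c.example/z</a> ' +
  '<p>http://d.example/w</p>';

// match() returns null when nothing matches, so fall back to an empty array.
const matchesOutsideTags = sampleHtml.match(urlOutsideTags) || [];
console.log(matchesOutsideTags); // [ 'http://a.example/x' ]
```

Only the first URL survives. The href value is rejected by the first alternative, and the content of both <a> and <p> is rejected by the second, which is why URLs inside <p> are lost as well.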

I would like only <a> tags to be excluded, so one solution could be to change just the last term to:

[^<>]*?<\/a>

But this breaks when tags are nested, for example <b></b> inside <a>.
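The nesting problem can be reproduced with a short JavaScript sketch. It assumes the modified term simply replaces the second alternative of the lookahead; the markup and hosts are hypothetical.

```javascript
// Question's pattern with the last term changed to [^<>]*?<\/a>.
const onlyAnchorClose = /((https?|ftps?):\/\/[^"<\s]+)(?![^<>]*?>|[^<>]*?<\/a>)/g;

// Hypothetical markup: a <p>, a plain <a>, and an <a> with a nested <b>.
const nestedHtml =
  '<p>http://d.example/w</p> ' +
  '<a href="#">http://b.example/y</a> ' +
  '<a href="#"><b>http://c.example/z</b></a>';

const nestedMatches = nestedHtml.match(onlyAnchorClose) || [];
console.log(nestedMatches); // [ 'http://d.example/w', 'http://c.example/z' ]
```

The URL in <p> is now found and the one directly inside <a> is skipped, as intended. The URL inside <b> slips through because [^<>]*? cannot cross the < of </b> to reach </a>.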

Here is the example I am working on: https://regex101.com/r/lM3hC5/6 (should be 10 matches).

Negative lookahead is still tricky for me. I thought that the following should work, but it doesn't:

(?!<a.+?<\/a>)

https://regex101.com/r/hT1cG5/1
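One possible reading of why this fails, assuming the lookahead is appended directly after the URL group (the linked regex101 test may be arranged differently): a lookahead only inspects the text that starts right after the URL, so it can tell whether an <a>...</a> element begins there, not whether the URL is enclosed by one. A minimal JavaScript sketch with hypothetical markup:

```javascript
// URL group followed by the lookahead that was expected to work.
const naiveLookahead = /((https?|ftps?):\/\/[^"<\s]+)(?!<a.+?<\/a>)/g;

// Hypothetical markup: both URLs belong to the <a> element.
const anchorOnly = '<a href="http://b.example/y">http://c.example/z</a>';

// Right after the first URL comes a double quote, and right after the
// second comes </a>; neither position starts with <a, so the negative
// lookahead succeeds and both URLs are reported.
const naiveMatches = anchorOnly.match(naiveLookahead) || [];
console.log(naiveMatches); // [ 'http://b.example/y', 'http://c.example/z' ]
```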

These are the most recent discussions that helped me:

  • Regex replace text outside html tags

  • Regex replace text but exclude when text is between specific tag

Answer 1:

It turned out that the best solution is probably the following:

((https?|ftps?):\/\/[^"<\s]+)(?![^<>]*>|[^"]*?<\/a)

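It looks like the negative lookahead only works as intended when it starts with a quantified character class rather than a literal string. A lookahead is tested at the position right after the URL, so one that starts with a literal such as <a can only match if that literal appears immediately at that position, while a leading [^...]* lets it scan forward. In practice this means we can only reason backwards from what follows the URL.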

As before, the first alternative makes sure that nothing inside an HTML tag (an attribute value) is touched. The second alternative effectively works backwards from </a to the nearest " symbol. The " is used as the stop character because it is not a valid URL symbol, whereas < and > cannot be used because they occur in nested tags.
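A small JavaScript sketch of the pattern above applied to nested markup. The sample string and hosts are hypothetical and chosen so that a double quote sits between the outer URL and the link.

```javascript
// The pattern from this answer, with the g flag to collect every hit.
const answerRegex = /((https?|ftps?):\/\/[^"<\s]+)(?![^<>]*>|[^"]*?<\/a)/g;

// Hypothetical markup: one URL in <p>, one in href, one in <b> inside <a>.
const answerHtml =
  '<p>http://d.example/w</p> ' +
  '<a href="http://b.example/y"><b>http://c.example/z</b></a>';

const answerMatches = answerHtml.match(answerRegex) || [];
console.log(answerMatches); // [ 'http://d.example/w' ]
```

The URL in <p> is kept because the scan for </a is stopped by the " of the href attribute. The href value is rejected by the first alternative, and the URL inside <b> is rejected because [^"]*? can cross </b> and reach </a.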

Now URLs within nested tags inside <a> tags are also handled properly. Of course, the code is not perfect, but it should work with almost any simple HTML markup. You just need to be a bit careful about the following:

  • placing quotes within <a> tags;
  • using this pattern on <a> tags without any attribute (placeholders), which should be avoided;
  • using multiple nested tags/lines inside <a>, which should be avoided unless the URL inside the <a> tag comes after every double quote.
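The caveats above can be seen in a short JavaScript sketch. The inputs are hypothetical and only meant to trigger each case.

```javascript
// The pattern from this answer.
const answerPattern = /((https?|ftps?):\/\/[^"<\s]+)(?![^<>]*>|[^"]*?<\/a)/g;
const findUrls = (html) => html.match(answerPattern) || [];

// 1. An <a> without attributes: no " separates the outer URL from </a,
//    so the URL outside the link is wrongly skipped.
const bareAnchor = findUrls('http://e.example/v <a>link</a>');
console.log(bareAnchor); // []

// 2. Same text, but the <a> has a quoted attribute: the " stops the scan
//    and the outer URL is found.
const quotedAnchor = findUrls('http://e.example/v <a href="#">link</a>');
console.log(quotedAnchor); // [ 'http://e.example/v' ]

// 3. A quote between a URL inside <a> and </a (here from a nested tag)
//    hides the closing tag, so the inner URL is wrongly reported.
const quoteAfterUrl = findUrls(
  '<a href="#">http://c.example/z <i class="x">i</i></a>'
);
console.log(quoteAfterUrl); // [ 'http://c.example/z' ]
```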


Here is a good, messy example (the last match should not be found, but it is):

https://regex101.com/r/pC0jR7/2

It is a pity that this lookahead does not work: (?!<a.*?<\/a>)
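Since a lookahead placed after the URL cannot see an opening <a> that comes before it, a different technique may be worth a try. This is not the approach used in this answer, only a sketch: match whole <a>...</a> elements and other tags first so they are consumed, and capture a URL only in the last alternative. The function name urlsOutsideAnchors and the sample markup are hypothetical, and the sketch assumes <a> elements are not nested and attribute values contain no > character.

```javascript
// Alternatives 1 and 2 consume <a>...</a> elements and any other tag;
// only alternative 3 captures, so group 1 is set only for URLs in plain text.
const skipOrCapture =
  /<a\b[^>]*>[\s\S]*?<\/a>|<[^>]*>|((?:https?|ftps?):\/\/[^"<\s]+)/gi;

function urlsOutsideAnchors(html) {
  const found = [];
  for (const m of html.matchAll(skipOrCapture)) {
    // m[1] is undefined when a tag or an <a> element was consumed.
    if (m[1] !== undefined) found.push(m[1]);
  }
  return found;
}

// Hypothetical markup: a URL in an attribute, one in <p>, one nested
// inside an attribute-less <a>, and one in trailing text.
const mixedHtml =
  '<p title="http://t.example/t">http://d.example/w</p> ' +
  '<a><b>http://c.example/z</b></a> http://e.example/v';

const outside = urlsOutsideAnchors(mixedHtml);
console.log(outside); // [ 'http://d.example/w', 'http://e.example/v' ]
```

Because the <a> element is consumed as a whole, quotes and nested tags inside it no longer matter. For markup that is not simple, an HTML parser is likely the safer tool.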