I'm trying to figure out a regex that matches all URLs that are not an attribute of an element or the content of a hyperlink.
Should match:
1. This is a url http://www.google.com
Should not match:
1. <a href="http://www.google.com">Google</a>
2. <a href="http://www.google.com">http://www.google.com</a>
3. <img src="http://www.google.com/image.jpg">
4. <div data-url="http://www.google.com"></div>
I'm currently using this regex to match all URLs. I think I know what I have to detect, but I just can't figure out how to express it as a regex.
\\b(https?|ftp|file)://[-a-zA-Z0-9+&@#/%?=~_|!:,.;]*[-a-zA-Z0-9+&@#/%=~_|]
EDITED
What I'm trying to achieve is the following: I want to convert this string
This is a url http://www.google.com <a href="http://www.google.com" title="Go to Google">Google</a><a href="http://www.google.com">http://www.google.com</a><img src="http://www.google.com/image.jpg"><div data-url="http://www.google.com"></div>
To
This is a url <a href="http://www.google.com">http://www.google.com</a> <a href="http://www.google.com" title="Go to Google">Google</a><a href="http://www.google.com">http://www.google.com</a><img src="http://www.google.com/image.jpg"><div data-url="http://www.google.com"></div>
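For illustration, simply running the regex above as a blind global replacement is not enough. A quick sketch (where input is the string shown above; the variable name is only illustrative) would also rewrite the URLs inside the href, src and data-url attributes, and double-wrap the URL that is already the content of the anchor:
// Naive approach, for illustration only: every match gets wrapped,
// including URLs inside attribute values and inside the existing <a> element.
String naive = input.replaceAll(
        "\\b(https?|ftp|file)://[-a-zA-Z0-9+&@#/%?=~_|!:,.;]*[-a-zA-Z0-9+&@#/%=~_|]",
        "<a href=\"$0\">$0</a>");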
Preprocessing by removing the tags and then putting them back doesn't solve the problem, since that actually ends up removing all the data attributes of the existing hyperlink elements. It also doesn't solve the issue of URLs being used in attributes other than href.
So far I haven't found a solution suggested by anyone, and I also haven't found a way to do this using an HTML parser. It actually seems more doable using regex.
EDITED 2
After the attempt based on Dean's suggestion, I'm about ready to rule out an HTML parser, because of its inability to process a string without turning it into a valid HTML document. Here's the code based on the suggested example, plus the fix to handle exclusion case 2.
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Node;
import org.jsoup.nodes.TextNode;
import org.jsoup.select.NodeTraversor;
import org.jsoup.select.NodeVisitor;

Document doc = Jsoup.parseBodyFragment(htmlText);
final List<TextNode> nodesToChange = new ArrayList<TextNode>();
NodeTraversor nd = new NodeTraversor(new NodeVisitor() {
    @Override
    public void tail(Node node, int depth) {
        if (node instanceof TextNode) {
            TextNode textNode = (TextNode) node;
            Node parent = node.parent();
            // Exclusion case 2: skip text that is already the content of a hyperlink.
            if (parent.nodeName().equals("a")) {
                return;
            }
            String text = textNode.getWholeText();
            List<String> allMatches = new ArrayList<String>();
            Matcher m = Pattern.compile("\\b(https?|ftp|file)://[-a-zA-Z0-9+&@#/%?=~_|!:,.;]*[-a-zA-Z0-9+&@#/%=~_|]")
                    .matcher(text);
            while (m.find()) {
                allMatches.add(m.group());
            }
            // Remember every text node that contains at least one URL.
            if (allMatches.size() > 0) {
                nodesToChange.add(textNode);
            }
        }
    }

    @Override
    public void head(Node node, int depth) {
    }
});
nd.traverse(doc.body());
This code adds HTML, HEAD and BODY tags to the result. The only hack I can think of around this issue is to check whether the HTML, HEAD and BODY tags exist in the input string and, if not, strip them out after processing.
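A rough sketch of that check-and-strip workaround (the check and the regex here are only illustrative):
// Only strip the wrapper tags when the original input did not contain them.
// Note: Jsoup's pretty-printing may still reformat whitespace in doc.html().
String output = doc.html();
if (!htmlText.toLowerCase().contains("<html")) {
    output = output.replaceAll("(?i)</?(html|head|body)[^>]*>", "").trim();
}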
I hope someone else has a better suggestion than this hack. Using Jsoup is already very expensive in terms of processing time, so I really don't want to add more overhead if I don't have to.
As discussed in the comments to the question, solving this using a regular expression only is hard (maybe impossible). One approach is a preprocessing step, for example an XSLT stylesheet, that removes all attributes and all anchor tags from the input HTML.
Then you can run your regex to extract the remaining URLs, which will be much simpler.
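A rough sketch of that preprocessing step, expressed here with Jsoup purely for illustration (this is the same idea of dropping attributes and anchors, not the stylesheet itself):
import org.jsoup.nodes.Attribute;
import org.jsoup.nodes.Element;

// Drop every attribute and every <a> element (including its link text),
// then run the URL regex over what is left.
Document stripped = Jsoup.parse(htmlText);
for (Element element : stripped.select("*")) {
    for (Attribute attribute : element.attributes().asList()) {
        element.removeAttr(attribute.getKey());
    }
}
stripped.select("a").remove();
String preprocessed = stripped.body().html();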
If your input html is not valid, then use jtidy, htmlcleaner or htmltidy as a further preprocessing step.
Hope this helps.
Based on the suggestion by Dean and the example he mentioned, here's the "solution" to the problem. Keep in mind that it's a very expensive one, due mainly to the parsing of the HTML string (~160ms on a quad-core/16GB-RAM MBPr). This solution handles both valid and invalid HTML. There is also a little hack around a limitation of Jsoup to make sure extra tags are not added just to produce a valid HTML document. I really hope someone can come up with a better solution, but here it is for now.
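A sketch of what that could look like (an illustration of the approach, not the author's exact code): reuse the traversal from the question above to collect nodesToChange, then replace each collected text node and emit only the body's inner HTML so the extra wrapper tags never show up in the output.
for (TextNode textNode : nodesToChange) {
    // Wrap every URL in this text node in an anchor; $0 is the whole match.
    String linked = Pattern
            .compile("\\b(https?|ftp|file)://[-a-zA-Z0-9+&@#/%?=~_|!:,.;]*[-a-zA-Z0-9+&@#/%=~_|]")
            .matcher(textNode.getWholeText())
            .replaceAll("<a href=\"$0\">$0</a>");
    textNode.after(linked);  // Node.after(String) parses the HTML and inserts it after this node
    textNode.remove();
}
doc.outputSettings().prettyPrint(false);  // keep the original whitespace
String result = doc.body().html();        // inner HTML of <body> only, no <html>/<head>/<body>
Compiling the pattern once, outside the loop, would be a bit cheaper; it is inlined here only to keep the sketch short.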
Expecting Valid HTML Output
Here is a rough guide to get you started.
Parse your HTML something like this:
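For example, a minimal Jsoup-based sketch (the selector and the overall structure here are illustrative assumptions, not a prescribed solution; it assumes Jsoup's Document, Element and TextNode are imported):
Document doc = Jsoup.parse(htmlText);
for (Element element : doc.body().select("*:not(a)")) {
    // Only look at text that is not the content of an <a> element.
    for (TextNode textNode : element.textNodes()) {
        // Run the URL regex against textNode.getWholeText() here
        // and queue matching nodes for replacement.
    }
}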
Edit 1 - The Crazy World of replacing in Invalid HTML
It seems the author of this question has indicated that the content is not valid HTML and that the invalid HTML needs to be maintained. As such, an HTML parser shouldn't be used, because any HTML parser is likely to output valid HTML when saving.
As indicated in my comment on the original question, you can use negative look-behinds in regex. But only a fool would parse HTML with regex; apparently we aren't, so here is one possible example.
I wouldn't use this in production code, but it answers the OP's question.
The RegEx
Unfortunately Java doesn't support unlimited look-behinds, so I have had to include limits on how far back they can look.
Negative Look-behind
Note that this visualisation is incorrect, as [\p{L}0-9_.-] was replaced with [A-Z0-9_.-] to get the visualisation to work, but \p{L} is technically more correct since "Any Unicode Letter" is possible.
Complete Regex
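Something along these lines expresses the idea, with look-behind limits chosen arbitrarily here (up to 200 characters inside the opening tag, up to 100 characters of link text) and the URL part reused from the question rather than from this answer:
(?<!["'=])(?<!<a\b[^>]{0,200}>[^<]{0,100})\b(https?|ftp|file)://[-a-zA-Z0-9+&@#/%?=~_|!:,.;]*[-a-zA-Z0-9+&@#/%=~_|]
The first look-behind rejects a match that directly follows a quote or an equals sign (an attribute value); the second rejects a match that follows an opening <a ...> tag with at most 100 characters of link text in between (the content of a hyperlink).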
The Java
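A sketch of how the pattern above could be wired up (again, the limits and the variable names are assumptions):
Pattern urlOutsideMarkup = Pattern.compile(
        "(?<![\"'=])(?<!<a\\b[^>]{0,200}>[^<]{0,100})\\b(https?|ftp|file)://"
        + "[-a-zA-Z0-9+&@#/%?=~_|!:,.;]*[-a-zA-Z0-9+&@#/%=~_|]");
// Wrap every remaining match in an anchor; $0 is the whole match.
String linked = urlOutsideMarkup.matcher(htmlText).replaceAll("<a href=\"$0\">$0</a>");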
No idea what the performance would be like on this; look-behinds are expensive, and that's on top of regex generally being expensive too.