Java Regex doesn't work with special chars

I have a problem with my parser. I want to read an image link from a website, and this normally works fine. But today I got a link that contains special characters, and the usual regex did not work.

This is what my code looks like:

Pattern t = Pattern.compile(regex.trim());

Matcher x = t.matcher(content[i].toString());
if(x.find())
{
    values[i] = x.group(1);
}

And this is the part of the HTML that causes trouble:

<div class="open-zoomview zoomlink" itemscope="" itemtype="http://schema.org/Product"> 
<img class="zoomLink productImage" src="

http://tnm.scene7.com/is/image/TNM/template_335x300?$plus_335x300$&amp;$image=is{TNM/1098845000_prod_001}&amp;$ausverkauft=1&amp;$0prozent=1&amp;$versandkostenfrei=0" alt="Produkt Atika HB 60 Benzin-Heckenschere" title="Produkt Atika HB 60 Benzin-Heckenschere" itemprop="image" /> 
</div>

And this is the regex I am using to get the part in the src-attribute:

<img .*src="(.*?)" .*>

I believe it has something to do with all the special characters inside the link, but I'm not sure how to escape them. I already tried

Pattern.quote(content[i].toString())

But the outcome was the same: nothing found.

Tags: java regex

4 answers

小情绪 Triste *

#2 · 2020-05-01 10:22

This is probably caused by the newline inside the tag. By default, the . character does not match line terminators.
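A minimal sketch of that behavior, using a simplified pattern rather than the one from the question. The class name, method name, and the example.com URL are made up for illustration:

```java
import java.util.regex.Pattern;

class DotNewlineDemo {
    // Simplified pattern (not the asker's): capture whatever sits between src=" and the next ".
    // No flags are set, so '.' refuses to match line terminators.
    private static final Pattern SRC = Pattern.compile("src=\"(.*?)\"");

    // Returns true if a src attribute can be found in the given HTML fragment.
    static boolean findsSrc(String html) {
        return SRC.matcher(html).find();
    }

    public static void main(String[] args) {
        // Attribute value on one line: found.
        System.out.println(findsSrc("<img src=\"http://example.com/a.png\">"));
        // Attribute value starts with a newline: not found under default flags.
        System.out.println(findsSrc("<img src=\"\nhttp://example.com/a.png\">"));
    }
}
```

The only difference between the two inputs is the newline after the opening quote, which suggests the special characters in the URL are not what breaks the match.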

Have you considered not using regex to parse HTML? Regex-based HTML parsing is notoriously fragile. Please consider using a parsing library such as jsoup instead.


来，给爷笑一个

#3 · 2020-05-01 10:24

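You should actually use <img\s.*?\bsrc=["'](.*?)["'].*?> with the (?s) modifier. The dots must stay unescaped so they act as wildcards; in a Java string literal, double each backslash and escape the double quotes.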
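As a rough sketch, a pattern along these lines could be wired up as follows. The dots are left unescaped so that they act as wildcards, (?s) is placed inline at the start of the pattern, and the result is trimmed because the captured value may begin with whitespace. The class name, method name, and sample URL are assumptions for illustration only:

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class ImgSrcExtractor {
    // (?s) turns on DOTALL so '.' also matches line terminators.
    // The dots are unescaped on purpose: they are wildcards, not literal dots.
    private static final Pattern IMG_SRC =
            Pattern.compile("(?s)<img\\s.*?\\bsrc=[\"'](.*?)[\"'].*?>");

    // Returns the trimmed src value of the first img tag, or null if nothing matches.
    static String extractSrc(String html) {
        Matcher m = IMG_SRC.matcher(html);
        return m.find() ? m.group(1).trim() : null;
    }

    public static void main(String[] args) {
        // Hypothetical input: single-quoted src whose value starts on a new line.
        String html = "<img class=\"x\" src='\n http://example.com/a.png' alt=\"y\" />";
        System.out.println(extractSrc(html));
    }
}
```

One caveat of this form: the capture stops at the first quote character of either kind, so a URL that itself contains an apostrophe or a double quote would be cut short.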


smile是对你的礼貌

#4 · 2020-05-01 10:33

By default, the . character matches any character except line terminators. Therefore, your pattern won't match if there are newlines inside the img tag.

Use Pattern.compile(..., Pattern.DOTALL) or prepend your pattern with (?s).

In dotall mode, the expression . matches any character, including a line terminator. By default this expression does not match line terminators.

http://docs.oracle.com/javase/1.5.0/docs/api/java/util/regex/Pattern.html#DOTALL
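For illustration, here is a small sketch that runs the regex from the question with and without Pattern.DOTALL. The sample markup is a shortened stand-in for the asker's HTML, and the class name, method name, and example.com URL are hypothetical:

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class DotAllDemo {
    // The regex from the question, unchanged.
    static final String REGEX = "<img .*src=\"(.*?)\" .*>";

    // Shortened stand-in for the problematic markup: the src value starts
    // with two newlines and contains '$', '{' and '&amp;'.
    static final String SAMPLE =
            "<div class=\"open-zoomview zoomlink\"> \n"
            + "<img class=\"zoomLink productImage\" src=\"\n\n"
            + "http://example.com/is/image?$a$&amp;$image=is{X/1}\" alt=\"Produkt\" /> \n"
            + "</div>";

    // Returns the trimmed src value, or null if the pattern does not match.
    static String extractSrc(String html, boolean dotAll) {
        Pattern p = dotAll
                ? Pattern.compile(REGEX, Pattern.DOTALL)
                : Pattern.compile(REGEX);
        Matcher m = p.matcher(html);
        // trim() drops the newlines captured at the start of the attribute value.
        return m.find() ? m.group(1).trim() : null;
    }

    public static void main(String[] args) {
        System.out.println(extractSrc(SAMPLE, false)); // default flags: no match
        System.out.println(extractSrc(SAMPLE, true));  // DOTALL: URL is captured
    }
}
```

Note that the captured URL still contains the HTML entity &amp;, so it may need unescaping before it is used as a real link.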


傲

#5 · 2020-05-01 10:39

Your regex should look like this:

String regex = "<img .*src=\"(.*?)\" .*>";
