Question mark (char 57399) added to HTML element t

Posted 2019-09-05 04:19

I've come across a problem that seems really weird to me.

I'm scraping a website using Jsoup:

Elements names = doc.select(".Mod.Thm-inherit").select("h3");

for (Element e : names) {
    System.out.println(e.text());
}

My output is (Fantasy hockey team names, names changed for simplicity):

Team One ?
Team Two ?
Team Three ?
Team Four ?
Team Five ? 
//etc

Now the actual team names don't have the extra space or question mark. Thinking I could just replace it, I tried:

String str = e.text().replaceAll("\\?", "");
System.out.println(str);

This, however, still outputs the question mark at the end, which makes me think it's a character that Eclipse/Java doesn't recognize. (Note: it doesn't display a �; it's really just the generic ?.)

When looking at the HTML code, there are no extra characters though:

<script charset="utf-8" type="text/javascript" language="javascript">
<!-- Bunch of HTML -->
<div class="Grid-u-1-2 Pend-xl"><h3 class="My-xl Ta-c Fz-lg"><a href="/hockey/27381/1">Team One</a>

Anyone know why this is happening?

Edit: I was quickly able to solve the issue by just doing a substring and removing the last 2 characters, but I'd still like to know why it's happening.

Edit2: Playing around with it more, I found that if I (int) cast the question mark, it gives me 57399, instead of ?'s regular 63. So definitely some sort of unknown character issue. Just not sure why it's being added or what that character is supposed to represent.
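For reference, the cast from Edit2 can be reproduced without Jsoup. This is only a minimal sketch: the class name `CodePointCheck` and the sample string are made up for illustration, with `\uE037` standing in for the scraped character (57399 decimal is 0xE037). It checks whether that value falls in Unicode's Private Use Area. If it does, that would be consistent with the console printing a plain ? for a character its encoding can't represent.

```java
public class CodePointCheck {
    // True if the code point is in a Unicode private-use category (Co).
    static boolean isPrivateUse(int codePoint) {
        return Character.getType(codePoint) == Character.PRIVATE_USE;
    }

    public static void main(String[] args) {
        // Hypothetical stand-in for e.text(); \uE037 is code point 57399.
        String text = "Team One \uE037";
        int last = text.codePointAt(text.length() - 1);

        System.out.println(last);               // numeric value of the last character
        System.out.println(isPrivateUse(last)); // is it a private-use character?
        System.out.println(isPrivateUse('?'));  // compare with a real question mark (63)
    }
}
```

Private-use code points have no standard meaning, so what this one represents depends on the site (an icon font is one possibility, but that's a guess).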

1 answer
虎瘦雄心在
#2 · 2019-09-05 05:03

I think there must be extra h3 elements with strange characters inside your ".Mod.Thm-inherit" element.

For a complete solution you must provide more information as @Jim Garrison said.

The following code:

    String html ="<div class=\"Grid-u-1-2 Pend-xl\"><h3 class=\"My-xl Ta-c Fz-lg\"><a href=\"/hockey/27381/1\">Team One</a>";
    Document doc = Jsoup.parse(html);
    Elements names = doc.select("h3");
    for (Element e : names) {
        System.out.println(e.text());
    }

gives me the expected output, Team One, with no strange characters at all.
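If the extra character in your real page does turn out to be a private-use code point such as the 57399 (U+E037) you found, replacing a literal ? will never match it. One option is to strip by Unicode category instead. This is just a sketch under that assumption, not tied to Jsoup; `StripPrivateUse` and `clean` are names I made up, and the sample string stands in for `e.text()`:

```java
public class StripPrivateUse {
    // Remove private-use characters (Unicode category Co), then trim
    // leading/trailing ASCII whitespace left behind.
    static String clean(String raw) {
        return raw.replaceAll("\\p{Co}", "").trim();
    }

    public static void main(String[] args) {
        // Hypothetical stand-in for e.text(); \uE037 is code point 57399.
        String raw = "Team One \uE037";
        System.out.println("[" + clean(raw) + "]");
    }
}
```

Note that `trim()` only removes characters up to U+0020, so if the space before the odd character is a non-breaking space it would need separate handling.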

Hope it helps. Best regards.
