I've come across a problem that seems really weird to me.
I'm scraping a website using Jsoup:
Elements names = doc.select(".Mod.Thm-inherit").select("h3");
for (Element e : names) {
    System.out.println(e.text());
}
My output is (fantasy hockey team names, changed for simplicity):
Team One ?
Team Two ?
Team Three ?
Team Four ?
Team Five ?
//etc
Now the actual team names don't have the extra space or question mark. Thinking I could just replace it, I tried:
String str = e.text().replaceAll("\\?", "");
System.out.println(str);
This, however, still outputs the question mark at the end. I'm thinking this might mean that it's a character that Eclipse/Java doesn't recognize. (Note: it doesn't display a �, it's really just the generic ?)
When looking at the HTML code, there are no extra characters though:
<script charset="utf-8" type="text/javascript" language="javascript">
<!-- Bunch of HTML -->
<div class="Grid-u-1-2 Pend-xl"><h3 class="My-xl Ta-c Fz-lg"><a href="/hockey/27381/1">Team One</a>
Anyone know why this is happening?
Edit: I was quickly able to work around the issue by taking a substring and removing the last 2 characters, but I'd still like to know why it's happening.
Edit2: Playing around with it more, I found that if I cast the question mark to (int), it gives me 57399 instead of ?'s regular 63. So it's definitely some sort of unknown character issue. I'm just not sure why it's being added or what that character is supposed to represent.
I think there must be extra h3 elements with strange characters inside your ".Mod.Thm-inherit" element. For a complete solution you must provide more information, as @Jim Garrison said.
The following code:
Gives me the expected output, Team One, with no strange characters at all. Hope it helps. Best regards.
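For what it's worth, the 57399 from the question's second edit is 0xE037, i.e. U+E037, which sits in Unicode's Private Use Area (U+E000 to U+F8FF). That is not a literal ?, which would explain why replaceAll("\\?", "") matches nothing: the console is probably just showing ? for a character it can't render. Below is a minimal sketch of stripping such characters, assuming the stray character is always a private-use code point and the extra space is an ordinary space. The class and method names are made up for illustration and are not part of Jsoup.

```java
class PrivateUseStripper {

    // Drops every code point whose Unicode general category is
    // "Private Use" (Co), then trims leading/trailing whitespace.
    // Assumption: the unwanted trailing character is a private-use
    // code point such as U+E037 (decimal 57399).
    static String stripPrivateUse(String text) {
        StringBuilder cleaned = new StringBuilder();
        text.codePoints()
            .filter(cp -> Character.getType(cp) != Character.PRIVATE_USE)
            .forEach(cleaned::appendCodePoint);
        return cleaned.toString().trim();
    }

    public static void main(String[] args) {
        // Hypothetical input mimicking what e.text() appears to return.
        String raw = "Team One \uE037";
        System.out.println("[" + stripPrivateUse(raw) + "]");
    }
}
```

In the original loop this would be called as stripPrivateUse(e.text()). If the leftover space turns out to be a non-breaking space rather than a plain one, trim() won't remove it and it would need to be handled separately.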