Trim() in Java not working the way I expect? [dupl

Posted 2019-09-22 01:21

Question:

Possible Duplicate:
Query about the trim() method in Java

I am parsing a site's usernames and other information, and each one has a bunch of spaces after it (plus spaces between the words). For example: "Bob the Builder " or "Sam the welder ". The number of spaces varies from name to name. I figured I'd just use .trim(), since I've used it before. However, it's giving me trouble. My code looks like this:

for (int i = 0; i < splitSource3.size(); i++) {
    splitSource3.set(i, splitSource3.get(i).trim());
}

The result is just the same; no spaces are removed at the end. Thank you in advance for your excellent answers!

UPDATE:

The full code is a bit more complicated, since there are HTML tags that are parsed out first. It goes exactly like this:

for (String s : splitSource2) {
    if (s.length() > "<td class=\"dddefault\">".length() && s.substring(0, "<td class=\"dddefault\">".length()).equals("<td class=\"dddefault\">")) {
        splitSource3.add(s.substring("<td class=\"dddefault\">".length()));
    }
}

System.out.println("\n");
for (int i = 0; i < splitSource3.size(); i++) {
    splitSource3.set(i, splitSource3.get(i).substring(0, splitSource3.get(i).length() - 5));
    splitSource3.set(i, splitSource3.get(i).trim());
    System.out.println(i + ": " + splitSource3.get(i));
}
}

UPDATE:

Calm down. I never said the fault lay with Java, and I never said it was a bug or broken or anything. I simply said I was having trouble with it and posted my code for you to collaborate on and help solve my issue. Note the phrase "my issue" and not "java's issue". I have actually had the code printing out

System.out.println(i + ": " + splitSource3.get(i) + "*");

in a for each loop afterward.

This is how I knew I had a problem. By the way, the problem has still not been fixed.

UPDATE:

Sample output (minus single quotes):

'0: Olin D. Kirkland                                          '
'1: Sophomore                                          '
'2: Someplace, Virginia  12345<br />VA SomeCity<br />'
'3: Undergraduate                                          '

EDIT: The OP rephrased his question at "Query about the trim() method in Java", where the issue was found to be Unicode whitespace characters, which String.trim() does not match.
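
The Unicode whitespace issue mentioned in the edit above can be reproduced with a short sketch. It assumes the trailing characters are non-breaking spaces (U+00A0), which is common for &nbsp; in scraped HTML but is not confirmed by the sample output here. The class and method names (UnicodeTrim, trimUnicode) and the input string are made up for illustration. String.trim() only strips characters with code points up to U+0020, so U+00A0 survives it:

```java
import java.util.regex.Pattern;

class UnicodeTrim {
    // Leading or trailing runs of Unicode separators (\p{Z}, which includes U+00A0),
    // ASCII whitespace (\s), or control characters (\p{Cc}).
    private static final Pattern EDGE_WHITESPACE =
            Pattern.compile("^[\\p{Z}\\s\\p{Cc}]+|[\\p{Z}\\s\\p{Cc}]+$");

    // Hypothetical helper: removes whitespace-like characters from both ends only.
    static String trimUnicode(String s) {
        return EDGE_WHITESPACE.matcher(s).replaceAll("");
    }

    public static void main(String[] args) {
        // Assumed input: a name followed by three non-breaking spaces.
        String scraped = "Olin D. Kirkland\u00A0\u00A0\u00A0";
        System.out.println("[" + scraped.trim() + "]");        // non-breaking spaces remain
        System.out.println("[" + trimUnicode(scraped) + "]");  // prints "[Olin D. Kirkland]"
    }
}
```

Spaces between the words are left alone because the pattern is anchored to the start and end of the string. Note that String.strip() (Java 11+) relies on Character.isWhitespace(), which excludes non-breaking spaces, so it would not remove U+00A0 either.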

Answer 1:

It just occurred to me that I used to have this sort of issue when I worked on a screen-scraping project. The key is that downloaded HTML sources sometimes contain non-printable characters that are not whitespace characters either. These are very difficult to copy and paste into a browser. I assume this could have happened to you.

If my assumption is correct then you've got two choices:

  1. Use a binary reader to figure out what those characters are, then delete them with String.replace(). E.g.:

    private static String cutCharacters(String fromHtml) {
        String result = fromHtml;
        char[] problematicCharacters = {'\000', '\001', '\003'}; // this could be a private static final constant too
        for (char ch : problematicCharacters) {
            // String.replace has no (char, String) overload, so convert the char to a String first.
            // Strings are immutable: fromHtml itself is never modified, only the local copy is reassigned.
            result = result.replace(String.valueOf(ch), "");
        }
        return result;
    }
    
  2. If you find some sort of recurring pattern in the HTML to be parsed, then you can use regexes and substrings to cut the unwanted parts. E.g.:

    // Requires: import java.util.regex.Matcher; import java.util.regex.Pattern;
    private String getImportantParts(String fromHtml) {
        Pattern p = Pattern.compile("(\\w*\\s*)"); // this could be a private static final constant as well
        Matcher m = p.matcher(fromHtml);
        StringBuilder buff = new StringBuilder();
        while (m.find()) {
            buff.append(m.group(1));
        }
        return buff.toString().trim();
    }
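
For the first choice, a binary reader is not strictly necessary: the offending characters can usually be identified by printing the code points at the end of a string. A minimal sketch, with a hypothetical class and helper (CodePointDump, describeTail) and a made-up input value:

```java
class CodePointDump {
    // Returns the last `count` code points of s as "U+XXXX" tokens separated by spaces.
    static String describeTail(String s, int count) {
        int[] codePoints = s.codePoints().toArray();
        int start = Math.max(0, codePoints.length - count);
        StringBuilder out = new StringBuilder();
        for (int i = start; i < codePoints.length; i++) {
            if (out.length() > 0) {
                out.append(' ');
            }
            out.append(String.format("U+%04X", codePoints[i]));
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // Assumed scraped value: a regular space followed by a non-breaking space.
        String scraped = "Sophomore \u00A0";
        System.out.println(describeTail(scraped, 3)); // prints "U+0065 U+0020 U+00A0"
    }
}
```

Any value above U+0020 in the tail (such as U+00A0 here) is a character that trim() will leave in place, and it can then be added to the list of characters to remove.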
    


Answer 2:

Works without a problem for me.

Here is your code, slightly refactored and (maybe) more readable:

// Requires: import java.util.ArrayList; import java.util.List;
final String openingTag = "<td class=\"dddefault\">";
final String closingTag = "</td>";
List<String> splitSource2 = new ArrayList<String>();
splitSource2.add(openingTag + "Bob the Builder " + closingTag);
splitSource2.add(openingTag + "Sam the welder " + closingTag);
for (String string : splitSource2) {
    System.out.println("|" + string + "|");
}
List<String> splitSource3 = new ArrayList<String>();
for (String s : splitSource2) {
    if (s.length() > openingTag.length() && s.startsWith(openingTag)) {
        String nameWithoutOpeningTag = s.substring(openingTag.length());
        splitSource3.add(nameWithoutOpeningTag);
    }
}

System.out.println("\n");
for (int i = 0; i < splitSource3.size(); i++) {
    String name = splitSource3.get(i);
    int closingTagBegin = name.length() - closingTag.length();
    String nameWithoutClosingTag = name.substring(0, closingTagBegin);
    String nameTrimmed = nameWithoutClosingTag.trim();
    splitSource3.set(i, nameTrimmed);
    System.out.println("|" + splitSource3.get(i) + "|");
}

I know that's not a real answer, but I cannot post comments, and this code wouldn't fit in a comment anyway, so I made it an answer so that Olin Kirkland can check his code.