Take string input, parse each word to all lowercase

Posted 2019-08-27 01:00

I'm trying to take a string input, parse each word to all lowercase and print each word on a line (in sorted order), ignoring non-alphabetic characters (single letter words count as well). So,

Sample input:

Adventures in Disneyland

Two blondes were going to Disneyland when they came to a fork in the
road. The sign read: "Disneyland Left."

So they went home.

Output:

a
adventures
blondes
came
disneyland
fork
going
home
in
left
read
road
sign
so
the
they
to
two
went
were
when

My program:

        Scanner reader = new Scanner(file);
        ArrayList<String> words = new ArrayList<String>();
        while (reader.hasNext()) {
            String word = reader.next();
            if (word != "") {
                word = word.toLowerCase();
                word = word.replaceAll("[^A-Za-z ]", "");
                if (!words.contains(word)) {
                    words.add(word);
                }
            }
        }
        Collections.sort(words);
        for (int i = 0; i < words.size(); i++) {
            System.out.println(words.get(i));
        }

This works for the input above, but prints the wrong output for an input like this:

a  t\|his@ is$ a)( -- test's-&*%$#-`case!@|?

The expected output should be

a
case
his
is
s
t
test

The output I get is

*a blank line is printed first*
a
is
testscase
this

So my program obviously doesn't work: scanner.next() reads characters until it hits whitespace and treats that as one string, whereas anything that is not a letter should be treated as a break between words. I'm not sure how to configure Scanner so that non-alphabetic characters, rather than whitespace, act as the breaks, so that's where I'm stuck right now.

2 answers
smile是对你的礼貌
Post #2 · 2019-08-27 02:03

The other answer has already mentioned some issues with your code.

I suggest another approach to address your requirements. Such transformations are a good use case for Java Streams, which often yield clean code:

List<String> strs = Arrays.stream(input.split("[^A-Za-z]+"))
    .filter(t -> !t.isEmpty()) // split() returns a leading "" if the input starts with a non-letter
    .map(t -> t.toLowerCase())
    .distinct()
    .sorted()
    .collect(Collectors.toList());

Here are the steps:

  1. Split the string on runs of one or more consecutive non-alphabetic characters:

    input.split("[^A-Za-z]+")
    

    This yields tokens consisting solely of alphabetic characters, plus one empty leading token if the input starts with a non-alphabetic character.

  2. Stream over the resulting array using Arrays.stream();

  3. Map each element to its lowercase equivalent:

    .map(t -> t.toLowerCase())
    

    The default locale is used. Use toLowerCase(Locale) to explicitly set the locale.

  4. Discard duplicates using Stream.distinct().

  5. Sort the elements within the stream by simply calling sorted();

  6. Collect the elements into a List with collect().
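
The steps above can be put together as a small self-contained sketch. The class name `SortedWords` and the method name `sortedWords` are made up for illustration, and using `Locale.ROOT` instead of the default locale is my own choice. The extra `filter` drops the empty token that `split()` produces when the input begins with a non-letter:

```java
import java.util.Arrays;
import java.util.List;
import java.util.Locale;
import java.util.stream.Collectors;

// Hypothetical class name, used only for this sketch.
class SortedWords {

    // Splits on runs of non-letters, lowercases, removes duplicates, sorts.
    static List<String> sortedWords(String input) {
        return Arrays.stream(input.split("[^A-Za-z]+"))
            .filter(t -> !t.isEmpty())              // drop a possible leading empty token
            .map(t -> t.toLowerCase(Locale.ROOT))   // assumption: locale-independent lowercasing
            .distinct()
            .sorted()
            .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        // Second sample input from the question (the backslash is escaped).
        String input = "a  t\\|his@ is$ a)( -- test's-&*%$#-`case!@|?";
        sortedWords(input).forEach(System.out::println);
    }
}
```

With the second sample input from the question this should print a, case, his, is, s, t, test, one word per line.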


If you need to read it from a file, you could use this:

Files.lines(filepath)
    .flatMap(line -> Arrays.stream(line.split("[^A-Za-z]+")))
    .map(... // Et cetera
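
A fuller sketch of the file variant might look like the following. The names `FileWords` and `sortedWords` are hypothetical, the file is assumed to be UTF-8 (which is what `Files.lines(Path)` decodes by default), and the try-with-resources block is there because the stream returned by `Files.lines` holds the file open until it is closed:

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.Arrays;
import java.util.List;
import java.util.Locale;
import java.util.stream.Collectors;
import java.util.stream.Stream;

// Hypothetical class name, used only for this sketch.
class FileWords {

    // Reads the file line by line and returns its distinct lowercase words, sorted.
    static List<String> sortedWords(Path file) throws IOException {
        try (Stream<String> lines = Files.lines(file)) {   // closes the file when done
            return lines
                .flatMap(line -> Arrays.stream(line.split("[^A-Za-z]+")))
                .filter(t -> !t.isEmpty())                 // blank lines and leading non-letters yield ""
                .map(t -> t.toLowerCase(Locale.ROOT))      // assumption: locale-independent lowercasing
                .distinct()
                .sorted()
                .collect(Collectors.toList());
        }
    }

    public static void main(String[] args) throws IOException {
        if (args.length != 1) {
            System.err.println("usage: java FileWords <file>");
            return;
        }
        sortedWords(Paths.get(args[0])).forEach(System.out::println);
    }
}
```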

But if you need to use a Scanner, you could use something like this:

Scanner s = new Scanner(input)
    .useDelimiter("[^A-Za-z]+");
List<String> parts = new ArrayList<>();
while (s.hasNext()) {
    parts.add(s.next());
}

And then

List<String> strs = parts.stream()
    .map(... // Et cetera
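
As a rough end-to-end sketch of the Scanner route: the version below swaps the stream pipeline for a `TreeSet`, which keeps its elements sorted and free of duplicates, so no separate `distinct()`/`sorted()` step is needed. The class and method names are hypothetical and `Locale.ROOT` is again my assumption:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.Scanner;
import java.util.TreeSet;

// Hypothetical class name, used only for this sketch.
class ScannerWords {

    // Tokenizes with a non-letter delimiter and collects distinct lowercase words in sorted order.
    static List<String> sortedWords(String input) {
        TreeSet<String> words = new TreeSet<>();   // sorted, no duplicates
        try (Scanner s = new Scanner(input).useDelimiter("[^A-Za-z]+")) {
            while (s.hasNext()) {
                words.add(s.next().toLowerCase(Locale.ROOT));
            }
        }
        return new ArrayList<>(words);
    }

    public static void main(String[] args) {
        // Second sample input from the question (the backslash is escaped).
        String input = "a  t\\|his@ is$ a)( -- test's-&*%$#-`case!@|?";
        sortedWords(input).forEach(System.out::println);
    }
}
```

The same Scanner setup should work with `new Scanner(file)` as in the question, since only the delimiter changes.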
Post #3 · 2019-08-27 02:05

Don't use == or != for comparing String(s). Also, perform your transform before you check for empty. This,

if (word != "") {
    word = word.toLowerCase();
    word = word.replaceAll("[^A-Za-z ]", "");
    if (!words.contains(word)) {
        words.add(word);
    }
}

should look something like

word = word.toLowerCase().replaceAll("[^a-z ]", "").trim();
if (!word.isEmpty() && !words.contains(word)) {
    words.add(word);
}
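
For reference, here is that loop body dropped into a complete sketch. The class and method names are hypothetical, `Locale.ROOT` is an assumption on my part, and it reads from a String rather than the file so that it stays self-contained. It keeps Scanner's default whitespace delimiter, so it only gets rid of the blank entry; a token such as test's-&*%$#-`case still collapses into a single word:

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Locale;
import java.util.Scanner;

// Hypothetical class name, used only for this sketch.
class CleanThenCheck {

    // Cleans each whitespace-separated token first, then skips empty and duplicate results.
    static List<String> sortedWords(String input) {
        List<String> words = new ArrayList<>();
        try (Scanner reader = new Scanner(input)) {   // default delimiter: whitespace
            while (reader.hasNext()) {
                String word = reader.next()
                    .toLowerCase(Locale.ROOT)
                    .replaceAll("[^a-z ]", "")
                    .trim();
                if (!word.isEmpty() && !words.contains(word)) {
                    words.add(word);
                }
            }
        }
        Collections.sort(words);
        return words;
    }

    public static void main(String[] args) {
        String input = "a  t\\|his@ is$ a)( -- test's-&*%$#-`case!@|?";
        sortedWords(input).forEach(System.out::println);
    }
}
```

For the second sample input this should print a, is, testscase, this. Getting the expected output would additionally need a non-letter delimiter, as described in the other answer.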