Collect HashSet / Java 8 / Regex Pattern / Stream

Posted 2019-02-03 11:13

Question:

I recently switched my project from JDK 7 to JDK 8, and I am now rewriting some code snippets to use the new features that came with Java 8.

final Matcher mtr = Pattern.compile(regex).matcher(input);

HashSet<String> set = new HashSet<String>() {{
    while (mtr.find()) add(mtr.group().toLowerCase());
}};

How can I write this code using the Stream API?

Answer 1:

A Matcher-based spliterator implementation can be quite simple if you reuse the JDK-provided Spliterators.AbstractSpliterator:

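import java.util.Spliterators.AbstractSpliterator;
import java.util.function.Consumer;
import java.util.regex.Matcher;

public class MatcherSpliterator extends AbstractSpliterator<String[]>
{
  private final Matcher m;

  public MatcherSpliterator(Matcher m) {
    // The number of matches is unknown up front, so report Long.MAX_VALUE as the size estimate.
    super(Long.MAX_VALUE, ORDERED | NONNULL | IMMUTABLE);
    this.m = m;
  }

  // Emits one String[] per match: index 0 is the full match, index i is capture group i.
  @Override public boolean tryAdvance(Consumer<? super String[]> action) {
    if (!m.find()) return false;
    final String[] groups = new String[m.groupCount()+1];
    for (int i = 0; i <= m.groupCount(); i++) groups[i] = m.group(i);
    action.accept(groups);
    return true;
  }
}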

Note that the spliterator provides all matcher groups, not just the full match. Also note that this spliterator supports (limited) parallelism, because AbstractSpliterator supplies a trySplit implementation that hands off batches of elements.

Typically you will use a convenience stream factory:

public static Stream<String[]> matcherStream(Matcher m) {
  return StreamSupport.stream(new MatcherSpliterator(m), false);
}

This gives you a powerful basis to concisely write all kinds of complex regex-oriented logic, for example:

// Requires: import static java.util.stream.Collectors.joining;
private static final Pattern emailRegex = Pattern.compile("([^,]+?)@([^,]+)");
public static void main(String[] args) {
  final String emails = "kid@gmail.com, stray@yahoo.com, miks@tijuana.com";
  System.out.println("User has e-mail accounts on these domains: " +
      matcherStream(emailRegex.matcher(emails))
      .map(gs->gs[2])
      .collect(joining(", ")));
}

This prints:

User has e-mail accounts on these domains: gmail.com, yahoo.com, tijuana.com

For completeness, your code can be rewritten as follows (with a static import of Collectors.toSet):

Set<String> set = matcherStream(mtr).map(gs->gs[0].toLowerCase()).collect(toSet());
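Putting the pieces of this answer together, a self-contained version might look like the sketch below. The class name MatcherStreamDemo, the nested GroupSpliterator, the helper lowerCaseMatches, and the sample regex and input are all made up for illustration; only the spliterator logic and the matcherStream factory come from the answer.

```java
import java.util.Set;
import java.util.Spliterators.AbstractSpliterator;
import java.util.function.Consumer;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
import java.util.stream.Collectors;
import java.util.stream.Stream;
import java.util.stream.StreamSupport;

// Hypothetical holder class; the name is chosen only for this sketch.
final class MatcherStreamDemo {

  // Same idea as MatcherSpliterator above, nested here so the file compiles on its own.
  static final class GroupSpliterator extends AbstractSpliterator<String[]> {
    private final Matcher m;

    GroupSpliterator(Matcher m) {
      super(Long.MAX_VALUE, ORDERED | NONNULL | IMMUTABLE);
      this.m = m;
    }

    @Override
    public boolean tryAdvance(Consumer<? super String[]> action) {
      if (!m.find()) return false;
      String[] groups = new String[m.groupCount() + 1];
      for (int i = 0; i <= m.groupCount(); i++) groups[i] = m.group(i);
      action.accept(groups);
      return true;
    }
  }

  // Sequential stream of matches, one String[] of groups per match.
  static Stream<String[]> matcherStream(Matcher m) {
    return StreamSupport.stream(new GroupSpliterator(m), false);
  }

  // The question's loop, expressed as a stream pipeline.
  static Set<String> lowerCaseMatches(String regex, String input) {
    return matcherStream(Pattern.compile(regex).matcher(input))
        .map(gs -> gs[0].toLowerCase())
        .collect(Collectors.toSet());
  }

  public static void main(String[] args) {
    // Sample regex and input are assumptions, not taken from the question.
    System.out.println(lowerCaseMatches("[A-Za-z]+", "Foo bar FOO Baz"));
  }
}
```

Because the result is a Set, the iteration order of the printed elements is unspecified.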


Answer 2:

Marko's answer demonstrates how to get matches into a stream using a Spliterator. Well done, give that man a big +1! Seriously, make sure you upvote his answer before you even consider upvoting this one, since this one is entirely derivative of his.

I have only a small bit to add to Marko's answer, which is that instead of representing the matches as an array of strings (with each array element representing a match group), the matches are better represented as a MatchResult which is a type invented for this purpose. Thus the result would be a Stream<MatchResult> instead of Stream<String[]>. The code gets a little simpler, too. The tryAdvance code would be

    if (m.find()) {
        action.accept(m.toMatchResult());
        return true;
    } else {
        return false;
    }

The map call in his email-matching example would change to

    .map(mr -> mr.group(2))

and the OP's example would be rewritten as

Set<String> set = matcherStream(mtr)
                      .map(mr -> mr.group(0).toLowerCase())
                      .collect(toSet());

Using MatchResult gives a bit more flexibility in that it also provides offsets of match groups within the string, which could be useful for certain applications.
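To illustrate the point about offsets, here is a minimal sketch of a Stream<MatchResult> factory that uses start() and end() on each result. The class name MatchResultStreamDemo, the helper describeMatches, and the "text@start-end" output format are hypothetical; the spliterator is written as an anonymous AbstractSpliterator subclass only to keep the example short.

```java
import java.util.List;
import java.util.Spliterator;
import java.util.Spliterators.AbstractSpliterator;
import java.util.function.Consumer;
import java.util.regex.MatchResult;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
import java.util.stream.Collectors;
import java.util.stream.Stream;
import java.util.stream.StreamSupport;

// Hypothetical holder class for this sketch.
final class MatchResultStreamDemo {

  // Streams one MatchResult per successful find() call.
  static Stream<MatchResult> matcherStream(Matcher m) {
    Spliterator<MatchResult> spliterator =
        new AbstractSpliterator<MatchResult>(
            Long.MAX_VALUE,
            Spliterator.ORDERED | Spliterator.NONNULL | Spliterator.IMMUTABLE) {
          @Override
          public boolean tryAdvance(Consumer<? super MatchResult> action) {
            if (!m.find()) return false;
            // toMatchResult() returns a snapshot that later find() calls do not affect.
            action.accept(m.toMatchResult());
            return true;
          }
        };
    return StreamSupport.stream(spliterator, false);
  }

  // Describes each match as "text@start-end" using the offsets MatchResult exposes.
  static List<String> describeMatches(String regex, String input) {
    return matcherStream(Pattern.compile(regex).matcher(input))
        .map(mr -> mr.group() + "@" + mr.start() + "-" + mr.end())
        .collect(Collectors.toList());
  }
}
```

For example, describeMatches("[a-z]+", "ab cde") yields the two entries "ab@0-2" and "cde@3-6", where end() is the index just past the last matched character.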



Answer 3:

I don't think you can turn this into a Stream without writing your own Spliterator, but I don't know why you would want to.

Matcher.find() is a state-changing operation on the Matcher object, so running each find() in a parallel stream would produce inconsistent results. Running the stream serially wouldn't perform better than the Java 7 equivalent and would be harder to understand.
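For comparison, the plain loop this answer prefers could be written as in the sketch below, without the double-brace initialization from the question (which creates an anonymous subclass of HashSet). The class name PlainLoopDemo and the method name lowerCaseMatches are assumptions made for the example.

```java
import java.util.HashSet;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical holder class for this sketch.
final class PlainLoopDemo {

  // Collects every match of regex in input, lower-cased, into a set.
  static Set<String> lowerCaseMatches(String regex, String input) {
    Matcher m = Pattern.compile(regex).matcher(input);
    Set<String> set = new HashSet<>();
    while (m.find()) {
      set.add(m.group().toLowerCase());
    }
    return set;
  }
}
```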



Answer 4:

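What about Pattern.splitAsStream? Note that it splits the input around matches of the pattern, so the stream contains the text between matches rather than the matches themselves. It fits only if regex describes the delimiters.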

Stream<String> stream = Pattern.compile(regex).splitAsStream(input);

Then use a collector to get a set:

Set<String> set = stream.map(String::toLowerCase).collect(Collectors.toSet());
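A small sketch of this approach, assuming the pattern describes the separators between the wanted tokens (here \W+, one or more non-word characters). The class name SplitAsStreamDemo, the method name lowerCaseTokens, and the sample input are hypothetical.

```java
import java.util.Set;
import java.util.regex.Pattern;
import java.util.stream.Collectors;

// Hypothetical holder class for this sketch.
final class SplitAsStreamDemo {

  // Splits input around matches of delimiterRegex and lower-cases the pieces.
  static Set<String> lowerCaseTokens(String delimiterRegex, String input) {
    return Pattern.compile(delimiterRegex)
        .splitAsStream(input)
        .map(String::toLowerCase)
        .collect(Collectors.toSet());
  }

  public static void main(String[] args) {
    // "\\W+" treats any run of non-word characters as a separator.
    System.out.println(lowerCaseTokens("\\W+", "Foo, bar FOO"));
  }
}
```

If the pattern from the question describes the tokens themselves instead of the separators, it would have to be inverted before this approach gives the same set as the original loop.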


Answer 5:

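What about this:

import java.io.File;
import java.io.FileNotFoundException;
import java.util.HashSet;
import java.util.Scanner;

public class MakeItSimple {

  public static void main(String[] args) throws FileNotFoundException {
    // try-with-resources closes the Scanner even if an exception is thrown
    try (Scanner s = new Scanner(new File("C:\\Users\\Admin\\Desktop\\TextFiles\\Emails.txt"))) {
      HashSet<String> set = new HashSet<>();
      // Scanner splits on whitespace by default; keep only tokens that look like e-mail addresses
      while (s.hasNext()) {
        String r = s.next();
        if (r.matches("([^,]+?)@([^,]+)")) {
          set.add(r);
        }
      }
      set.stream().map(String::toUpperCase).forEach(System.out::println);
    }
  }
}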


Answer 6:

Here is an implementation using the Spliterator interface.

    // To get the required set
    Set<String> result = StreamSupport.stream(new MatcherGroupIterator(pattern, input), false)
            .map(s -> s.toLowerCase())
            .collect(Collectors.toSet());
    ...
    private static class MatcherGroupIterator implements Spliterator<String> {
      private final Matcher matcher;

      public MatcherGroupIterator(Pattern p, String s) {
        matcher = p.matcher(s);
      }

      @Override
      public boolean tryAdvance(Consumer<? super String> action) {
        if (!matcher.find()){
            return false;
        }
        action.accept(matcher.group());
        return true;
      }

      @Override
      public Spliterator<String> trySplit() {
        return null;
      }

      @Override
      public long estimateSize() {
        return Long.MAX_VALUE;
      }

      @Override
      public int characteristics() {
        return Spliterator.NONNULL;
      }
  }