Detect duplicated groups in stream

2019-04-05 11:40发布

问题:

I want to ensure that all numbers in the list are grouped together. Let me explain this on examples:

{1, 1, 1, 2, 2}    // OK, two distinct groups
{1, 1, 2, 2, 1, 1} // Bad, two groups with "1"
{1, 2, 3, 4}       // OK, 4 distinct groups of size 1
{1, 1, 1, 1}       // OK, 1 group
{3, 4, 3}          // Bad, two groups with "3"
{99, -99, 99}      // Bad, two groups with "99"
{}                 // OK, no groups

Here's how I obtain the stream:

IntStream.of(numbers)
    ...

Now I need to pass or return true for "OK" examples and throw AssertionError or return false on "Bad" examples. How can I do that using Stream API?

Here's my current solution with additional Set created:

Set<Integer> previousNumbers = new HashSet<>();
IntStream.of(numbers)
        .reduce(null, (previousNumber, currentNumber) -> {
                    if (currentNumber == previousNumber) {
                        assertThat(previousNumbers).doesNotContain(currentNumber);
                        previousNumbers.add(currentNumber);
                    }
                    return currentNumber;
                }
        );

回答1:

Using my free StreamEx library:

IntStreamEx.of(numbers).boxed().runLengths().toMap();

This code will throw IllegalStateException if there are repeating groups.

Here runLengths() method is used. It collapses equal adjacent elements replacing them with Map.Entry where key is the input element and value is the number of repeats. Finally toMap() is used which is a shortcut for .collect(Collectors.toMap(Entry::getKey, Entry::getValue)). We are using the fact that .toMap() throws IllegalStateException when keys repeat (unless custom mergeFunction is supplied).

As a free bonus on successful execution you will have a map where keys are input elements and values are lengths of series.



回答2:

In my opinion this problem doesn't fit the Stream API at all but I was curious how this could be implemenented (however in a performant way).

The problem is that you have to keep track of seen elements and the whole test should have a short-circuit behaviour. So I came up with this solution (without Streams):

public static boolean hasUniqueGroups(int[] arr) {
    Objects.requireNonNull(arr);
    Set<Integer> seen = new HashSet<>();
    for (int i = 0; i < arr.length; i++) {
        if (i == 0 || arr[i] != arr[i - 1]) {
            if (!seen.add(arr[i])) {
                return false;
            }
        }
    }
    return true;
}

The next step is to introduce the Stream API and the solution looks like this:

public static boolean hasUniqueGroups(int[] arr) {
    Objects.requireNonNull(arr);
    Set<Integer> seen = new HashSet<>();
    return IntStream.range(0, arr.length)
            .filter(i -> i == 0 || arr[i] != arr[i - 1])
            .mapToObj(i -> arr[i])
            .allMatch(seen::add);
}

Note: In order to parallelize this Stream you should use a thread-safe Set.



回答3:

More of an addition to what has been said already, we could try to answer this question using collect method. The problem with this approach (as others have indicated) is that a reduction operations do not terminate quickly.

Generally, to short-circuit a long reduction operation, we can short-circuit the reduction function. This way, although we still iterate through all items in the stream, the amount of work required is minimal.

public static boolean hasUniqueGroups(int... arr) {
    return !IntStream
        .of(arr) 
        .collect(
                Container::new, // 1
                (container, current) -> {
                    if (container.skip) return; // 2
                    if (current != container.previous) {
                        container.previous = current;
                        if (!container.integers.add(current))
                            container.skip = true; // 3
                    }
                },
                (c1, c2) -> {
                    if (c1.skip != c2.skip) {
                        c1.skip = true;
                        c1.integers.addAll(c2.integers);
                    }
                }
        )
        .skip;
}

private static class Container {
    private int previous = MAX_VALUE; // 4
    private boolean skip = false;
    private Set<Integer> integers = new HashSet<>();
}
  1. We create Supplier which will create new Container for each computation. Container (amongst other things) will hold information if we should continue or skip computation.
  2. If at some point we encountered non-unique group, we will skip the entire computation.
  3. If we are currently at the beginning of a new group, we check if it is unique. If not, we decide to skip the rest of the stream.
  4. This is a poor hack to solve the problem when we have sequence {0, 1, 0}. Of course, this solution will not work for i.e. {MAX_VALUE, 0, MAX_VALUE}. I decided to leave this problem for simplicity reason.

We can check the performance by replacing

IntStream.of(arr)

to

IntStream.concat(IntStream.of(1, 2), IntStream.range(1, Integer.MAX_VALUE))

which returns false. This of course will not work for infinite streams, but checking unique groups in infinite stream does not really make sense.