Special order on String with the power of Java 8

2019-06-14 20:25发布

问题:

I've got a finite list (or array) of strings. This list shall describe my new order by the following means: A string s1 is less than a string s2 iff one of the next three statements holds

  • both are in the list an s1 has a lower index in the list than s2
  • both aren't in the list and s1 is less than s2 by the natural order on strings
  • s1 is in the list and s2 is not

I want to use the power of Java 8 to write a one line Comparator. What is the shortest such line? You can assume my list to be an ArrayList<String>.

Edit: I don't want to order the list I'm talking about. I want to use that list to define an order which allows me to order many other lists.

回答1:

The shortest Comparator definition you can get for your task is:

Comparator.comparingInt((String s) -> reference.indexOf(s)+Integer.MIN_VALUE)
          .thenComparing(Comparator.naturalOrder())

but it requires the reader to understand the +Integer.MIN_VALUE idiom in integer comparisons. It basically means “perform an unsigned comparison” (see also Integer.compareUnsigned(…)), so that the value -1, returned by List.indexOf for absent values, is treated as the highest possible number, so that every index of a present value is considered to be smaller than that, to meet your specification.

Then, you can simply chain another comparator, here the natural order, for the case both indices are the same, which includes the possibility that both are absent.

If you think that the unsigned comparison is to hard to grasp for the reader, you have to encode your three cases explicitly, as already shown in tobias_k’s answer.



回答2:

Use Comparator.comparing to define the key function for the "is-in-list" comparison, and append a thenComparing for the lexicographical ordering.

List<String> reference = Arrays.asList("foo", "bar", "blub");
List<String> toBeSorted = Arrays.asList("foo", "ccc", "AAAA", "bbb", "blub");

Collections.sort(toBeSorted, Comparator
        .comparingInt((String s) -> reference.contains(s) ? reference.indexOf(s) : Integer.MAX_VALUE)
        .thenComparing(Comparator.naturalOrder()));

Or maybe a bit cleaner, using three comparators, for containment, position, and lexicographic:

Collections.sort(toBeSorted, Comparator
        .comparing((String s) -> ! reference.contains(s))
        .thenComparingInt(reference::indexOf)
        .thenComparing(Comparator.naturalOrder()));

(Using just reference::contains does not work here; seems to see it as a Comparator<Object> then, instead of Comparator<String>)

Afterwards, in both cases, toBeSorted is [foo, blub, AAAA, bbb, ccc].


Update: The problem with this approach is that it is pretty wasteful: For each comparison, I first check whether the element is in the list, then iterate the list again to find the index.

You can improve this by using the index + Integer.MIN_VALUE trick as pointed out by Holger (if you like this, please upvote his answer).

Collections.sort(toBeSorted, Comparator
        .comparingInt((String s) -> reference.indexOf(s) + Integer.MIN_VALUE)
        .thenComparing(Comparator.naturalOrder()));

This uses the fact that -1 + Integer.MIN_VALUE produces an integer overflow (or rather underflow), i.e. the result is not Integer.MIN_VALUE - 1 but Integer.MAX_VALUE. In any other situation 'clever hacks' like this are heavily frowned upon, but here it reduces the complexity of the comparison by 50%.

But if speed is an issue, we can do better than that! Even with Holger's trick, the reference list is looped multiple times, at least once for each element in the toBeSorted list (and in fact much more often than that, as the comparator is not caching the values). If the lists are large, then it might be worth creating a hashmap, mapping strings to their position in the reference list, and then using that hashmap (with O(1) lookup time) in the actual sort.

Map<String, Integer> index = IntStream.range(0, reference.size()).boxed()
        .collect(Collectors.toMap(reference::get, Function.identity()));
Collections.sort(toBeSorted, Comparator
        .comparingInt((String s) -> index.getOrDefault(s, Integer.MAX_VALUE))
        .thenComparing(Comparator.naturalOrder()));

So, instead of 2*n*logn times (assuming nlogn comparisons with 2 indexof operations each), this will loop the reference list just once.


Update 2: Instead of manually precalculating the indices for each element in the list, you could also define a generic memoize function, caching each comparison key when it is first calculated:

public static <X, Y> Function<X, Y> memoize(Function<X, Y> function) {
    Map<X, Y> cache = new IdentityHashMap<>();
    return (X x) -> cache.computeIfAbsent(x, function);
}

You can use it like Comparator.comparing(memoize(originalKeyFunction)). You can use this for all sorts of comparisons (I'm wondering why Comparator.comparing isn't doing this by default).

For longer lists, caching the indices (no matter how) has a huge impact on speed:

# Items   Naive     Overflow   Index-Map  Memoize
--------------------------------------------------
     10      0.014     0.0048   0.0082      0.0085
    100      0.4016    0.3272   0.0566      0.085
   1000     50.29     26.12     1.15        2.7
  10000   6441.4    5595.2     18.2       151.6

(Measurements with N items (Integers) in reference list and twice as many in list to be sorted; each sort repeated a number of times (5-10000), showing the average execution time in milliseconds.)