I've got a finite list (or array) of strings. This list shall describe my new order by the following means: A string s1
is less than a string s2
iff one of the next three statements holds
- both are in the list an
s1
has a lower index in the list than s2
- both aren't in the list and
s1
is less than s2
by the natural order on strings
s1
is in the list and s2
is not
I want to use the power of Java 8 to write a one line Comparator. What is the shortest such line? You can assume my list to be an ArrayList<String>
.
Edit: I don't want to order the list I'm talking about. I want to use that list to define an order which allows me to order many other lists.
The shortest Comparator
definition you can get for your task is:
Comparator.comparingInt((String s) -> reference.indexOf(s)+Integer.MIN_VALUE)
.thenComparing(Comparator.naturalOrder())
but it requires the reader to understand the +Integer.MIN_VALUE
idiom in integer comparisons. It basically means “perform an unsigned comparison” (see also Integer.compareUnsigned(…)
), so that the value -1
, returned by List.indexOf
for absent values, is treated as the highest possible number, so that every index of a present value is considered to be smaller than that, to meet your specification.
Then, you can simply chain another comparator, here the natural order, for the case both indices are the same, which includes the possibility that both are absent.
If you think that the unsigned comparison is to hard to grasp for the reader, you have to encode your three cases explicitly, as already shown in tobias_k’s answer.
Use Comparator.comparing
to define the key function for the "is-in-list" comparison, and append a thenComparing
for the lexicographical ordering.
List<String> reference = Arrays.asList("foo", "bar", "blub");
List<String> toBeSorted = Arrays.asList("foo", "ccc", "AAAA", "bbb", "blub");
Collections.sort(toBeSorted, Comparator
.comparingInt((String s) -> reference.contains(s) ? reference.indexOf(s) : Integer.MAX_VALUE)
.thenComparing(Comparator.naturalOrder()));
Or maybe a bit cleaner, using three comparators, for containment, position, and lexicographic:
Collections.sort(toBeSorted, Comparator
.comparing((String s) -> ! reference.contains(s))
.thenComparingInt(reference::indexOf)
.thenComparing(Comparator.naturalOrder()));
(Using just reference::contains
does not work here; seems to see it as a Comparator<Object>
then, instead of Comparator<String>
)
Afterwards, in both cases, toBeSorted
is [foo, blub, AAAA, bbb, ccc]
.
Update: The problem with this approach is that it is pretty wasteful: For each comparison, I first check whether the element is in the list, then iterate the list again to find the index.
You can improve this by using the index + Integer.MIN_VALUE
trick as pointed out by Holger (if you like this, please upvote his answer).
Collections.sort(toBeSorted, Comparator
.comparingInt((String s) -> reference.indexOf(s) + Integer.MIN_VALUE)
.thenComparing(Comparator.naturalOrder()));
This uses the fact that -1 + Integer.MIN_VALUE
produces an integer overflow (or rather underflow), i.e. the result is not Integer.MIN_VALUE - 1
but Integer.MAX_VALUE
. In any other situation 'clever hacks' like this are heavily frowned upon, but here it reduces the complexity of the comparison by 50%.
But if speed is an issue, we can do better than that! Even with Holger's trick, the reference
list is looped multiple times, at least once for each element in the toBeSorted
list (and in fact much more often than that, as the comparator is not caching the values). If the lists are large, then it might be worth creating a hashmap, mapping strings to their position in the reference list, and then using that hashmap (with O(1) lookup time) in the actual sort.
Map<String, Integer> index = IntStream.range(0, reference.size()).boxed()
.collect(Collectors.toMap(reference::get, Function.identity()));
Collections.sort(toBeSorted, Comparator
.comparingInt((String s) -> index.getOrDefault(s, Integer.MAX_VALUE))
.thenComparing(Comparator.naturalOrder()));
So, instead of 2*n*logn times (assuming nlogn comparisons with 2 indexof operations each), this will loop the reference list just once.
Update 2: Instead of manually precalculating the indices for each element in the list, you could also define a generic memoize
function, caching each comparison key when it is first calculated:
public static <X, Y> Function<X, Y> memoize(Function<X, Y> function) {
Map<X, Y> cache = new IdentityHashMap<>();
return (X x) -> cache.computeIfAbsent(x, function);
}
You can use it like Comparator.comparing(memoize(originalKeyFunction))
. You can use this for all sorts of comparisons (I'm wondering why Comparator.comparing
isn't doing this by default).
For longer lists, caching the indices (no matter how) has a huge impact on speed:
# Items Naive Overflow Index-Map Memoize
--------------------------------------------------
10 0.014 0.0048 0.0082 0.0085
100 0.4016 0.3272 0.0566 0.085
1000 50.29 26.12 1.15 2.7
10000 6441.4 5595.2 18.2 151.6
(Measurements with N items (Integers) in reference list and twice as many in list to be sorted; each sort repeated a number of times (5-10000), showing the average execution time in milliseconds.)