This question is a companion to this one, which was about the fastest way to sort a double array. Now I want to get the top-k indices corresponding to the unsorted array. I have implemented the following version, which (unfortunately) uses autoboxing and a HashMap, as proposed in some answers including this one:
HashMap<Double, Integer> map = new HashMap<Double, Integer>();
for (int i = 0; i < numClusters; i++) {
    map.put(scores[i], i);
}
Arrays.sort(scores);

// k = how many top entries I want; read the k largest scores from the end of the sorted array
HashSet<Integer> topPossibleClusters = new HashSet<Integer>();
for (int i = 0; i < k; i++) {
    topPossibleClusters.add(map.get(scores[numClusters - (i + 1)]));
}
As you can see, this uses a HashMap whose keys are the Double values of the original array and whose values are the indices into the original array. So, after sorting the original array, I just look each value up in the map to recover its index.
I also use a HashSet because I need to decide whether an int is included in this set, using its .contains() method. (I don't know if this makes a difference since, as I mentioned in the other question, my arrays are small, around 50 elements.) If it does not make a difference, please point that out anyway.
I am not interested in the values per se, only in the indices. My question is whether there is a faster approach for this.
This sort of interlinking/interlocking of collections lends itself to fragile, easily broken, hard-to-debug, unmaintainable code. Instead, create an object:
class Data {
    double value;
    int originalIndex;
}
Create an array of Data objects storing the original values and indices. Sort them using a custom comparator that compares data.value and sorts descending. Now the top X items in your array are the ones you want, and you can just read value and originalIndex from each as you need them.
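A minimal sketch of that idea, assuming a scores array as in the question; the constructor and the lambda comparator are illustrative details, not part of the answer:

class Data {
    final double value;
    final int originalIndex;

    Data(double value, int originalIndex) {
        this.value = value;
        this.originalIndex = originalIndex;
    }
}

// Wrap each element with its index, then sort descending by value
// (java.util.Arrays assumed imported).
Data[] items = new Data[scores.length];
for (int i = 0; i < scores.length; i++) {
    items[i] = new Data(scores[i], i);
}
Arrays.sort(items, (a, b) -> Double.compare(b.value, a.value));
// items[0..k-1] now hold the k largest values together with their original indices.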
As Tim points out, linking multiple collections is rather error-prone. I would suggest using a TreeMap, as this allows for a standalone solution. Let's say you have double[] data; first copy it into a TreeMap:
final TreeMap<Double, Integer> dataWithIndex = new TreeMap<>();
for (int i = 0; i < data.length; ++i) {
    dataWithIndex.put(data[i], i);  // key = value, value = original index
}
N.B. You can declare dataWithIndex as a NavigableMap to be less specific, but the name is longer and it doesn't really add much, as TreeMap is the only non-concurrent implementation in the JDK.
This will populate the Map in O(n lg n) time, as each put is O(lg n), which is the same complexity as sorting. In reality it will probably be a little slower, but it will scale in the same way.
Now, say you need the first k elements; you first need to find the kth key, which is O(k):
final Iterator<Double> keyIter = dataWithIndex.keySet().iterator();
double kthKey = Double.NaN;  // initialised so the compiler sees it as definitely assigned
for (int i = 0; i < k; ++i) {
    kthKey = keyIter.next();
}
Now you just need to get the sub-map that holds all the entries up to the kth entry:
final Map<Double, Integer> topK = dataWithIndex.headMap(kthKey, true);
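Since the question only needs the indices, one way (illustrative, not part of the answer) is to copy them out of that sub-map's values:

// Collect just the original indices from the sub-map (java.util.Set/HashSet assumed imported).
Set<Integer> topIndices = new HashSet<>(topK.values());
// topIndices.contains(i) then tells you whether index i is among those first k entries.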
If you only need to do this once, then with Java 8 you can do something like this:
// Assumes static imports of Comparator.comparing and Collectors.toList,
// plus java.util.AbstractMap.SimpleEntry and java.util.Map.Entry.
List<Entry<Double, Integer>> topK = IntStream.range(0, data.length).
        mapToObj(i -> new SimpleEntry<>(data[i], i)).
        sorted(comparing(Entry::getKey)).
        limit(k).
        collect(toList());
i.e. take an IntStream over the indices of data and mapToObj each index to an Entry mapping data[i] => i (using the AbstractMap.SimpleEntry implementation). Now sort that by Entry::getKey and limit the size of the Stream to k entries, then simply collect the result into a List. This has the advantage of not clobbering duplicate values in the data array (the TreeMap approach overwrites the stored index whenever two values are equal).
It is almost exactly what Tim suggests in his answer, but using an existing JDK class.
This method is also O(n lg n). The catch is that if the TreeMap approach is reused, it costs O(n lg n) to build the Map but only O(k) per reuse. If you want to use the Java 8 solution with reuse, then you can do:
List<Entry<Double, Integer>> sorted = IntStream.range(0, data.length).
        mapToObj(i -> new SimpleEntry<>(data[i], i)).
        sorted(comparing(Entry::getKey)).
        collect(toList());
i.e. don't limit the size of the Stream to k elements. Now, to get the first k elements, you just need to do:
List<Entry<Double, Integer>> subList = sorted.subList(0, k);
The magic of this is that it's O(1), because subList only returns a view of the already-sorted list.
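If only the indices are wanted, as in the question, a small follow-up sketch (assuming the same imports as the stream example above):

// Illustrative: read the original indices off the k entries.
List<Integer> topIndices = subList.stream()
        .map(Entry::getValue)
        .collect(toList());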