Six years ago, I burned several days trying to hunt down why my perfectly deterministic framework was responding randomly. After meticulously checking that the entire framework was using the same instance of Random, I kept chasing by single-stepping through the code. It was highly repetitive, iterative, self-calling code. Worse, the damn effect would only show up after a huge number of iterations had completed. After 6+ hours I was finally at my wits' end when I discovered a line in the javadoc for HashSet.iterator() indicating that it doesn't guarantee the order in which it returns elements. I then went through my entire code base and replaced every HashSet with LinkedHashSet. And lo and behold, my framework sprang right back to deterministic life! ARGH!
I have now just experienced this same FREAKIN effect again (at least it only cost me 3 hours this time). For whatever reason, I missed the small detail that HashMap happens to BEHAVE THE SAME WAY for its keySet().
Here's an SO thread on this subject, although the discussion never quite answers my question: Iteration order of HashSet
So, I am curious as to why this might occur. Given that both times I had a huge single-threaded Java application crawling through exactly the same instantiation/insertion space, with exactly the same JVM parameters (multiple runs from the same batch file), on the same computer with almost nothing else running, what could possibly perturb the JVM such that HashSet and HashMap would, after an enormous number of iterations, behave unpredictably (not just inconsistently, as the javadoc warns when it says not to depend upon the order)?
Any ideas around this, either from the source code (the implementation of these classes in java.util) or from your knowledge of the JVM (perhaps some GC effect where internal Java classes get non-zeroed memory when allocating internal memory spaces)?
I've struck this before, where the order wasn't important, but did affect the results.
The multi-threaded nature of the JVM means that repeated runs with exactly the same inputs can be affected by slight timing differences, for example in how long it takes to allocate a new block of memory. Sometimes that allocation requires paging the previous contents out to disk, and at other times it doesn't; meanwhile some other thread not using that page may proceed. Once system threads are taken into account, you can end up with a different order of object creation between runs.
That can affect the Object.hashCode() result for an equivalent object in different runs of the JVM.
For me, I decided to accept the small overhead of using a LinkedHashMap, in order to be able to reproduce the results of the tests I was running.
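To make that concrete, here's a minimal sketch (class name and sample objects are mine) of why a linked collection restores reproducibility: LinkedHashSet/LinkedHashMap iterate in insertion order by contract, so the varying identity hash codes no longer matter.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashSet;
import java.util.Set;

public class ReproducibleOrder {
    public static void main(String[] args) {
        // Keys with no hashCode() override fall back to the identity hash,
        // which can differ from one JVM run to the next.
        Object a = new Object(), b = new Object(), c = new Object();

        // A LinkedHashSet iterates in insertion order regardless of the
        // elements' hash codes, so every run produces the same sequence.
        Set<Object> linked = new LinkedHashSet<>(Arrays.asList(a, b, c));
        System.out.println(new ArrayList<>(linked).equals(Arrays.asList(a, b, c))); // true
    }
}
```

A HashSet built from the same three objects carries no such guarantee, which is exactly the trap described above.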
Short Answer
There's a tradeoff. If you want amortized constant-time O(1) access to elements, the known techniques rely on a randomized scheme like hashing. If you want ordered access to elements, the best engineering tradeoff gives you only O(log n) performance. For your case, perhaps this doesn't matter, but the difference between constant time and logarithmic time becomes significant starting even with relatively small structures.
So yes, you can go look at the code and inspect carefully, but it boils down to a rather practical theoretical fact. Now is a good time to brush the dust off that copy of Cormen (or Googly Bookiness here) that's propping up the drooping corner of your house's foundation and take a look at Chapters 11 (Hash Tables) and 13 (Red-Black Trees). These will fill you in on the JDK's implementation of HashMap and TreeMap, respectively.
Long Answer
You don't want a Map or Set to return ordered lists of keys/members; that's not what they're for. Like the underlying mathematical concepts, maps and sets are not ordered, and in exchange they offer different performance characteristics. The objective of these data structures (as @thejh points out) is efficient amortized insert, contains, and get time, not maintaining an ordering. You can look into how a hashed data structure is maintained to see what the tradeoffs are. Take a look at the Wikipedia entries on Hash Functions and Hash Tables (ironically, note that the Wiki entry for "unordered map" redirects to the latter), or a computer science / data structures text.
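To make the tradeoff concrete, here's a small sketch (class name and sample data are mine) contrasting the two contracts:

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;
import java.util.TreeSet;

public class Tradeoff {
    public static void main(String[] args) {
        List<String> words = Arrays.asList("pear", "apple", "mango", "fig");

        // Hash table: O(1) amortized add/contains, no ordering contract at all.
        Set<String> hashed = new HashSet<>(words);

        // Red-black tree: O(log n) add/contains, but guaranteed sorted iteration.
        Set<String> sorted = new TreeSet<>(words);

        System.out.println(sorted); // always [apple, fig, mango, pear]
        System.out.println(hashed); // some permutation; the contract promises nothing
    }
}
```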
Remember: don't depend on properties of ADTs (and specifically collections) such as ordering, immutability, thread safety, or anything else, unless you have looked carefully at the contract. Note that for Map, the Javadoc says clearly:
The order of a map is defined as the order in which the iterators on the map's collection views return their elements. Some map implementations, like the TreeMap class, make specific guarantees as to their order; others, like the HashMap class, do not.
And Set.iterator() says much the same:
Returns an iterator over the elements in this set. The elements are returned in no particular order (unless this set is an instance of some class that provides a guarantee).
If you want an ordered view of these, use one of the following approaches:
- If it's just a Set, maybe you really want a SortedSet such as a TreeSet.
- Use a TreeMap, which allows either natural ordering of keys or a specific ordering via a Comparator.
- Abstract your data structure (which is probably application-specific anyway if this is the behavior you want) and maintain both a SortedSet of keys and a Map, which together will perform better in amortized time.
- Get the Map.keySet() (or just the Set you're interested in) and put it into a SortedSet such as TreeSet, using either the natural ordering or a specific Comparator.
- Iterate over the Map.Entry<K,V> entries using Map.entrySet().iterator() after sorting them, e.g. by first copying the map into a TreeMap: for (final Map.Entry<K,V> entry : new TreeMap<K,V>(map).entrySet()) { }, to efficiently access both keys and values.
- If you are only doing this once in a while, you could just get an array of values out of your structure and use Arrays.sort(), which has a different performance profile (in space and time).
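For the TreeMap route above, a minimal sketch (class name and sample data are mine): copying a HashMap into a TreeMap sorts by key, so iteration over its entrySet() is deterministic.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.TreeMap;

public class SortedView {
    public static void main(String[] args) {
        Map<String, Integer> map = new HashMap<>();
        map.put("banana", 2);
        map.put("apple", 1);
        map.put("cherry", 3);

        // The TreeMap copy orders entries by the keys' natural ordering,
        // independent of the HashMap's bucket layout.
        for (Map.Entry<String, Integer> entry : new TreeMap<>(map).entrySet()) {
            System.out.println(entry.getKey() + "=" + entry.getValue());
        }
        // prints apple=1, banana=2, cherry=3 on every run
    }
}
```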
Links to the Source
If you would like to look at the source for j.u.HashSet and j.u.HashMap, they are available on GrepCode. Note that a HashSet is just sugar for a HashMap. Why not always use the sorted versions? Well, as I allude above, the performance differs and that matters in some applications. See the related SO question here. You can also see some concrete performance numbers at the bottom here (I haven't looked closely to verify these are accurate, but they happen to substantiate my point, so I'll blithely pass along the link. :-)
http://download.oracle.com/javase/1.4.2/docs/api/java/lang/Object.html#hashCode() says:
As much as is reasonably practical, the hashCode method defined by class Object does return distinct integers for distinct objects. (This is typically implemented by converting the internal address of the object into an integer, but this implementation technique is not required by the Java programming language.)
So maybe the internal address changes?
This also means that you could probably fix it without giving up speed by writing your own hashCode() method for everything that should act as a key.
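A sketch of that idea (the Key class and its fields are hypothetical): a value-based hashCode() depends only on the key's contents, so it is identical across JVM runs, unlike the default identity-based Object.hashCode().

```java
import java.util.Objects;

// Hypothetical key class: equals()/hashCode() computed from field values,
// so two runs of the JVM hash equal keys identically.
final class Key {
    private final String name;
    private final int id;

    Key(String name, int id) {
        this.name = name;
        this.id = id;
    }

    @Override
    public boolean equals(Object o) {
        if (!(o instanceof Key)) return false;
        Key k = (Key) o;
        return id == k.id && name.equals(k.name);
    }

    @Override
    public int hashCode() {
        return Objects.hash(name, id); // depends only on field values
    }
}
```

With keys like this, a HashMap's layout depends only on the inserted values and the map's capacity history, not on object addresses.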
You should NEVER depend on the order of a hash map.
If you want a Map with a deterministic ordering, I suggest you use a SortedMap/SortedSet such as TreeMap/TreeSet, or use LinkedHashMap/LinkedHashSet. I use the latter often, not because the program needs the ordering, but because it's easier to read logs and debug the state of the map; i.e. when you add a key, it goes to the end every time.
You can create two HashMap/HashSet instances with the same elements but get different orders depending on the capacity of the collection. It is possible for subtle differences in how your code runs to trigger a different final bucket size and therefore a different order.
e.g.
import java.util.Arrays;
import java.util.HashSet;

public static void main(String... args) {
    // Same elements, same insertion order, different initial capacities.
    printInts(new HashSet<Integer>(8, 2));
    printInts(new HashSet<Integer>(16, 1));
    printInts(new HashSet<Integer>(32, 1));
    printInts(new HashSet<Integer>(64, 1));
}

private static void printInts(HashSet<Integer> integers) {
    integers.addAll(Arrays.asList(0, 10, 20, 30, 40, 50, 60, 70, 80, 90, 100));
    System.out.println(integers);
}
prints
[0, 50, 100, 70, 40, 10, 80, 20, 90, 60, 30]
[0, 50, 100, 70, 80, 20, 40, 10, 90, 60, 30]
[0, 100, 70, 40, 10, 50, 80, 20, 90, 60, 30]
[0, 70, 10, 80, 20, 90, 30, 100, 40, 50, 60]
Here you have HashSets with the same values added in the same order, resulting in different iteration orders. You may not be playing with the constructor, but your application could cause a different bucket size indirectly.