What causes the slightly unpredictable ordering of

2019-01-23 17:45发布

问题:

Six years ago, I burned several days trying to hunt down where my perfectly deterministic framework was responding randomly. After meticulously chasing the entire framework ensuring that it was all using the same instance of Random, I then kept chasing by single stepping code. It was highly repetitive iterative self-calling code. Worse, the damn effect would only show up after a huge number of iterations were completed. And after +6 hours, I was finally at wits end when I discovered a line in the javadoc for HashSet.iterator() indicating it doesn't guarantee the order in which it will return elements. I then went through my entire code base and replaced all instances of HashSet with LinkedHashSet. And low-and-behold, my framework sprang right to deterministic life! ARGH!

I have now just experienced this same FREAKIN affect, again (at least it was only 3 hours this time). For whatever reason, I missed the small detail that HashMap happens to BEHAVE THE SAME WAY for its keySet().

Here's an SO thread on this subject, although the discussion never quite answers my question: Iteration order of HashSet

So, I am curious as to why this might occur. Given both times I had a huge single threaded java application crawling through exactly the same instantiation/insertion space with exactly the same JVM parameters (multiple runs from the same batch file) on the same computer with almost nothing else running, what could possibly perturb the JVM such that HashSet and HashMap would, after an enormous number of iterations, behave unpredictably (not inconsistenly as the javadoc says not to depend upon the order)?

Any ideas around this from either the source code (implementation of these classes in java.util) or from your knowledge of the JVM (perhaps some GC affect where internal java classes get non-zeroed memory when allocating internal memory spaces)?

回答1:

I've struck this before, where the order wasn't important, but did affect the results.

The multi-threaded nature of Java means that repeated runs with exactly the same inputs can be affected by slight timing differences in (for example) how long it takes to allocate a new block of memory, which might sometimes require paging out to disk the previous contents, and at other times that isn't needed. Some other thread not using that page may proceed, and you could end up with a different order of object creation, when System objects are taken into account.

That can affect the Object.hashCode() result for the equivalent object in different runs of the JVM.

For me, I decided to add the small overhead of using a LinkedHashMap, in order to be able to reproduce the results of the tests I was running.



回答2:

Short Answer

There's a tradeoff. If you want amortized constant time O(1) access to elements, the techniques to date rely upon a randomized scheme like hashing. If you want ordered access to elements, the best engineering tradeoff gives you only O(ln(n)) performance. For your case, perhaps this doesn't matter, but the difference between constant time and logarithmic time makes a very big difference starting even with relatively small structures.

So yes, you can go look at the code and inspect carefully, but it boils down to a rather practical theoretical fact. Now is a good time to brush the dust off that copy of Cormen (or Googly Bookiness here) that's propping up the drooping corner of your house's foundation and take a look at Chapters 11 (Hash Tables) and 13 (Red-Black Trees). These will fill you in on the JDK's implementation of HashMap and TreeMap, respectively.

Long Answer

You don't want a Map or Set to return ordered lists of keys/members. That's not what they're for. Maps and Sets structures are not ordered just like the underlying mathematical concepts, and they provide different performance. The objective of these data structures (as @thejh points out) is efficient amortized insert, contains, and get time, not maintaining ordering. You can look into how a hashed data structure is maintained to know what the tradeoffs are. Take a look at the Wikipedia entries on Hash Functions and Hash Tables (ironically, note that the Wiki entry for "unordered map" redirects to the latter) or a computer science / data structures text.

Remember: Don't depend on properties of ADTs (and specifically collections) such as ordering, immutability, thread safety or anything else unless you look carefully at what the contract is. Note that for Map, the Javadoc says clearly:

The order of a map is defined as the order in which the iterators on the map's collection views return their elements. Some map implementations, like the TreeMap class, make specific guarantees as to their order; others, like the HashMap class, do not.

And Set.iterator() has the similar:

Returns an iterator over the elements in this set. The elements are returned in no particular order (unless this set is an instance of some class that provides a guarantee).

If you want an ordered view of these, use one of the following approaches:

  • If it's just a Set, maybe you really want a SortedSet such as a TreeSet
  • Use a TreeMap, which allows either natural ordering of keys or a specific ordering via Comparator
  • Abstract your data structure, which probably is an application-specific thing anyway if this is the behavior you want, and maintain both a SortedSet of keys as well as a Map, which will perform better in amortized time.
  • Get the Map.keySet() (or just the Set you're interested in) and put it into a SortedSet such as TreeSet, either using the natural ordering or a specific Comparator.
  • Iterate over the Map.Entry<K,V> using Map.entrySet().iterator(), after it has been sorted. E.g. for (final Map.Entry<K,V> entry : new TreeSet(map.entrySet())) { } to efficiently access both keys and values.
  • If you are only doing this once and awhile, you could just get an array of values out of your structure and use Arrays.sort(), which has a different performance profile (space and time).

Links to the Source

If you would like to look at the source for j.u.HashSet and j.u.HashMap, they are available on GrepCode. Note that a HashSet is just sugar for a HashMap. Why not always use the sorted versions? Well, as I allude above, the performance differs and that matters in some applications. See the related SO question here. You can also see some concrete performance numbers at the bottom here (I haven't looked closely to verify these are accurate, but they happen to substantiate my point, so I'll blithely pass along the link. :-)



回答3:

http://download.oracle.com/javase/1.4.2/docs/api/java/lang/Object.html#hashCode() says:

As much as is reasonably practical, the hashCode method defined by class Object does return distinct integers for distinct objects. (This is typically implemented by converting the internal address of the object into an integer, but this implementation technique is not required by the JavaTM programming language.)

So maybe the internal address changes?

This also means that you could propably fix it without giving up speed by writing your own hashCode() method for everything that should act as a key.



回答4:

You should NEVER depend on the order of a hash map.

If you want a Map with a deterministic ordering, I suggest you use a SortedMap/SortedSet like TreeMap/TreeSet or use LinkedHashMap/LinkedHashSet. I use the later often, not because the program needs the ordering, but because its easier to read logs/debug the state of the map. i.e. when you add a key, it goes to the end every time.

You can create two HashMap/HashSet with the same elements but get different orders depending on the capacity of the collection. It is possible for subtle differences in how your code runs to trigger a different final bucket size and therefor a different order.

e.g.

public static void main(String... args) throws IOException {
    printInts(new HashSet<Integer>(8,2));
    printInts(new HashSet<Integer>(16,1));
    printInts(new HashSet<Integer>(32,1));
    printInts(new HashSet<Integer>(64,1));
}

private static void printInts(HashSet<Integer> integers) {
    integers.addAll(Arrays.asList(0,10,20,30,40,50,60,70,80,90,100));
    System.out.println(integers);
}

prints

[0, 50, 100, 70, 40, 10, 80, 20, 90, 60, 30]
[0, 50, 100, 70, 80, 20, 40, 10, 90, 60, 30]
[0, 100, 70, 40, 10, 50, 80, 20, 90, 60, 30]
[0, 70, 10, 80, 20, 90, 30, 100, 40, 50, 60]

Here you have HashSet(s) with the same values added in the same order resulting in different iterator orders. You may not be playing with the constructor, but your application could cause a different bucket size indirectly.