This post is divided into two major sections. The first section introduces the original test case and its results, along with my thoughts about them. The second section details the modified test cases and their results.
The original title of this topic was "Full iteration over an array significantly faster than with a linked list". The title was changed due to the newer test results (presented in Section Two).
SECTION ONE: The Original Test Case
For a full one-directional sequential traversal, it's known that the linked list and the array have similar performance, but due to the cache-friendliness (locality of reference) of the contiguous array, the array may perform (slightly) better. To see how this works in practice (Android, Java), I examined the above claims and made some measurements.
First of all, my naive assumptions. Let's take a look at the following class:
private static class Message {
    public int value1;
    public Message next;

    public Message(java.util.Random r, Message nextmsg) {
        value1 = r.nextInt();
        next = nextmsg;
    }
}
In the first measurement scenario, its next field will not be used at all. The code below creates an array of 1,000,000 Message instances, and then iterates over the array in a loop. It measures how much time the iteration takes.
Log.i("TEST", "Preparing...");

final int cnt = 1000000;
int val = 0;
java.util.Random r = new java.util.Random();
Message[] messages = new Message[cnt];
for (int i = 0; i < cnt; i++) {
    messages[i] = new Message(r, null);
}

Log.i("TEST", "Starting iterating...");
long start = SystemClock.uptimeMillis();
for (int i = 0; i < cnt; i++) {
    Message msg = messages[i];
    if (msg.value1 > 564645) {
        val++;
    }
}
Log.w("TEST", "Time: " + (SystemClock.uptimeMillis() - start) + ", result: " + val);
The second measurement builds and measures a linked list of Message objects instead:
Log.i("TEST", "Preparing...");

final int cnt = 1000000;
int val = 0;
java.util.Random r = new java.util.Random();
Message current = null;
Message previous = null;
for (int i = 0; i < cnt; i++) {
    current = new Message(r, previous);
    previous = current;
}
previous = null;

Log.i("TEST", "Starting iterating...");
long start = SystemClock.uptimeMillis();
while (current != null) {
    if (current.value1 > 564645) {
        val++;
    }
    current = current.next;
}
Log.w("TEST", "Time: " + (SystemClock.uptimeMillis() - start) + ", result: " + val);
The first test consistently produces 41-44 ms, while the second test gives 80-85 ms. The linked list iteration seems to be 100% slower.
My (possibly flawed) train of thought and the issues I see are as follows. I'll welcome (in fact, encourage) any corrections.
OK, we can often read that an array is a contiguous memory block, and thus accessing its elements sequentially is more cache-friendly than in the case of a linked list. But in our case, the elements of the array are only object references, not the Message objects themselves (in Java, we don't have value types, i.e. structs as in C#, that we could store embedded in an array). Therefore, the "locality of reference" only applies to the array elements themselves, and these only specify the addresses of objects. Consequently, the Message instances (in general) could still be "anywhere" in memory, so the "locality of reference" doesn't apply to the instances themselves. From this point, it looks like we are in the same situation as with a linked list: the instances themselves may reside "anywhere" in memory; the array only guarantees that their references are stored in a contiguous block...

...and here comes the use case: complete sequential traversal (iteration). First, let's examine how we get the references to the instances in each case. In the case of the array, it's very efficient, because they're in a contiguous block. But in the case of the linked list, we're also good, because once we've accessed a Message instance (that's why we iterate!), we immediately have the reference to the next instance. And since we've already accessed a field of that Message, accessing another field (next) should be supported by the cache (the fields of the same object also have locality of reference AFAIK; they're in a contiguous block, too). To sum up, it seems to break down to this:
- The array offers a cache-friendly iteration over references. But the Message instances themselves may be "anywhere" in memory, and we need to visit these, too.
- The linked list offers that the reference to the next element is obtained when the current Message instance is accessed. This is "free", because each Message instance must be visited anyway (just like in the array case).
So, based on the above, it looks like the array is not better than the linked list. The only exception is when the array is of primitive type (but in such a case, it wouldn't make sense to compare it with a linked list). So I'd expect them to perform similarly, but they didn't: there was a huge difference. In fact, if we assume that array indexing requires a range check each time an element is accessed, the linked list (in theory) could even be faster. (The range check of the array access is probably optimized away by the JIT, so I understand that this is not a valid point.)
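To make the primitive-type exception concrete: with an int[], the values themselves sit in the contiguous block, so iteration involves no pointer dereference at all. A minimal plain-Java sketch of that case (class and method names are mine, and it uses System.nanoTime instead of Android's SystemClock; it's just an illustration, not the original harness):

```java
import java.util.Random;

public class PrimitiveArraySketch {
    // Same counting loop as the test cases above, but over a primitive array:
    // the int values are stored inline in the array, contiguously.
    static int countAbove(int[] values, int threshold) {
        int val = 0;
        for (int i = 0; i < values.length; i++) {
            if (values[i] > threshold) {
                val++;
            }
        }
        return val;
    }

    public static void main(String[] args) {
        final int cnt = 1000000;
        Random r = new Random();
        int[] values = new int[cnt];
        for (int i = 0; i < cnt; i++) {
            values[i] = r.nextInt();
        }
        long start = System.nanoTime();
        int val = countAbove(values, 564645);
        System.out.println("Time: " + (System.nanoTime() - start) / 1000000
                + " ms, result: " + val);
    }
}
```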
My guesses are the following:
- It's probably not the array's cache-friendliness that is responsible for the 100% difference. Instead, the JIT performs optimizations that can't be done in the case of the linked list traversal. If the range check and the (VM-level) null check are eliminated, then I guess the "array-get" bytecode instruction might be faster than the "field-get" instruction (or whatever it's called) in the linked list case (?).
- Even though the Message instances could be "anywhere" in memory, they are probably very close to each other, because they were allocated "at the same time". But 1,000,000 instances can't all be cached, only a part of them. In such a case, sequential access would be cache-friendly both in the array and in the linked list case, so this doesn't explain the difference.
- Some intelligent "prediction" (prefetch) of the Message instance I'll access? I.e. somehow there is still cache-friendliness with the Message instances themselves, but only in the case of array access.
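The second guess could be probed directly: allocate the nodes in order as before, but link them in a shuffled order, so that traversal order no longer matches allocation order. If allocation-order locality is what makes the list fast, the shuffled list should traverse noticeably slower. A hedged plain-Java sketch of that experiment (all names and the harness are mine, not from the original test):

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Random;

public class ShuffledListSketch {
    static class Message {
        int value1;
        Message next;
        Message(Random r) { value1 = r.nextInt(); }
    }

    // Re-links the given nodes into a chain that follows 'order'.
    static Message link(List<Message> nodes, List<Integer> order) {
        Message first = nodes.get(order.get(0));
        Message prev = first;
        for (int i = 1; i < order.size(); i++) {
            prev.next = nodes.get(order.get(i));
            prev = prev.next;
        }
        prev.next = null;
        return first;
    }

    static int countAbove(Message first, int threshold) {
        int val = 0;
        for (Message m = first; m != null; m = m.next) {
            if (m.value1 > threshold) val++;
        }
        return val;
    }

    public static void main(String[] args) {
        final int cnt = 1000000;
        Random r = new Random();
        List<Message> nodes = new ArrayList<>(cnt);
        List<Integer> order = new ArrayList<>(cnt);
        for (int i = 0; i < cnt; i++) {
            nodes.add(new Message(r)); // allocation order: 0..cnt-1
            order.add(i);
        }

        // Traversal order == allocation order.
        Message first = link(nodes, order);
        long t0 = System.nanoTime();
        int a = countAbove(first, 564645);
        System.out.println("in-order: " + (System.nanoTime() - t0) / 1000000
                + " ms, result " + a);

        // Traversal order is a random permutation of allocation order.
        Collections.shuffle(order, r);
        first = link(nodes, order);
        long t1 = System.nanoTime();
        int b = countAbove(first, 564645);
        System.out.println("shuffled: " + (System.nanoTime() - t1) / 1000000
                + " ms, result " + b);
    }
}
```

Both traversals visit the same nodes and produce the same count; only the order of memory accesses differs.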
UPDATE: Since several comments were received, I'd like to react to them below.
@irreputable:
the linked list is visited from high address to low address. what if it's the other way around, i.e. next points to a newer object, not a previous object
Very good spot! I didn't think of this little detail that the layout may influence the test. I'll test it today and will return with the results. (Edit: the results are here; I've updated this post with Section Two.)
@Torben comments:
Also I would say that this whole exercise seems pretty useless. You are talking about 4ms improvement over 100000 iterations. Seems like premature optimization. If you have a situation where this is a bottleneck, then please describe it and we can look into it (because it definitely would be a more interesting problem than this).
If it's not interesting to you, then you can disregard this topic (instead of posting four times). As for your baseless assumption of "premature optimization": I'm afraid you read too much SO and do too little industrial-level development. The concrete situation is in simulation-related software which might have to traverse these lists several times per second. Indeed, 120+ ms latency may affect the responsiveness of an app.
I appreciate the thought you put into this, but I really can't find a question from your post. :) Edit: And array iteration is 50% faster. 100% faster would mean zero time consumed.
I'm sure it was pretty obvious from my post that I'm wondering why the very significant difference is present, when the arguments would imply otherwise. Thanks for the correction: indeed, I meant to write that the linked list case is 100% slower.
SECTION TWO: The Modified Test Cases
irreputable had a very interesting observation for me:
the linked list is visited from high address to low address. what if it's the other way around, i.e. next points to a newer object, not a previous object
I changed the linked list structure so that the direction of its next pointers matches the instantiation order of its nodes:
Message current = null;
Message previous = new Message(r, null);
Message first = previous;
for (int i = 0; i < cnt; i++) {
    current = new Message(r, null);
    previous.next = current;
    previous = current;
}
previous = current = null;
(Note that the creation algorithm might not be the most compact one; I think I knew a slightly nicer way.) The code that iterates through this linked list:
while (first != null) {
    if (first.value1 > 564645) {
        val++;
    }
    first = first.next;
}
And now the result I get is consistently 37-39 ms (OK, we could say it's exactly the array's performance, but actually it's slightly faster, consistently, in every test case). Compared with the 80 ms of the "reversed-direction" linked list, it's twice as fast!
Then I made a similar test with the original array test case as well: I changed the array traversal to opposite direction (to a count-down loop):
for (int i = cnt - 1; i >= 0; i--) {
    Message msg = messages[i];
    if (msg.value1 > 564645) {
        val++;
    }
}
And the result is consistently 85-90 ms! The original test case produced 40-41 ms.
It seems there are two new conclusions (and one question) now:
The original claim seems to be true that the array's "locality of reference" (due to the contiguous memory block) doesn't provide an advantage over linked lists in the case of "reference-type" (i.e. object) arrays. This is because object arrays only hold references to the object instances, not the object instances themselves (which can, in theory, be "anywhere" in memory, just as in the linked list case).
In my test cases, the result seems to be dependent on the direction of traversal, even in case of the array scenario (!). How is this possible?
To sum up my test results:
- In "forward" direction traversal, the linked list slightly outperforms the array traversal (exactly as expected: we have the next reference immediately when a Message instance is obtained, i.e. there is not even a need to access an array element to obtain its address).
- In "backward" direction traversal, both have about 100% weaker performance (and the linked list also slightly outperforms the array).
Any ideas?
UPDATE 1: dlthorpe made very valuable comments. I'll copy them here, as they might help in finding an answer to this "riddle".
Is there any indication that the hardware implements a look-ahead page prefetch in the memory cache controller? Instead of loading only the memory page needed for a memory reference, also load the next higher page in anticipation of a forward progressive read? This would eliminate page load waits for forward progressions through memory but would not eliminate page load waits for reverse progressions through memory.
[..]
I'd suggest testing on radically different hardware. Most mobile devices are running some form of ARM SoC. See if the test cases show similar skew on Intel hardware, like a PC or a Mac. If you can dig up an old PowerPC Mac, even better. If these don't show similar results, then that would point to something unique on the ARM platform or its Java implementation.
[..]
Right, your access patterns are mostly sequential, but in different directions. If something underneath you is doing prefetch but only in one direction (prefetch next higher address block), then that will skew the results in favor of the tests that run in that direction.
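dlthorpe's prefetch hypothesis can be isolated from the object-indirection question by traversing a large primitive array (no references involved at all) in both directions. A plain-Java sketch along those lines (my own harness and names, not the original Android code); if the prefetcher only follows ascending addresses, the backward pass should come out slower:

```java
import java.util.Random;

public class DirectionSketch {
    // Ascending-address traversal.
    static long sumForward(int[] a) {
        long s = 0;
        for (int i = 0; i < a.length; i++) s += a[i];
        return s;
    }

    // Descending-address traversal of the same data.
    static long sumBackward(int[] a) {
        long s = 0;
        for (int i = a.length - 1; i >= 0; i--) s += a[i];
        return s;
    }

    public static void main(String[] args) {
        int[] a = new int[10000000]; // ~40 MB, far larger than any cache level
        Random r = new Random();
        for (int i = 0; i < a.length; i++) a[i] = r.nextInt(100);

        long t0 = System.nanoTime();
        long f = sumForward(a);
        long t1 = System.nanoTime();
        long b = sumBackward(a);
        long t2 = System.nanoTime();

        System.out.println("forward:  " + (t1 - t0) / 1000000 + " ms, sum=" + f);
        System.out.println("backward: " + (t2 - t1) / 1000000 + " ms, sum=" + b);
    }
}
```

Both passes compute the same sum, so any timing difference reflects only the direction of the memory accesses.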
UPDATE 2: I ran the tests on a PC (Core i7 Nehalem architecture from Feb. 2009, 8 GB RAM, Windows 7). I used C#.NET in a .NET 2.0 source project (but .NET 4 is installed on the machine). My results, with 25 million Message instances:
- Linked list: 57-60 ms
- Array: 60-63 ms
The direction of reading didn't seem to influence the results.