Array/Linked list: performance depends on the *dir

2019-03-15 05:39发布

问题:

This post is divided to two major sections. The first section introduces the original test cases & results, and my thoughts about it. The second section details the modified test case, and its results.

The original title of this topic was "Full iteration over an array significantly faster than with a linked list". The title was changed due to the newer test results (presented in Section Two).


SECTION ONE: The Original Test Case

For a full one-directional sequential traversal, it's known that the linked list and the array have similar performance, but due to the cache-friendliness (reference locality) of the contiguous array, it may perform (slightly) better. To see how it works in the practice (Android, Java), I examined the above claims, and made some measurements.

First of all, my naive assumptions. Let's take a look at the following class:

private static class Message {
    public int value1;
    public Message next;

    public Message(java.util.Random r, Message nextmsg) {
        value1 = r.nextInt();
        next = nextmsg;
    }
}

In the first measurement scenario, its next field will not be used at all. The below code creates an array of 1,000,000 Message instances, and then iterates over the array in a loop. It measures how much time the iteration takes.

Log.i("TEST", "Preparing...");          

final int cnt = 1000000;
int val = 0;
java.util.Random r = new java.util.Random();
Message[] messages = new Message[cnt];
for (int i = 0; i < cnt; i++) {
    messages[i] = new Message(r, null);
}           

Log.i("TEST", "Starting iterating...");
long start = SystemClock.uptimeMillis();

for (int i = 0; i < cnt; i++) {
    Message msg = messages[i];
    if (msg.value1 > 564645) {
        val++;
    }
}       
Log.w("TEST", "Time: " + (SystemClock.uptimeMillis() - start) + ", result: " + val);

The second measurement builds and measures a linked list of Message objects instead:

Log.i("TEST", "Preparing...");          

final int cnt = 1000000;
int val = 0;
java.util.Random r = new java.util.Random();
Message current = null;
Message previous = null;
for (int i = 0; i < cnt; i++) {
    current = new Message(r, previous);
    previous = current;
}
previous = null;

Log.i("TEST", "Starting iterating...");
long start = SystemClock.uptimeMillis();
while (current != null) {
    if (current.value1 > 564645) {
        val++;
    }
    current = current.next;
}           

Log.w("TEST","Time: " + (SystemClock.uptimeMillis() - start) + ", result: " + val);

The first test constantly produces 41-44 ms, while the second test gives 80-85 ms. The linked list iteration seems to be 100% slower.

My (possibly flawed) train of thought and issues is as follows. I'll welcome (in fact, encourage) any corrections.

OK, we can often read that an array is a contiguous memory block, and thus accessing its elements sequentially is more cache-friendly than in case of a linked list. But in our case, the elements of the array are only object references, and not Message objects themselves (in Java, we don't have value type i.e. struct as in C# that we could store embedded in an array). Therefore, the "locality of reference" only applies to array elements themselves, and these only specify the address of objects. Consequently, the Message instances (in general) could still be "anywhere" in the memory, so the "locality of reference" doesn't apply to the instances themselves. From this point, it looks like we are the same as in case of a linked list: the instances themselves may reside "anywhere" in memory: the array only guarantees that their references are stored in a contiguous block...

...and here comes the use-case: complete sequential traversal (iteration). First, let's examine how we get the references to instances in each case. In case of the array, it's very efficient, because they're in a contiguous block. But in case of the linked list, we're also good, because once we've accessed a Message instance (that's why we iterate!), we immediately have the reference to the next instance. And since we've accessed a field of Message already, accessing another field (the "next") should be supported by the cache (the fields of the same object also have a locality of references AFAIK, they're in a contiguous block, too). To sum up, it seems to break down to this:

  1. The array offers a cache-friendly iteration over references. Message instances themselves may be "anywhere" in memory, and we need to visit these, too.
  2. The linked list offers that the reference to the next element is obtained when the current Message instance is accessed. This is "free", because each Message instance must be visited anyway (just like in the array case).

So, based on the above, it looks like the array is not better than the linked list. The only exception is when the array is of primitive type (but in such a case, it wouldn't make sense to compare it with a linked list). So I'd expect that they perform similarly, but they didn't, as there was a huge difference. In fact, if we assume that the array indexing requires a range check each time an element is accessed, the linked list (in theory) could be faster, even. (The range check of the array access is probably optimized away by the JIT, so I understand that this is not a valid point.)

My guesses are the following:

  1. It's probably not the array's cache-friendliness that is responsible for the 100% difference. Instead, the JIT performs optimizations that can't be done in case of the linked list traversal. If the range check and (VM-level) null check are eliminated, then I guess the "array-get" bytecode instruction might be faster than my "field-get" (or whatever it's called) instruction in the linked list case (?).

  2. Even though the Message instances could be "anywhere" in memory, they are probably very close to each other, because they were allocated "at the same time". But 1,000,000 instances can't be cached, only a part of them. In such a case, sequential access would be cache-friendly both in the array and in the linked list case, so this doesn't explain the difference.

  3. Some intelligent "prediction" (prefetch) of the Message instance I'll access? I.e. somehow there is still cache-friendliness with the Message instances themselves, but only in case of the array access.

UPDATE: Since several comments were received, I'd like to react to them below.

@irreputable:

the linked list is visited from high address to low address. what if it's the other way around, i.e. next points to a newer object, not a previous object

Very good spot! I didn't think to this little detail that the layout may influence the test. I'll test it today and will return with the results. (Edit: results are here, I've updated this post with "Section 2").

@Torben comments:

Also I would say that this whole exercise seems pretty useless. You are talking about 4ms improvement over 100000 iterations. Seems like premature optimization. If you have a situation where this is a bottleneck, then please describe it and we can look into it (because it definitely would be a more interesting problem than this).

If it's not interesting to you, then you can disregard this topic (instead of posting 4 times). About your baseless assumption of "premature optimization" -- I'm afraid you read too much SO and perform too little industrial-level development. The concrete situation is in a simulation-related software which might have to traverse these lists several times per second. Indeed, 120+ ms latency may affect the responsiveness of an app.

I appreciate the thought you put into this, but I really can't find a question from your post. :) Edit: And array iteration is 50% faster. 100% faster would mean zero time consumed.

I'm sure it was pretty obvious from my post that I'm wondering why is the very significant difference present, when the arguments would imply otherwise. Thanks for the correction: indeed, I wanted to write that the linked list case is 100% slower.


SECTION TWO: The Modified Test Cases

irreputable had a very interesting observation for me:

the linked list is visited from high address to low address. what if it's the other way around, i.e. next points to a newer object, not a previous object

I changed the linked list structure such that the direction of its next pointers is equal to the instantiation order of its nodes:

Message current = null;
Message previous = new Message(r, null);
Message first = previous;
for (int i = 0; i < cnt; i++) {
    current = new Message(r, null);
    previous.next = current;
    previous = current;
}       
previous = current = null;

(Note that the creation algorithm might not be the most compact one, I think I knew a slightly nicer way.) The code that iterates through this linked list:

while (first != null) {
    if (first.value1 > 564645) {
        val++;
    }
    first = first.next;
}

And now the result I get is constantly 37-39 ms (OK, we can say it's exactly the array's performance, but actually, it's slightly faster in every test case, constantly.) Instead of the 80 ms of the "reversed-direction" linked list, it's twice as fast!

Then I made a similar test with the original array test case as well: I changed the array traversal to opposite direction (to a count-down loop):

for (int i = cnt - 1; i >= 0; i--) {
    Message msg = messages[i];
    if (msg.value1 > 564645) {
        val++;
    }
}

And the result is constantly 85-90 ms! The original test case produced 40-41 ms.

It seems there are two new conclusions (and one question) now:

  1. The original claim seems to be true that the array's "locality of reference" (due to the contigous memory block) doesn't provide an advantage in case of "reference-type" (i.e. object) arrays when they're compared with linked lists. This is because the object arrays only hold references to object instances, not the object instances themselves (which can be, in theory, "anywhere" in memory, just like in case of the linked list).

  2. In my test cases, the result seems to be dependent on the direction of traversal, even in case of the array scenario (!). How is this possible?

To sum up my test results:

  1. In "forward" direction traversal, the linked list slightly outperforms the array traversal (exactly as expected: we have the next reference immediately when a Message instance is obtained, i.e. even there is no need to access an array element for obtaining its address).

  2. In "backward" direction traversal, both have about 100% weaker performance (and the linked list also slightly outperforms the array).

Any ideas?

UPDATE 1: dlthorpe made very valuable comments. I'll copy them here, as they might help in finding an answer to this "riddle".

Is there any indication that the hardware implements a look-ahead page prefetch in the memory cache controller? Instead of loading only the memory page needed for a memory reference, also load the next higher page in anticipation of a forward progressive read? This would eliminate page load waits for forward progressions through memory but would not eliminate page load waits for reverse progressions through memory.

[..]

I'd suggest testing on radically different hardware. Most mobile devices are running some form of ARM SoC. See if the test cases show similar skew on Intel hardware, like a PC or a Mac. If you can dig up an old PowerPC Mac, even better. If these don't show similar results, then that would point to something unique on the ARM platform or its Java implementation.

[..]

Right, your access patterns are mostly sequential, but in different directions. If something underneath you is doing prefetch but only in one direction (prefetch next higher address block), then that will skew the results in favor of the tests that run in that direction.

UPDATE 2: I ran the tests on a PC (Core i7 Nehalem architecture from Feb. 2009, 8 GB RAM, Windows 7). I used C#.NET in a .NET 2.0 source code project (but the .NET 4 is installed on the machine). My results, with 25 million Message instances:

  • Linked list: 57-60 ms
  • Array: 60-63 ms

The direction of reading didn't seem to influence the results.

回答1:

Talking about PC hardware, early hardware prefetchers (say circa 2005) were better at detecting and prefetching forward accesses, but more recent hardware should be good at detecting both directions. If you are interested in mobile hardware, it is totally possible that it still implements basic forward-only prefetching.

Outside of proper prefetch implemented in the MMU, which actually detects access patterns, it is very common for hardware to get more than one cache line when a cache miss occurs. Often this takes the form of simply getting next cache line, in addition to the required one, when a miss occurs. This implementation would give the forward direction a big advantage by effectively halving the cache miss rate in that case (this assumes prefetching is ineffective).

Locally, on Core i7, I get slightly better results for the linked list version at ~3.3 ms for the whole iteration, vs 3.5 ms for the array version - when using the original program (which iterates the link list in reverse order of creation). So I don't see the same effect you did.

The inner loop for your test, checking the value of val, has a big impact. The current loop will cause a lot of mispredicts, unless the JIT compiler is smart enough to use CMOV or something similar. It seems that in my test, it was - since I got about 1 ns / iteration for small iteration counts that fit in L1. 1 ns (about 3 cycles) isn't consistent with a full branch mis-predict. When I changed it to do an unconditional val += msg.value1, the array version got a significant boost, even in 1,000,000 iteration case (which won't even fit in L3, probably).

Interestingly enough, the same transformation (val += msg.value1) made the linked list version slightly slower. With the transformation, the array version was considerably faster at small iteration counts (inside L2, and the two approaches were comparable outside). From caliper:

  length method         ns linear runtime
     100  ARRAY       63.7 =
     100 LINKED      190.1 =
    1000  ARRAY      725.7 =
    1000 LINKED     1788.5 =
 1000000  ARRAY  2904083.2 ===
 1000000 LINKED  3043820.4 ===
10000000  ARRAY 23160128.5 ==========================
10000000 LINKED 25748352.0 ==============================

The behavior for small iteration counts is easier to explain - the linked list, which has to use pointer chasing, has a data dependency between each iteration of the loop. That is, each iteration depends on the previous, because the address to load comes from the previous element. The array doesn't have this same data dependency - only the increment of i is dependent, and this is very fast (i is certainly in a register here). So the loop can be much better pipelined in the array case.



回答2:

I don't know the answer, but I would start with looking at the size of the generated bytecode. Since in the array case, the number of iterations is known (cnt is hardcoded and final), the compiler may have inlined some iterations, saving on the jump and comparisons instructions.

Also, if you know the basics of how a program works at the low-level layers, looking at the disassembled bytecode might give you some hints. Even if you are not fluent with assembler languages, it is not too hard to understand a simple program like yours (I was surprised at how much I could figure out the first time I saw some disassembled java code).

Hope this helps.