可以将文章内容翻译成中文,广告屏蔽插件可能会导致该功能失效(如失效，请关闭广告屏蔽插件后再试):

问题:

I'm trying to write a thread-safe sorted single linked list. I wrote two versions: coarse grained synchronization and fine grained synchronization. Here are the two implementations:

Fine grained:

public void add(T t) {                                                         
  Node curr = head;
  curr.lock.lock();

  while (curr.next != null) {
    // Invariant: curr is locked                                               
    // Invariant: curr.data < t                                                
    curr.next.lock.lock();                                                     

    if (t.compareTo(curr.next.data) <= 0) {                                    
      break;                                                                   
    }                                                                          

    Node tmp = curr.next;                                                      
    curr.lock.unlock();                                                        
    curr = tmp;                                                                
  }                                                                            

  // curr is acquired                                                          
  curr.next = new Node(curr.next, t);                                          
  if (curr.next.next != null) {  // old curr's next is acquired                
    curr.next.next.lock.unlock();                                              
  }                                                                            
  curr.lock.unlock();                                                          
}

Coarse grained:

public void add(T t) {
  lock.lock();
  Node curr = head;
  while (curr.next != null) {
    if (t.compareTo(curr.next.data) <= 0) {
      break;
    }                                                                          
    curr = curr.next;                                                          
  }                                                                            
  curr.next = new Node(curr.next, t);                                          
  lock.unlock();                                                               
}

I timed the two version on 4 threads (on 4 logical CPU cores) inserting 20000 integers. The time per thread shows CPU time (i.e. it does not include waiting time).

Fine grained:
Worked 1 spent 1080 ms
Worked 2 spent 1230 ms
Worked 0 spent 1250 ms
Worked 3 spent 1260 ms
wall time: 1620 ms

Coarse grained:
Worked 1 spent 190 ms
Worked 2 spent 270 ms
Worked 3 spent 410 ms
Worked 0 spent 280 ms
wall time: 1298 ms

My initial thought was that .lock() and .unlock() are the problem, but I profiled the implementation and together they consumed only 30% of the time. My second guess is that the fine grained solution has more cache misses, but I doubt it because a single linked list, unlike an array, is inherently prone to cache misses.

Any idea why I don't get the expected parallelization?

回答1:

Yes, that is probably due to cache misses. The cache lines containing the locks are continually bouncing between CPUs.

Also, note that you have gained quite a lot of parallellism:

Fine grained:
Worked 1 spent 1080 ms
Worked 2 spent 1230 ms
Worked 0 spent 1250 ms
Worked 3 spent 1260 ms
wall time: 1620 ms

Coarse grained:
Worked 1 spent 190 ms
Worked 2 spent 270 ms
Worked 3 spent 410 ms
Worked 0 spent 280 ms
wall time: 1298 ms

Although each individual thread takes a lot more time, due to cache misses (and also increased overhead), the entire process is only slightly slower.

回答2:

You could achieve a wall time near that of the per thread coarse grained version by first walking the list with no locks in order to find the gap and then from current, and this time using locks, walk the list to make sure there were no intervening inserts by other threads between current and current->next. (Of course I am discounting the fact that "head" always reigns supreme :)

回答3:

There does appear to a performance problem. I think you should compare you performance with the built-in implementation and with a single threaded version.

for (int r = 0; r < 5; r++) {
    long start = System.nanoTime();
    ConcurrentLinkedQueue<Integer> list = new ConcurrentLinkedQueue<Integer>();
    for (int i = 0; i < 500000; i++)
        list.add(i);
    long time = System.nanoTime() - start;
    System.out.printf("Adding 500K %,d took ms%n", time / 1000 / 1000);
}

prints

Adding 500K 192 took ms
Adding 500K 154 took ms
Adding 500K 95 took ms
Adding 500K 211 took ms
Adding 500K 424 took ms

回答4:

in addition to ninjalj's answer - the fine locks also

disables some compiler optimizations in existing code
disables some CPU optimizations - like prefetching
forces memory acquire semantics on lock, and release semantics on unlock - causing cross-CPU sync and invalidating caches - which does not directly show up as lock() cost in profiler, but increases the cost of following data access.

回答5:

Am I missing something? I don't see any type of caching within your code. Also, you should reconsider the way you are using locking. You should only lock the list as a whole to limit the number of locks and also prevents a race condition as shown below.

thread1: Read Element X
thread2: Removes X + 1
thread1: Read Element X + 1 and fails since the element is no long valid.
thread1: Is unable to finish going through the list since it has been removed.

You can do partitioning the list but you have to handle the case where you are reading the last element in a partition and removing the first element in the next partition.

You could also make only certain functions need to lock/unlock depending on what type of operation is occurring (i.e. it's a read operation and there is not currently a write operation is occurring).