I am writing a multi-threaded application in Java in order to improve performance over the sequential version. It is a parallel version of the dynamic programming solution to the 0/1 knapsack problem. I have an Intel Core 2 Duo with both Ubuntu and Windows 7 Professional on different partitions. I am running in Ubuntu.
My problem is that the parallel version actually takes longer than the sequential version. I am thinking this may be because the threads are all being mapped to the same kernel thread or that they are being allocated to the same core. Is there a way I could ensure that each Java thread maps to a separate core?
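As a sanity check (nothing more than a standard java.lang.Runtime call), I can at least confirm how many cores the JVM sees; on a Core 2 Duo it should report 2:

    // how many cores the JVM believes it can schedule threads onto
    int cores = Runtime.getRuntime().availableProcessors();
    System.out.println("Available processors: " + cores);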
I have read other posts about this problem but nothing seems to help.
Here is the end of main() and all of run() for the KnapsackThread class (which extends Thread). Notice that the way I use slice and extra to calculate myLowBound and myHiBound ensures that the threads never overlap in their regions of dynProgMatrix, so there are no race conditions.
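For example, with 4 threads and capacity = 9, dynProgMatrix[0].length is 10, so slice = max(1, floor(10/4)) = 2 and extra = 10 % 4 = 2: threads 0 and 1 each take slice + 1 = 3 columns (0-2 and 3-5), while threads 2 and 3 take slice = 2 columns (6-7 and 8-9), covering all 10 columns exactly once.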
dynProgMatrix = new int[totalItems + 1][capacity + 1];

// base cases: zero items or zero capacity yield zero profit
for (int w = 0; w <= capacity; w++)
    dynProgMatrix[0][w] = 0;
for (int i = 0; i <= totalItems; i++)
    dynProgMatrix[i][0] = 0;

// partition the capacity columns among the threads
slice = Math.max(1,
        (int) Math.floor((double) dynProgMatrix[0].length / threads.length));
extra = dynProgMatrix[0].length % threads.length;
barrier = new CyclicBarrier(threads.length);

for (int i = 0; i < threads.length; i++) {
    threads[i] = new KnapsackThread(Integer.toString(i));
}
for (int i = 0; i < threads.length; i++) {
    threads[i].start();
}
for (int i = 0; i < threads.length; i++) {
    try {
        threads[i].join();
    } catch (InterruptedException e) {
        e.printStackTrace();
    }
}
}
public void run() {
    int myRank = Integer.parseInt(this.getName());
    int myLowBound;
    int myHiBound;
    // the first 'extra' threads each take one extra column
    if (myRank < extra) {
        myLowBound = myRank * (slice + 1);
        myHiBound = myLowBound + slice;
    } else {
        myLowBound = myRank * slice + extra;
        myHiBound = myLowBound + slice - 1;
    }
    if (myHiBound > capacity) {
        myHiBound = capacity;
    }
    for (int i = 1; i <= totalItems; i++) {
        for (int w = myLowBound; w <= myHiBound; w++) {
            if (allItems[i].weight <= w) {
                if (allItems[i].profit + dynProgMatrix[i - 1][w - allItems[i].weight]
                        > dynProgMatrix[i - 1][w]) {
                    dynProgMatrix[i][w] = allItems[i].profit
                            + dynProgMatrix[i - 1][w - allItems[i].weight];
                } else {
                    dynProgMatrix[i][w] = dynProgMatrix[i - 1][w];
                }
            } else {
                dynProgMatrix[i][w] = dynProgMatrix[i - 1][w];
            }
        }
        // now place a barrier to sync up the threads: row i must be
        // complete everywhere before any thread reads it as row i-1
        try {
            barrier.await();
        } catch (InterruptedException ex) {
            ex.printStackTrace();
            return;
        } catch (BrokenBarrierException ex) {
            ex.printStackTrace();
            return;
        }
    }
}
Update:
I have written another version of the knapsack that uses brute force. This version has very little synchronization because I only need to update a bestSoFar variable at the end of a single thread's execution. Therefore, each thread should execute almost entirely in parallel, apart from that small critical section at the end.
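The critical section is essentially a guarded max-update; a minimal sketch of what I mean (lock, localBest, and bestSoFar are illustrative names, not my exact code):

    // each thread folds its private best into the shared result exactly once
    synchronized (lock) {
        if (localBest > bestSoFar) {
            bestSoFar = localBest;
        }
    }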
I ran this against the sequential brute force, and it still takes longer. I can't see any explanation other than that my threads are running sequentially, either because they are all mapped to the same core or to the same native thread.
Does anybody have any insight?
I suggest you look at how long each of your worker threads takes before it terminates. Perhaps one of the threads has a much harder task; if that's the case, the overhead from synchronization and so on will easily eat up whatever you gained from threading.
I doubt it is due to all the threads using the same core. Scheduling is up to the OS, but you can see what's going on by bringing up the OS's performance monitor - it typically shows how busy each core is.
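A minimal sketch of that per-thread timing, assuming you can wrap the existing body of run() (doWork() is just a placeholder for that body):

    public void run() {
        long start = System.nanoTime();
        doWork(); // the existing knapsack loop with the barrier
        long elapsedMs = (System.nanoTime() - start) / 1000000L;
        System.out.println(getName() + " finished in " + elapsedMs + " ms");
    }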
Possible reasons for it taking longer:
- Synchronization overhead: you cross the CyclicBarrier once per item row, and each crossing costs far more than the few array reads and writes a thread does in its slice of that row.
- Thread creation and context-switch overhead that outweighs the small amount of work each thread does.
- False sharing: neighboring threads write adjacent cells of the same row, so their cache lines bounce between the cores.
- The computation may be memory-bound rather than CPU-bound, in which case a second core adds little.
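If you suspect raw thread overhead, you can measure it in isolation; a rough, self-contained sketch (numbers vary widely by machine):

    public class ThreadOverhead {
        public static void main(String[] args) throws InterruptedException {
            long t0 = System.nanoTime();
            Thread t = new Thread(new Runnable() {
                public void run() {
                    // deliberately empty: measures pure start/join cost
                }
            });
            t.start();
            t.join();
            System.out.println("Empty thread start+join: "
                    + (System.nanoTime() - t0) + " ns");
        }
    }

If that round trip is on the same order as the work one of your threads does between barriers, threading cannot pay off.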
I was having the same problem for a while. I had a CPU-hungry program that I divided into two threads (dual-core CPU), but one beautiful day, while processing some more data, it just stopped using both cores. I raised the heap size (-Xmx1536m in my case), and it worked fine again.
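For reference, the flag goes on the java command line, e.g. java -Xmx1536m YourMainClass (the class name is just a placeholder), and you can check from inside the program what ceiling the JVM is actually using:

    // the maximum heap the JVM will attempt to use, in MB
    long maxMb = Runtime.getRuntime().maxMemory() / (1024 * 1024);
    System.out.println("Max heap: " + maxMb + " MB");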