Will multithreading provide any performance boost?

Posted 2020-05-26 16:17

I am new to programming in general, so please keep that in mind when you answer my question.

I have a program that takes a large 3D array (1 billion elements) and sums up elements along the various axes to produce a 2D array, a projection of each side of the data. The problem is that it is very RAM intensive: the program is constantly fetching information from RAM, both reading and writing.

The question is: will I gain any performance increase if I multithread the program, or will I end up running into a RAM access bottleneck? When I say multithreading, I only mean multithreading for 2 or 4 cores, no more.

If it helps, my current computer configuration is a 2.4 GHz Core 2 Quad, 1033 MHz FSB, and 4 GB of RAM at 667 MHz.

Thanks in advance,

-Faken

Edit:

It seems that people here are much more interested in this question than I had first expected. I'll expand the question and post some code for those who are interested.

First of all, a little background on me so you understand where I'm coming from. I am a mechanical engineering graduate student who somehow managed to pick a topic that has pretty much nothing to do with mechanical engineering. I took one course in introductory Java (forced) approximately 5 years ago and had never touched programming until about a month ago, when I started my thesis in earnest. I have also taken (again forced, still don't know why) a course in electronics and computer engineering, where we dealt with 8-bit microcontrollers, their inner workings, and some ASM coding for them. Other than that, I know next to nothing about programming.

Here is the code:

int dim = 1000;
int steps = 7; // ranges from 1 to 255

for (int stage = 1; stage < steps; stage++)
    for (int j = 0; j < dim; j++)
        for (int i = 0; i < dim; i++)
        {
            int sum = 0;
            for (int k = 0; k < dim; k++)
                if (partMap[(((i * dim) + k) * dim) + j] >= stage)
                    sum++;

            projection[(j * dim) + i] = sum;
        }

This section of code operates on the z-axis only. The main data, due to the way it was constructed, has a strange addressing system, but you don't need to worry about that. There is also other code for doing the projections of the other sides of the cube, but it does very different things.

19 answers
Answer 2 · 2020-05-26 16:36

I guess if you are just dealing with bits you might not have to page or use a swap file, and in that case, yes, multithreading will help.

If you can't load everything into memory at once, you need to be more specific about your solution--it needs to be tailored to threading.

For example: Suppose you load your array in smaller blocks (size might not matter much). If you were to load in a 1000x1000x1000 cube, you could sum on that. The results could be stored temporarily in their own three planes, then added to your 3 "final result" planes, and then the 1000^3 block could be thrown away, never to be read again.

If you do something like this, you won't run out of memory, you won't stress the swapfile and you won't have to worry about any thread synchronization except in a few very small, specific areas (if at all).

The only problem then is to ensure your data is in such a format that you can access a single 1000^3 cube directly--without seeking the hard disk head all over the place.

Edit: The comment was correct and I'm wrong--he totally makes sense.

Since yesterday I've realized that the entire problem could be solved as the data is read in: each piece of data read in could immediately be summed into the results and discarded. When I think about it that way, you're right; threading isn't going to be much help unless it can read two streams at the same time without colliding.
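
To make the sum-as-you-read idea concrete, here is a minimal sketch for one face of the cube, under the assumption (mine, not the poster's) that the data is streamed from a file called cube.bin one z-slice at a time; the other two faces would accumulate in the same pass:

#include <fstream>
#include <vector>

int main() {
    const int dim = 1000;
    std::vector<unsigned char> slab(dim * dim); // one slice, reused every pass
    std::vector<long> projXY(dim * dim, 0);     // running totals for one face

    std::ifstream in("cube.bin", std::ios::binary); // hypothetical file layout
    for (int slice = 0; slice < dim; ++slice) {
        // Read one slice, fold it into the projection, then reuse the buffer;
        // the full billion-element cube is never resident in memory at once.
        in.read(reinterpret_cast<char *>(slab.data()),
                static_cast<std::streamsize>(slab.size()));
        for (int n = 0; n < dim * dim; ++n)
            projXY[n] += slab[n];
    }
}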

兄弟一词,经得起流年.
Answer 3 · 2020-05-26 16:38

There is only one way to optimize code: figure out what you're doing that's slow, and do less of it. A special case of "doing less of it" is to do something else instead that's faster.

So first of all, here's what I'm doing based on your posted code:

#include <fstream>
#include <sstream>
using std::ios_base;

template<typename Iterator, typename Value>
void iota(Iterator start, Iterator end, Value val) {
    while (start != end) {
        *(start++) = val++;
    }
}

int main() {

    const int dim = 1000;
    const int cubesize = dim*dim*dim;
    const int squaresize = dim*dim;
    const int steps = 7; //ranges from 1 to  255
    typedef unsigned char uchar;

    uchar *partMap = new uchar[cubesize];
    // dummy data. I timed this separately and it takes about
    // a second, so I won't worry about its effect on overall timings.
    iota(partMap, partMap + cubesize, uchar(7));
    uchar *projection = new uchar[squaresize];

    for (int stage = 1; stage < steps; stage++) {
        for (int j = 0; j < dim; j++) {
                for (int i = 0; i < dim; i++)
                {
                        int sum = 0;
                        for (int k = 0; k < dim; k++)
                            if (partMap[(((i * dim) + k) * dim) + j] >= stage)
                                sum++;

                        projection[(j*dim) + i] = sum;
                }
        }

        std::stringstream filename;
        filename << "results" << stage << ".bin";
        std::ofstream file(filename.str().c_str(), 
            ios_base::out | ios_base::binary | ios_base::trunc);
        file.write((char *)projection, squaresize);
    }

    delete[] projection;
    delete[] partMap;
}

(Edit: just noticed that "projection" should be an array of int, not uchar. My bad. This will make a difference to some of the timings, but hopefully not too big of one.)

Then I copied result*.bin to gold*.bin, so I can check my future changes as follows:

$ make big -B CPPFLAGS="-O3 -pedantic -Wall" && time ./big; for n in 1 2 3 4 5 6; do diff -q results$n.bin gold$n.bin; done
g++  -O3 -pedantic -Wall   big.cpp   -o big

real    1m41.978s
user    1m39.450s
sys     0m0.451s

OK, so 100 seconds at the moment.

So, speculating that it's striding through the billion-item data array that's slow, let's try only going through once, instead of once per stage:

    uchar *projections[steps];
    for (int stage = 1; stage < steps; stage++) {
         projections[stage] = new uchar[squaresize];
    }

    for (int j = 0; j < dim; j++) {
            for (int i = 0; i < dim; i++)
            {
                    int counts[256] = {0};
                    for (int k = 0; k < dim; k++)
                            counts[partMap[(((i * dim) + k) * dim) + j]]++;

                    int sum = 0;
                    for (int idx = 255; idx >= steps; --idx) {
                        sum += counts[idx];
                    }
                    for (int stage = steps-1; stage > 0; --stage) {
                        sum += counts[stage];
                        projections[stage][(j*dim) + i] = sum;
                    }
            }
    }

    for (int stage = 1; stage < steps; stage++) {
        std::stringstream filename;
        filename << "results" << stage << ".bin";
        std::ofstream file(filename.str().c_str(),
            ios_base::out | ios_base::binary | ios_base::trunc);
        file.write((char *)projections[stage], squaresize);
    }

    for (int stage = 1; stage < steps; stage++) delete[] projections[stage];
    delete[] partMap;

It's a bit faster:

$ make big -B CPPFLAGS="-O3 -pedantic -Wall" && time ./big; for n in 1 2 3 4 5 6; do diff -q results$n.bin gold$n.bin; done
g++  -O3 -pedantic -Wall   big.cpp   -o big

real    1m15.176s
user    1m13.772s
sys     0m0.841s

Now, steps is quite small in this example, so we're doing a lot of unnecessary work with the "counts" array. Without even profiling, I'm guessing that counting to 256 twice (once to clear the array and once to sum it) is quite significant compared with counting to 1000 (to run along our column). So let's change that:

    for (int j = 0; j < dim; j++) {
            for (int i = 0; i < dim; i++)
            {
                    // steps+1, not steps. I got this wrong the first time,
                    // which at least proved that my diffs work as a check
                    // of the answer...
                    int counts[steps+1] = {0};
                    for (int k = 0; k < dim; k++) {
                        uchar val = partMap[(((i * dim) + k) * dim) + j];
                        if (val >= steps) 
                            counts[steps]++;
                        else counts[val]++;
                    }

                    int sum = counts[steps];
                    for (int stage = steps-1; stage > 0; --stage) {
                        sum += counts[stage];
                        projections[stage][(j*dim) + i] = sum;
                    }
            }
    }

Now we're only using as many buckets as we actually need.

$ make big -B CPPFLAGS="-O3 -pedantic -Wall" && time ./big; for n in 1 2 3 4 5 6; do diff -q results$n.bin gold$n.bin; done
g++  -O3 -pedantic -Wall   big.cpp   -o big

real    0m27.643s
user    0m26.551s
sys     0m0.483s

Hurrah. The code is nearly 4 times as fast as the first version, and produces the same results. All I've done is change the order in which the maths is done: we haven't even looked at multi-threading or prefetching yet. And I haven't attempted any highly technical loop optimisation, just left it to the compiler. So this can be considered a decent start.

However, it's still taking an order of magnitude longer than the 1 s that iota runs in, so there are probably big gains still to find. One main difference is that iota runs over the 1D array in sequential order, instead of leaping about all over the place. As I said in my first answer, you should aim to always use sequential order on the cube.

So, let's make a one-line change, switching the i and j loops:

    for (int i = 0; i < dim; i++)
        for (int j = 0; j < dim; j++) {

This still isn't sequential order, but it does mean we're focussing on one million-byte slice of our cube at a time. A modern CPU has at least 4MB cache, so with a bit of luck we'll only hit main memory for any given part of the cube once in the entire program. With even better locality we could reduce the traffic in and out of L1 cache, too, but main memory is the slowest.
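
For reference, here is the full loop nest after the interchange, reconstructed from the snippets above (only the two outer loops have moved; the body is unchanged):

    for (int i = 0; i < dim; i++) {
            for (int j = 0; j < dim; j++)
            {
                    // steps+1 buckets: values >= steps all land in the last one
                    int counts[steps+1] = {0};
                    for (int k = 0; k < dim; k++) {
                        uchar val = partMap[(((i * dim) + k) * dim) + j];
                        if (val >= steps)
                            counts[steps]++;
                        else counts[val]++;
                    }

                    int sum = counts[steps];
                    for (int stage = steps-1; stage > 0; --stage) {
                        sum += counts[stage];
                        projections[stage][(j*dim) + i] = sum;
                    }
            }
    }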

How much difference does it make?

$ make big -B CPPFLAGS="-O3 -pedantic -Wall" && time ./big; for n in 1 2 3 4 5 6; do diff -q results$n.bin gold$n.bin; done
g++  -O3 -pedantic -Wall   big.cpp   -o big

real    0m8.221s
user    0m4.507s
sys     0m0.514s

Not bad. In fact, this change alone brings the original code from 100s to 20s, so it is responsible for a factor of 5, and everything else I did is responsible for another factor of 5. (I think the difference between 'user' and 'real' time above is mostly accounted for by the fact that my virus scanner is running, which it wasn't earlier. 'user' is how much time the program occupied a CPU; 'real' includes time spent suspended, either waiting on I/O or giving another process time to run.)

Of course, my bucket sort relies on the fact that whatever we're doing with the values in each column is commutative and associative. Reducing the number of buckets only worked because large values are all treated the same. This might not be true for all your operations, so you'll have to look at the inner loop of each one in turn to figure out what to do with it.

And the code is a bit more complicated. Instead of running over the data doing "blah" for each stage, we're computing all the stages at the same time in a single run over the data. If you start doing row and column computations in a single pass, as I recommended in my first answer, this will get worse. You may have to start breaking your code into functions to keep it readable.

Finally, a lot of my performance gain came from an optimisation for the fact that "steps" is small. With steps=100, I get:

$ make big -B CPPFLAGS="-O3 -pedantic -Wall" && time ./big; for n in 1 2 3 4 5 6; do diff -q results$n.bin gold$n.bin; done
g++  -O3 -pedantic -Wall   big.cpp   -o big

real    0m22.262s
user    0m10.108s
sys     0m1.029s

This isn't so bad. With steps=100 the original code probably takes about 1400 seconds, although I'm not going to run it to prove that. But it's worth remembering that I haven't completely taken away the time dependency on "steps", just made it sub-linear.

叼着烟拽天下
Answer 4 · 2020-05-26 16:38

Though this would probably be very challenging if you're new to programming, a very powerful way to speed things up would be to use the power of the GPU. Not only is VRAM much faster than normal RAM, the GPU can also run your code in parallel on some 128 or more cores. Of course, for this amount of data you will need a pretty large amount of VRAM.

If you decide to check this possibility out, you should look up NVIDIA CUDA. I haven't checked it out myself, but it's meant for problems like this.

爱情/是我丢掉的垃圾
Answer 5 · 2020-05-26 16:39

Try this code:

int dim = 1000;
int steps = 7; // ranges from 1 to 255

// Note: projection must be zeroed before this loop, since it now accumulates.
for (int stage = 1; stage < steps; stage++)
    for (int k = 0; k < dim; k++)
        for (int i = 0; i < dim; i++)
        {
            for (int j = 0; j < dim; j++)
                if (partMap[(((i * dim) + k) * dim) + j] >= stage)
                    projection[(i * dim) + j]++; // order of i and j changed in the index
        }

transpose(projection);

I changed the order of the loops to make the code cache friendly. You would gain an order-of-magnitude performance boost with it, for sure.

This is the step you should take before you move on to multithreading.

在下西门庆
Answer 6 · 2020-05-26 16:41

It's impossible to tell, in general, because you did not specify how fast your CPU and RAM are. But chances are good that it will improve things, because I can't imagine that even 4 threads summing in parallel would saturate RAM enough for it to become the bottleneck (rather than the CPU).
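
The surest way to settle it is to measure. Below is a minimal, hypothetical sketch (my own, not from this answer) that times a plain sum over a billion bytes with 1, 2, and 4 threads; if the runtime stops improving as threads are added, memory bandwidth is the likely bottleneck:

#include <chrono>
#include <cstdio>
#include <thread>
#include <vector>

// Sum the slice [begin, end) of data into *out.
static void sumRange(const unsigned char *data, long long begin,
                     long long end, long long *out) {
    long long s = 0;
    for (long long n = begin; n < end; ++n)
        s += data[n];
    *out = s;
}

int main() {
    const long long size = 1000LL * 1000LL * 1000LL;
    std::vector<unsigned char> data(size, 1);

    for (int threads = 1; threads <= 4; threads *= 2) {
        std::vector<long long> partial(threads, 0);
        std::vector<std::thread> pool;
        auto t0 = std::chrono::steady_clock::now();
        // Give each thread its own contiguous slice and its own output slot.
        for (int t = 0; t < threads; ++t)
            pool.emplace_back(sumRange, data.data(), size * t / threads,
                              size * (t + 1) / threads, &partial[t]);
        for (auto &th : pool)
            th.join();
        long long ms = std::chrono::duration_cast<std::chrono::milliseconds>(
                           std::chrono::steady_clock::now() - t0).count();
        long long total = 0;
        for (long long p : partial)
            total += p;
        std::printf("%d thread(s): %lld ms (sum = %lld)\n", threads, ms, total);
    }
}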

兄弟一词,经得起流年.
Answer 7 · 2020-05-26 16:45

Eliminate False Sharing

This is where multiple cores block on each other while trying to read or update different memory addresses that share the same cache block. Processor cache locking is per block, and only one thread can write to that block at once.

Herb Sutter has a very good article on False Sharing, how to discover it and how to avoid it in your parallel algorithms.

He has loads of other excellent articles on concurrent programming too; see his blog.
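
As a hedged illustration (my example, not Sutter's) of both the hazard and the cure: if several threads update counters that sit next to each other in memory, the counters share a cache line and the cores keep stealing that line from one another. Padding each counter out to its own 64-byte line removes the contention:

#include <cstdio>
#include <thread>

// Each thread gets its own counter, padded to a full 64-byte cache
// line so that no two threads ever write to the same line.
struct alignas(64) PaddedCount {
    long value = 0;
};

int main() {
    constexpr int numThreads = 4;
    constexpr long itersPerThread = 100000000L;

    PaddedCount counts[numThreads];
    std::thread pool[numThreads];
    for (int t = 0; t < numThreads; ++t)
        pool[t] = std::thread([&counts, t] {
            // Without the padding, all four counters would share one cache
            // line and these writes would ping-pong it between cores.
            for (long n = 0; n < itersPerThread; ++n)
                counts[t].value++;
        });
    for (auto &th : pool)
        th.join();

    long total = 0;
    for (const auto &c : counts)
        total += c.value;
    std::printf("total = %ld\n", total);
}

In the projection code, the same principle means giving each thread its own contiguous chunk of the output array rather than interleaved elements.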
