I am new to programming in general so please keep that in mind when you answer my question.
I have a program that takes a large 3D array (1 billion elements) and sums up elements along the various axes to produce a 2D array of a projection of each side of the data. The problem here is that it is very RAM intensive, as the program is constantly fetching information from RAM, both reading and writing.
The question is, will I gain any performance increase if I multithread the program, or will I end up running into a RAM access bottleneck? When I say multithreading, I only mean multithreading for 2 or 4 cores, no more.
If it helps, my current computer configuration is a 2.4 GHz Core 2 Quad, 1033 FSB, 4 GB RAM at 667 MHz.
Thanks in advance,
-Faken
Edit:
It seems that people here are much more interested in this question than I had first expected. I'll expand the question and post some code for those who are interested.
First of all, a little background on me so you understand where I'm coming from. I am a mechanical engineering graduate student who somehow managed to pick a topic that pretty much had nothing to do with mechanical engineering. I took one course in introductory Java (forced) approximately 5 years ago and had never touched programming until about a month ago when I started my thesis in earnest. I have also taken (again forced, still don't know why) a course in electronics and computer engineering; we dealt with microcontrollers (8-bit), their inner workings, and some ASM coding for them. Other than that, I know next to nothing about programming.
Here is the code:
int dim = 1000;
int steps = 7; // ranges from 1 to 255

for (int stage = 1; stage < steps; stage++)
    for (int j = 0; j < dim; j++)
        for (int i = 0; i < dim; i++)
        {
            int sum = 0;
            for (int k = 0; k < dim; k++)
                if (partMap[(((i * dim) + k) * dim) + j] >= stage)
                    sum++;

            projection[(j * dim) + i] = sum;
        }
This section of code operates on the z-axis only. The main data, due to the way it was constructed, has a strange addressing system, but you don't need to worry about that. There is also other code for doing the projections of the other sides of the cube, but they do very different things.
If, and this is a big IF, it is coded appropriately, you will most definitely see a speed-up. Now, as one of my professors always noted, people often try to take an algorithm, thread it, and in the end it is slower. This is often because of inefficient synchronization. So basically, if you feel like delving into threading (I honestly wouldn't suggest it if you are new to programming), have a go.
In your particular case the synchronization could be quite straightforward. That is to say, you could assign each thread to a quadrant of the large 3D matrix, where each thread is guaranteed sole access to a specific area of the input and output matrices, so there is no real need to 'protect' the data from multiple accesses/writes.
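For illustration only, here is a minimal sketch of that idea in C++ using std::thread (the function names, the int element type, and the split along the i dimension are all assumptions made for this example, not the questioner's code):

#include <thread>
#include <vector>

// Hypothetical helper: each thread sums its own range of i values, so no two
// threads ever write the same projection element and no locking is needed.
void sliceSum(const int* partMap, int* projection, int dim, int stage,
              int iBegin, int iEnd)
{
    for (int j = 0; j < dim; j++)
        for (int i = iBegin; i < iEnd; i++)
        {
            int sum = 0;
            for (int k = 0; k < dim; k++)
                if (partMap[(((i * dim) + k) * dim) + j] >= stage)
                    sum++;
            projection[(j * dim) + i] = sum;
        }
}

void projectZThreaded(const int* partMap, int* projection, int dim, int stage,
                      int numThreads)
{
    std::vector<std::thread> workers;
    int chunk = dim / numThreads;
    for (int t = 0; t < numThreads; t++)
    {
        int iBegin = t * chunk;
        int iEnd = (t == numThreads - 1) ? dim : iBegin + chunk;  // last thread takes the remainder
        workers.emplace_back(sliceSum, partMap, projection, dim, stage, iBegin, iEnd);
    }
    for (auto& w : workers)
        w.join();  // wait for every slice before using projection
}

The only data the threads share read-only is partMap, and their writes land in disjoint parts of projection, so no mutexes are needed at all.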
In summary, in this specific simple case threading may be quite easy, but in general synchronization when done poorly can cause the program to take longer. It really all depends.
Your computer system typically has several elements that limit its overall performance. Which part is the limiting element depends on the concrete situation. Normally, one of the following factors is the cause of your performance problems.
Disk I/O bandwidth: In most enterprise applications the sheer size of the data processed requires it to be stored in a database. Accessing this data may be slowed down by both the maximum transfer speed and, very often, by a large number of small disk accesses reading a few blocks here and there. Then you will see the latency of the disk heads moving around, and even the time the disk requires for a full rotation may limit your application. A long time ago I had a real problem with an expensive Sun E430 installation that was outperformed by my small NeXTstation... It was the constant fsync()ing of my database, which was slowed down by disks not caching write accesses (for good reason). Normally you can speed up your system by adding additional disks to get more I/O per second. Dedicating your drives to specific tasks may do even better in some cases.
Network latency: nearly everything said above about disks and application speed applies equally to network I/O.
RAM: If your RAM is not big enough to hold your complete application image, you need to store part of it on external disk. Then the disk I/O slowdown bites you again.
CPU processing speed (either integer or floating point): CPU processing power is the next factor that limits CPU-intensive tasks. A CPU has a physical speed limit that cannot be exceeded. The only way to speed up is to add more CPUs.
These limits may help you to find an answer for your specific problem.
Do you simply need more processing power, and does your system have more than one CPU or core? In that case, multithreading will improve your performance.
Do you observe significant network or disk latency? If so, your valuable CPU might be throwing away cycles waiting for some slow I/O. If more than one thread is active, another thread might find all the data it needs already in memory and pick up those otherwise wasted CPU cycles.
Therefore you need to observe your existing application. Try to estimate the memory bandwidth of the data being shuffled around. If the application is active on one CPU at below 100%, you might have reached the memory bandwidth limit. In that case, additional threading will do you no good because it does not give you more bandwidth from memory.
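As a rough worked example (assuming 4-byte elements, which may not match the actual data type): one full pass over 10^9 elements reads about 4 GB, so at a sustained 3-4 GB/s of memory bandwidth a single pass costs on the order of a second of pure memory traffic before any cache effects are considered.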
If the CPU is at 100%, give it a try, but have a look at the algorithms first. Multithreading adds overhead for synchronization (and complexity, tons of complexity) that might slightly reduce the memory bandwidth. Prefer algorithms that can be implemented while avoiding fine-grained synchronization.
If you see I/O wait times, think about clever partitioning or caching and then about threading. There is a reason why GNU make supported parallel builds back in the '90s :-)
The problem domain you've described leads me to look at clever algorithms first. Try to use sequential read/write operations on main memory as much as possible to support the CPU and memory subsystems. Keep operations "local" and data structures as small and optimized as possible to reduce the amount of memory that needs to be shuffled around before switching to a second core.
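To make that concrete for the z-projection posted in the question: with the index expression (((i * dim) + k) * dim) + j, it is the j dimension that is contiguous in memory, so a rearrangement along these lines (a sketch only, reusing dim, stage, partMap and projection from the question and assuming projection has been zeroed first) reads partMap strictly in address order:

// Same result as the posted z-projection loop for a single stage value,
// but partMap is now read sequentially (base + 0, base + 1, ...).
// projection must be zeroed before this runs.
for (int i = 0; i < dim; i++)
    for (int k = 0; k < dim; k++)
    {
        const int base = ((i * dim) + k) * dim;   // start of a contiguous run of dim values
        for (int j = 0; j < dim; j++)
            if (partMap[base + j] >= stage)
                projection[(j * dim) + i]++;      // same element the original assigned from 'sum'
    }

The projection writes still stride by dim elements, but for a fixed i they keep revisiting the same thousand cache lines, so those stay hot; the billion partMap reads, which dominate, become sequential.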
The questions you need to answer for your particular application are well-known.
First, is the work parallelisable? Amdahl's Law will give you an upper bound on how much you can speed things up with multithreading.
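For example, if a fraction P of the run time can be parallelised across N cores, the best-case speedup is roughly 1 / ((1 - P) + P/N); with P = 0.95 and N = 4 that works out to 1 / (0.05 + 0.95/4) ≈ 3.5x, so even a nearly perfectly parallel program will not quite reach 4x.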
Second, would a multithreaded solution introduce a lot of overhead? You say the program is "RAM intensive as the program is constantly fetching information from the RAM, both reading and writing." So you need to determine if the reading/writing is going to cause significant coordination overhead. This isn't easy. Although each CPU can access the computer's entire RAM (both read and write) at any time, doing so can slow down memory accesses -- even without locks -- because the various CPUs keep their own caches and need to coordinate what's in their caches with each other (CPU 1 has a value in cache, CPU 2 updates that value in RAM, CPU 2 has to tell CPU 1 to invalidate its cache). And if you do need locks (which is almost a guarantee as you're both "reading and writing" memory) then you'll need to avoid contention as much as possible.
Third, are you memory bound? "RAM intensive" is not the same thing as "memory bound". If you are currently CPU bound then multithreading will speed things up. If you are currently memory bound then multithreading may even slow things down (if one thread is too fast for memory, then what will happen with multiple threads?).
Fourth, are you slow for some other reason? If you're new'ing or malloc'ing a lot of memory in your algorithm, you may be seeing overheads from that alone. And on many platforms both new and malloc don't handle multithreading well, so if you're slow right now because malloc is bad, a multithreaded program will be even slower because malloc will be worse.
Overall, however, without seeing your code, I would expect it to be CPU bound and I would expect multithreading to speed things up -- almost as much as Amdahl's law would suggest, in fact. You may want to look at OpenMP or Intel's Threading Building Blocks library, or some sort of thread queue, to do it, though.
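As a rough illustration of the OpenMP route, here is a sketch that simply wraps the loops posted in the question (the wrapper name is made up, the int element type is assumed, and it needs to be compiled with OpenMP enabled, e.g. -fopenmp with GCC):

#include <omp.h>

// Hypothetical wrapper around the question's z-projection loops for one stage.
// Every (j, i) iteration writes a distinct projection element, so no locks are needed.
void projectZOpenMP(const int* partMap, int* projection, int dim, int stage)
{
    #pragma omp parallel for
    for (int j = 0; j < dim; j++)
        for (int i = 0; i < dim; i++)
        {
            int sum = 0;                           // private to each iteration
            for (int k = 0; k < dim; k++)
                if (partMap[(((i * dim) + k) * dim) + j] >= stage)
                    sum++;
            projection[(j * dim) + i] = sum;
        }
}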
It's a matrix problem?
Both Intel and AMD have super-optimized libraries for all sorts of heavy math problems. These libraries use threading, arrange the data for best cache use, cache prefetch, SSE vector instructions. Everything.
I believe you have to pay for the libraries, but they are well worth the money.
Multithreading will only make your code faster if the computations can be broken down into chunks that can be worked on independently and concurrently.
EDIT
I said the above (it's almost an automatic response) because I see many developers spend a lot of time multithreading code for no performance increase at all. Of course, they then end up with the same (or even slower) performance and the extra complications of managing multiple threads.
Yes, after reading your question again and taking your specific case into account, it does appear you would benefit from multithreading.
RAM is very fast, so I think it would be very hard to saturate the memory bandwidth unless you have many, many threads.
How does your code work? Does it go like this -- one separate pass over the whole array for each projection, adding up the rows in one pass, the columns in another, and the stacks in a third?
If so, you might want to read up on "locality of reference". Depending on how your data is stored, you might find that while you're doing the stacks, a whole cache line has to be pulled in for each value, because the values are nowhere near each other in memory. In fact, with a billion values, you could be pulling things all the way from disk. Sequential access with a long stride (distance between values) is the worst possible use for the cache. Try profiling, and if you see that adding up the stacks is taking longer than adding up the rows, this is almost certainly why.
I think you could be saturating the memory bus(*), in which case multithreading would only help if core2 quad uses different buses for different cores. But if you're not saturating the bus bandwidth, you can't get best performance this way even once you multi-thread. You'll have 4 cores spending all their time stalled on cache misses instead of one.
If you are memory cache bound, then your goal should be to visit each page/line of memory as few times as possible. So I'd try things like running over the data once, adding each value to three different totals as you go. If that runs faster on a single core, then we're in business. The next step is that with a 1000x1000x1000 cube, you have 3 million totals on the go. That doesn't fit in cache either, so you have to worry about the same cache miss problems writing as you do reading.
You want to make sure that as you run along a row of 1000 adjacent values in RAM adding to the row total that they all share, you're also adding to adjacent totals for the columns and stacks (which they don't store). So the "square" of column totals should be stored in the appropriate way, as should the "square" of stacks. That way you deal with 1000 of your billion values just by pulling about 12k of memory into cache (4k for 1000 values, plus 4k for 1000 column totals, plus 4k for 1000 stack totals). As against that, you're doing more stores than you would be by concentrating on 1 total at a time (which therefore could be in a register).
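A sketch of that single-pass idea for one stage value, reusing dim, stage and partMap from the question (the projX/projY/projZ names and their orientations are made up here, and each is assumed to be a zeroed dim x dim int array; the questioner's real projections for the other sides reportedly do different things, so this only illustrates the memory-access pattern):

for (int i = 0; i < dim; i++)
    for (int k = 0; k < dim; k++)
    {
        const int base = ((i * dim) + k) * dim;    // contiguous run of dim values over j
        for (int j = 0; j < dim; j++)
            if (partMap[base + j] >= stage)
            {
                projZ[(j * dim) + i]++;            // collapse along k (the question's projection)
                projY[(k * dim) + i]++;            // collapse along j: one hot counter per (i, k)
                projX[(k * dim) + j]++;            // collapse along i: contiguous writes over j
            }
    }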
So I don't promise anything, but I think it's worth looking at order of memory access, whether you multi-thread or not. If you can do more CPU work while accessing only a relatively small amount of memory, then you'll speed up the single-threaded version but also put yourself in much better shape for multi-threading, since the cores share a limited cache, memory bus, and main RAM.
(*) Back-of-envelope calculation: in random reviews off the internet, the highest estimated FSB bandwidth for Core 2 processors I've found so far is an Extreme at 12 GB/s, with 2 channels at 4 x 199 MHz each. Cache line size is 64 bytes, which is less than your stride. So summing a column or stack the bad way, grabbing 64 bytes per value, would only saturate the bus if it was doing 200 million values per second. I'm guessing it's nothing like that fast (10-15 seconds for the whole thing), or you wouldn't be asking how to speed it up.
So my first guess was probably way off. Unless your compiler or CPU has inserted some very clever pre-fetching, a single core cannot be using 2 channels and 4 simultaneous transfers per cycle. For that matter, 4 cores couldn't use 2 channels and 4 simultaneous transfers. The effective bus bandwidth for a series of requests might be much lower than the physical limit, in which case you would hope to see good improvements from multi-threading simply because you have 4 cores asking for 4 different cache lines, all of which can be loaded simultaneously without troubling the FSB or the cache controller. But the latency is still the killer, and so if you can load less than one cache line per value summed, you'll do much better.