So I realize this question sounds stupid (and yes, I am using a dual core), but I have tried two different libraries (Grand Central Dispatch and OpenMP), and when using clock() to time the code with and without the lines that make it parallel, the speed is the same. (For the record, both were using their own form of parallel for.) They report being run on different threads, but perhaps they are running on the same core? Is there any way to check? (Both libraries are for C; I'm uncomfortable at lower layers.) This is super weird. Any ideas?
Most likely your execution time isn't bound by those loops you parallelized.
My suggestion is that you profile your code to see what is taking most of the time. Most engineers will tell you that you should do this before doing anything drastic to optimize things.
EDIT: Added detail for Grand Central Dispatch in response to OP comment.
While the other answers here are useful in general, the specific answer to your question is that you shouldn't be using `clock()` to compare the timing. `clock()` measures CPU time, which is added up across the threads. When you split a job between cores, it uses at least as much CPU time (usually a bit more, due to threading overhead). Search for clock() on this page to find "If process is multi-threaded, cpu time consumed by all individual threads of process are added." It's just that the job is split between threads, so the overall time you have to wait is less. You should be using the wall time (the time on a wall clock). OpenMP provides a routine, `omp_get_wtime()`, to do it. Take the following kind of routine as an example.
You can see that the `clock()` time doesn't change much. I get 0.254 without the `pragma`, so it's a little slower using OpenMP with one thread than not using OpenMP at all, but the wall time decreases with each thread. The improvement won't always be this good due to, for example, parts of your calculation that aren't parallel (see Amdahl's law) or different threads fighting over the same memory.
EDIT: For Grand Central Dispatch, the GCD reference states that GCD uses `gettimeofday` for wall time. So, I created a new Cocoa App, and in `applicationDidFinishLaunching` I put a similar loop, timed with `clock()` and `gettimeofday`. The results on the console were about the same as I was getting above.
This is a very contrived example. In fact, you need to be sure to keep the optimization at -O0, or else the compiler will realize we don't keep any of the calculations and not do the loop at all. Also, the integer that I'm taking the `cos` of is different in the two examples, but that doesn't affect the results too much. See the `STRIDE` section of the manpage for `dispatch_apply` for how to do it properly, and for why `iterations` there is broadly comparable to `num_threads` in this case.

EDIT: I note that Jacob's answer includes the claim that `omp_get_thread_num()` prints out which core it's working on, which is not correct (it has been partly fixed by an edit). Using `omp_get_thread_num()` is indeed a good way to ensure that your code is multithreaded, but it doesn't show "which core it's working on", just which thread. For example, a parallel loop that requests 50 threads and prints `omp_get_thread_num()` for each iteration reports threads 0 to 49, but this doesn't show which core it's working on, since I only have eight cores. By looking at the Activity Monitor (the OP mentioned GCD, so must be on a Mac: go `Window/CPU Usage`), you can see jobs switching between cores, so core != thread.

If you are using a lot of memory inside the loop, that might prevent it from being faster. You could also look into the pthread library, to manually handle threading.
Your question is missing some very crucial details, such as the nature of your application, which portion of it you are trying to improve, profiling results (if any), etc.
Having said that, you should remember several critical points when approaching a performance-improvement effort:
Make sure you are not going against these points, because an educated guess (barring any additional details) will say that's exactly what you're doing.
I use the `omp_get_thread_num()` function within my parallelized loop to print out which thread it's working on, and I specify `num_threads`. For example, with the pragma `#pragma omp parallel for default(none) shared(a,b,c) num_threads(2)`, you can be sure that it's running on both cores, since only 2 threads will be created.
Btw, is OpenMP enabled when you're compiling? In Visual Studio you have to enable it in the Property Pages, under `C++ -> Language`, by setting `OpenMP Support` to `Yes`.
It's hard to guess without any details. Maybe your application isn't even CPU bound. Did you watch CPU load while your code was running? Did it hit 100% on at least one core?