I want to use more processors to run my code, solely to minimize the running time. I have tried to do it but failed to get the desired result. My actual code is very big, so I am giving here a very small and simple example (though it does not need a parallel job to run) just to learn how to do parallel jobs in Python. Any comments/suggestions will be highly appreciated.
import numpy as np
import matplotlib.pyplot as plt
from scipy.integrate import odeint

def solveit(n, y0):
    def exam(y, x):
        theta, omega = y
        dydx = [omega, -(2.0/x)*omega - theta**n]
        return dydx
    x = np.linspace(0.1, 10, 100)
    # call integrator
    sol = odeint(exam, y0, x)
    plt.plot(x, sol[:, 0], label='For n = %s, y0=(%s,%s)' % (n, y0[0], y0[1]))

ys = [[1.0, 0.0], [1.2, 0.2], [1.3, 0.3]]

fig = plt.figure()
for y_ in ys:
    solveit(1., y_)

plt.legend(loc='best')
plt.grid()
plt.show()
Q: How can I use more CPUs to run my python script?
A few remarks first, on "The Factors of The Game": how any extra CPUs might at all get counted into the flow of the processing-task execution
( detailed examples follow )
- The costs of achieving a reasonable speedup by re-organising the as-is process-flow into a feasible parallel-code execution fashion
- Known python limits, to be aware of, for executing any parallel computing-intensive strategy
- The python script itself, i.e. The Code, will look way different, most of all if attempting to harness an MPI-distributed memory parallelism, operated "across" a set of {cluster|grid}-connected machines ( a minimal mpi4py sketch follows right below this list )
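To illustrate that last point, a minimal sketch of how differently an MPI-distributed code reads ( assuming the mpi4py package and an MPI launcher like mpirun are available; the per-rank "work" is only a placeholder ) - every rank runs the very same script and self-selects its slice of the work:

from mpi4py import MPI

comm = MPI.COMM_WORLD
rank = comm.Get_rank()                      # this process' ID within the "world"
size = comm.Get_size()                      # how many processes were launched

ys      = [ [1.0, 0.0], [1.2, 0.2], [1.3, 0.3] ]
my_work = ys[rank::size]                    # each rank takes every size-th item

results     = [ ( y0, sum( y0 ) ) for y0 in my_work ]   # placeholder "work"
all_results = comm.gather( results, root = 0 )          # rank 0 collects all
if rank == 0:
    print( all_results )

# launched not as:  python script.py   but e.g. as:  mpirun -n 3 python script.py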
Principal Disambiguation :
Standard python always remains a pure [SERIAL] interpreter, always.
[PARALLEL] is not [CONCURRENT]
[PARALLEL]
process flow is the most complicated form of process-flow organisation: parallelised processes must start, execute and also complete at the same time, typically within a time-constraint, so any indeterministic blocking or other source of uncertainty ought to be avoided (not "just" mitigated on-the-fly, but avoided, principally prevented - and that is hard)
[CONCURRENT]
process flow is way easier to achieve. Given there are more free resources, a concurrency-policy-based process-scheduler can direct some work-streams ( threads ) to start executing on such a free resource ( disk-I/O, CPU-execution, etc. ). It can also "enforce" such work being soft-signalled or forcefully interrupted after some scheduler-decided amount of time, and temporarily evicted from using the "just-for-a-moment-lent" device/resource, so that another work-stream ( thread ) candidate gets its turn, after an indeterministically long or priority-driven wait in the scheduler's concurrent-scheduling policy queue.
[SERIAL]
process flow is the simplest form - one step after another, after another, without any stress from real-time passing along - "mañana (maˈɲana; English məˈnjɑːnə) n, adv .. b. some other and later time"
The python interpreter has been since ever damned-[SERIAL], even though syntax-constructors have brought tools for both { lightweight-THREAD-based | heavyweight-full-copy-PROCESS-based } forms of "concurrent"-code invocations.
The lightweight form is known to still rely on the python GIL-lock, which makes the actual execution re-[SERIAL]-ised again, right by temporarily lending the central interpreter's GIL-lock, in a round-robin fashion driven by a constant amount of time, to one after another of whatever big herd-of-THREADs. The result is finally [SERIAL] again and this can be useful for "external"-latency-masking (example), but never for HPC-grade computing...
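A minimal latency-masking sketch ( assuming the per-task work is I/O-bound, faked here by time.sleep(); the fetch() helper is just an illustration ) - the GIL gets released while a thread waits, so many waits can overlap, yet no python bytecode ever executes in parallel:

import time
from concurrent.futures import ThreadPoolExecutor

def fetch( i ):                          # a stand-in for a network / disk request
    time.sleep( 1.0 )                    # the GIL gets released during the wait
    return i

t0 = time.perf_counter()
with ThreadPoolExecutor( max_workers = 10 ) as pool:
    results = list( pool.map( fetch, range( 10 ) ) )
print( "10 x 1 [s] waits finished in ~%.1f [s]" % ( time.perf_counter() - t0 ) )
# ~1 [s] instead of ~10 [s] - latency masked, yet zero CPU-work speedup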
Even the GIL-escaping attempts, which pay all the costs of harnessing the heavyweight form of full-copy-PROCESS-based [CONCURRENT]-code execution, are not free from headaches - just read carefully the warnings about crashes, and about the few, very scarce resources that get hung after leaks until the next platform reboot(!):
Changed in version 3.8: On macOS, the spawn start method is now the default. The fork start method should be considered unsafe as it can lead to crashes of the subprocess. See bpo-33725.

Changed in version 3.4: spawn added on all unix platforms, and forkserver added for some unix platforms. Child processes no longer inherit all of the parents inheritable handles on Windows.

On Unix using the spawn or forkserver start methods will also start a resource tracker process which tracks the unlinked named system resources (such as named semaphores or SharedMemory objects) created by processes of the program. When all processes have exited the resource tracker unlinks any remaining tracked object. Usually there should be none, but if a process was killed by a signal there may be some "leaked" resources. (Neither leaked semaphores nor shared memory segments will be automatically unlinked until the next reboot. This is problematic for both objects because the system allows only a limited number of named semaphores, and shared memory segments occupy some space in the main memory.)
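For the asker's mutually independent initial conditions, the heavyweight-PROCESS-based attempt is nevertheless simple to sketch ( a sketch, not a definitive recipe: it assumes the workers are kept plotting-free - they only integrate and return the solution, while the parent process does all the matplotlib calls; the solve_one() name is mine ):

import numpy as np
from multiprocessing import Pool
from scipy.integrate import odeint

def solve_one( y0, n = 1.0 ):                  # must live at module level ( picklable )
    def exam( y, x ):
        theta, omega = y
        return [ omega, -( 2.0 / x ) * omega - theta**n ]
    x = np.linspace( 0.1, 10, 100 )
    return x, odeint( exam, y0, x )

if __name__ == '__main__':                     # required for the spawn start method
    import matplotlib.pyplot as plt
    ys = [ [1.0, 0.0], [1.2, 0.2], [1.3, 0.3] ]
    with Pool( processes = 3 ) as pool:        # pays all the COSTs discussed below
        sols = pool.map( solve_one, ys )
    for ( x, sol ), y0 in zip( sols, ys ):     # plotting stays in the parent only
        plt.plot( x, sol[:, 0], label = 'For n = 1.0, y0=(%s,%s)' % ( y0[0], y0[1] ) )
    plt.legend( loc = 'best' ); plt.grid(); plt.show()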
Most of the time we will be happy with a good code-design, polished for python, augmented with some sort of smart vectorisation and a [CONCURRENT]-processing organisation.
The true [PARALLEL] code execution is a thing most probably no one would ever try to implement inside the deterministically GIL-interrupted python [SERIAL]-code interpreter ( as of 2019-3Q, this Game seems obvious to have already been lost a priori ).
Costs - expenses one need not see, but always has to pay :
Costs are present, always. Smaller for THREAD-based attempts, larger for PROCESS-based attempts, biggest for refactoring the code into distributed-memory parallelism ( using MPI inter-process communication mediating tools or another form of going distributed ).
Each syntax-trick has some add-on costs, i.e. how long it takes in [TIME] and how big the add-on memory-allocations in [SPACE] are, before the "internal part" ( the useful code ) starts to work for us ( and hopefully accelerates the overall run-time ). If this lump sum of add-on costs ( processing-setup costs + parameters-transfer costs + coordination-and-communication costs + collection-of-results costs + processing-termination costs ) is the same as, or worse, higher than the sought-for acceleration, you suddenly find yourself paying more than you receive.
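A minimal sketch of making this lump sum visible ( the tiny() payload is my own, deliberately near-zero piece of work, so that almost everything measured is the overhead itself ):

import time
from multiprocessing import Pool

def tiny( x ):                                   # almost-zero useful work
    return x * x

if __name__ == '__main__':
    t0 = time.perf_counter()
    _  = [ tiny( i ) for i in range( 100 ) ]     # pure-[SERIAL] baseline
    t1 = time.perf_counter()
    with Pool( processes = 4 ) as pool:          # process-setup + transfer COSTs
        _ = pool.map( tiny, range( 100 ) )
    t2 = time.perf_counter()
    print( "SERIAL: %8.6f [s]   POOL: %8.6f [s]" % ( t1 - t0, t2 - t1 ) )
    # the POOL figure is dominated by process-spawning and pickling overheads,
    # not by the hundred multiplications - a "lump sum" paid in advance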
When not yet having the final working code for testing the hot-spot, one may inject something like this crash-test-dummy code, so the CPU and RAM get a stress-test workload:
##########################################################################
#-EXTERNAL-zmq.Stopwatch()'d-.start()-.stop()-clocked-EXECUTION-----------
#
import math
import numpy

def aFATpieceOfRAMallocationAndNUMPYcrunching( aRAM_size_to_allocate = 1E9,
                                               aCPU_load_to_generate =  20
                                               ):
    #-XTRN-processing-instantiation-COSTs
    #---------------------------------------------------------------------
    #-ZERO-call-params-transfer-COSTs
    #---------------------------------------------------------------------
    #-HERE---------------------------------RAM-size'd-STRESS-TEST-WORKLOAD
    _ = numpy.random.randint( -127,
                               127,
                               size  = int( aRAM_size_to_allocate ),
                               dtype = numpy.int8
                               )
    #---------------------------------------------------------------------
    #-HERE-----------------------------------CPU-work-STRESS-TEST-WORKLOAD
    # >>> aClk.start();_ = math.factorial( 2**f );aClk.stop()
    #            30 [us] for f =  8
    #           190 [us] for f = 10
    #         1 660 [us] for f = 12
    #        20 850 [us] for f = 14
    #       256 200 [us] for f = 16
    #     2 625 728 [us] for f = 18
    #    27 775 600 [us] for f = 20
    #   309 533 629 [us] for f = 22
    #  +3 ... ... ... [us] for f = 24+ & cluster-scheduler may kill job
    # +30 ... ... ... [us] for f = 26+ & cluster-manager may block you
    # ... ... ... ... [us] for f = 28+ & cluster-owner will hunt you!
    #
    return len( str( [ math.factorial( 2**f )
                       for f in range( min( 22,
                                            aCPU_load_to_generate
                                            )
                                       )
                       ][-1]
                     )
                ) #---- MAY TRY TO return( _.astype( numpy.int64 )
                  #------            +     len( str( [math.factorial(...)...] ) )
                  #------                  )
                  #------        TO TEST also the results-transfer COSTs *
                  #------        yet, be careful, the +RAM COSTs *
                  #------        explode ~8+ times HERE *
#
#-EXTERNAL-ZERO-results-transfer-and-collection-COSTs
#########################################################################
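A possible usage sketch for the dummy above, assuming a plain time.perf_counter() in place of the external zmq.Stopwatch() - the same call run once in-process and once shipped into a spawned child, so the add-on COSTs become visible:

import time
from multiprocessing import Pool

if __name__ == '__main__':
    t0 = time.perf_counter()
    aFATpieceOfRAMallocationAndNUMPYcrunching( 1E6, 16 )    # plain in-process call
    t1 = time.perf_counter()
    with Pool( processes = 1 ) as pool:                     # + spawn + pickle COSTs
        pool.apply( aFATpieceOfRAMallocationAndNUMPYcrunching, ( 1E6, 16 ) )
    t2 = time.perf_counter()
    print( "LOCAL: %.3f [s]   VIA-POOL: %.3f [s]" % ( t1 - t0, t2 - t1 ) )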
How to avoid facing the final sarcasm of "A lousy bad deal, isn't it?"
Do a fair analysis, benchmark the hot-spots and scale beyond schoolbook-example data sizes well before you spend your time and budget. "Just coding" does not work here.
Why?
A single "wrong" SLOC may devastate the resulting performance into more than about +37% longer run-time, or may improve it to spend less than -57% of the baseline processing time.
Premature optimisations are awfully dangerous.
A costs/benefits analysis tells the facts before spending your expenses. Amdahl's law may help you decide a break-even point; it also gives one a principal limit, after which any number of free resources ( even infinitely many resources ( watch this fully interactive analysis and try to move the p-slider, for the [PARALLEL]-fraction of the processing, anywhere lower than the unrealistic 100% parallel-code, so as to smell the smoke of the real-life fire ) ) will not yield a bit of speedup for your code's processing-flow.
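A worked sketch of that limit: with a p-fraction of the code parallelisable over N resources, the overall speedup S( N ) = 1 / ( ( 1 - p ) + p / N ) saturates at 1 / ( 1 - p ), no matter how many CPU-cores get thrown at it:

def amdahl_speedup( p, N ):          # p : [PARALLEL]-fraction, N : resources
    return 1.0 / ( ( 1.0 - p ) + p / N )

for p in ( 0.50, 0.90, 0.95 ):
    print( "p = %.2f : N = 4 -> %5.2f x   N = 1024 -> %5.2f x   N -> inf -> %5.2f x"
           % ( p,
               amdahl_speedup( p,    4 ),
               amdahl_speedup( p, 1024 ),
               1.0 / ( 1.0 - p )
               ) )
# even with 1024 free CPU-cores, a 95%-parallel code never exceeds ~20 x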
Hidden gems one will always like :
Smart vectorised tricks inside performance-polished libraries like numpy, scipy et al. can and will internally use multiple CPU-cores, without python knowing or taking care about that. Learn the vectorised-code tricks and your code will benefit a lot.
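A minimal sketch of the difference - the same arithmetic, once as a plain python loop and once handed over to numpy's compiled ( and possibly multi-core ) inner loops:

import time
import numpy as np

x = np.random.rand( 1_000_000 )

t0 = time.perf_counter()
s_loop = 0.0
for v in x:                          # interpreted, one element at a time
    s_loop += v * v
t1 = time.perf_counter()

s_vec = np.dot( x, x )               # one compiled, cache-friendly sweep
t2 = time.perf_counter()

print( "loop: %8.5f [s]   numpy: %8.5f [s]" % ( t1 - t0, t2 - t1 ) )
# expect an orders-of-magnitude difference, with no GIL-fights at all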
Also the numba LLVM compiler can help in cases where the ultimate performance ought to be squeezed from your CPU-engine and the code cannot rely on the smart numpy performance tricks.
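A minimal numba sketch ( assuming the numba package is installed; the rms() example is mine ) - the decorated function gets LLVM-compiled on its first call and thereafter runs as native code, free of the per-bytecode interpreter overhead:

import numpy as np
from numba import njit

@njit
def rms( x ):                        # plain loops are fine, once compiled
    s = 0.0
    for v in x:
        s += v * v
    return ( s / x.size ) ** 0.5

x = np.random.rand( 1_000_000 )
rms( x )                             # the 1st call pays the compile-time COST
print( rms( x ) )                    # subsequent calls run the native code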
Yet harder could be to go into other {pre|jit}-compiled fashions of python code, so as to escape from the trap of the GIL-lock's still-[SERIAL] stepping of the code-execution.
Wrap-up :
Having as many CPU-cores as possible is fine, always. Harnessing all such CPU-cores is harder: easiest when available locally in a multiprocessor chip, worse in a NUMA-architecture fabric, worst in a distributed ecosystem of separate, loosely-coupled, at-least-connected computing nodes ( MPI and other forms of message-based coordination of otherwise autonomous computing nodes ).
Though the real costs of "getting 'em indeed work for you" could be higher than the benefit of actually doing it ( re-factoring + debugging + proof-of-correctness + actual work + collecting of results ).
Murphy's Law is clear - if something can go wrong, it does so at the very moment it can cause the maximum harm.
:o) so be optimistic on the way forward - it will be a wild ride, I can promise you