How can I use more CPU to run my python script

Posted 2020-03-31 06:09

Question:

I want to use more processors to run my code, purely to minimise the running time. I have tried to do this but failed to get the desired result. My actual code is very big, so I am giving a very small and simple example here (even though it does not need a parallel job to run) just to learn how I can do parallel jobs in python. Any comments/suggestions will be highly appreciated.

import numpy as np
import matplotlib.pyplot as plt
from scipy.integrate import odeint


def solveit(n, y0):
    # RHS of the 2nd-order ODE, rewritten as a 1st-order system
    # in (theta, omega = dtheta/dx)
    def exam(y, x):
        theta, omega = y
        dydx = [omega, - (2.0/x)*omega - theta**n]
        return dydx

    x = np.linspace(0.1, 10, 100)

    # call the integrator
    sol = odeint(exam, y0, x)

    plt.plot(x, sol[:, 0], label='For n = %s, y0=(%s,%s)' % (n, y0[0], y0[1]))


ys= [[1.0, 0.0],[1.2, 0.2],[1.3, 0.3]]

fig = plt.figure()
for y_ in ys:
    solveit(1.,y_)

plt.legend(loc='best')
plt.grid()
plt.show() 

Answer 1:

First off: beware of parallelization.
It will often cause problems where you weren't expecting them, especially when you are not experienced with parallelization and your code is not optimized for it.
There are many things you need to look out for. Look at some YouTube tutorials and read up on best practices for parallelization.
This being said:
If you want to go straight ahead, here is a quick introduction to using Python's multiprocessing module: https://sebastianraschka.com/Articles/2014_multiprocessing.html
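For the asker's concrete snippet, a minimal Pool.map sketch could look like the following. This is a hedged illustration only: it assumes the plotting is pulled back into the parent process, since matplotlib figures do not travel well between worker processes, and it passes (n, y0) pairs as one pickled work-item each.

import numpy as np
import matplotlib.pyplot as plt
from scipy.integrate import odeint
from multiprocessing import Pool


def solveit(args):
    # unpack one work-item; each call runs in its own worker process
    n, y0 = args

    def exam(y, x):
        theta, omega = y
        return [omega, -(2.0 / x) * omega - theta**n]

    x = np.linspace(0.1, 10, 100)
    sol = odeint(exam, y0, x)           # integrate inside the worker
    return n, y0, x, sol                # ship raw results back, plot later


if __name__ == '__main__':              # mandatory guard for multiprocessing
    ys = [[1.0, 0.0], [1.2, 0.2], [1.3, 0.3]]
    with Pool(processes=3) as pool:
        results = pool.map(solveit, [(1.0, y0) for y0 in ys])

    for n, y0, x, sol in results:       # plotting stays in the parent process
        plt.plot(x, sol[:, 0], label='For n = %s, y0=(%s,%s)' % (n, y0[0], y0[1]))
    plt.legend(loc='best')
    plt.grid()
    plt.show()

With only three tiny integrations the process-setup costs will easily outweigh any gain; the pattern only starts to pay off once each work-item is expensive enough.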



Answer 2:

Q: How can I use more CPU to run my python script?

A few remarks first on "The Factors of The Game", i.e. on how any additional CPU might get counted into the flow of the processing-tasks' execution at all:
( detailed examples follow )

  • The costs of achieving a reasonable speedup by re-organising the as-is process-flow into a feasible parallel-code execution fashion
  • The known python limits for executing any parallel, computing-intensive strategy
  • The python script itself, i.e. the code, will look quite different, most of all if attempting to harness MPI-distributed-memory parallelism, operated "across" a set of {cluster|grid}-connected machines

Principal Disambiguation :
Standard python always remains a pure [SERIAL] interpreter, always.
[PARALLEL] is not [CONCURRENT]

[PARALLEL] process flow is the most complicated form of process-flow organisation: parallelised processes must start, execute and also complete at the same time, typically within a time-constraint, so any indeterministic blocking or other source of uncertainty ought to be avoided (not "just" mitigated on the fly, but avoided, principally prevented - and that is hard)

[CONCURRENT] process flow is far easier to achieve. Given that there are more free resources, a concurrency-policy-based process-scheduler can direct some work-streams ( threads ) to start executing on such a free resource ( disk-I/O, CPU-execution, etc. ). It can also "enforce" such work being soft-signalled or forcefully interrupted after some scheduler-decided amount of time and temporarily evicted from the "just-for-a-moment-lent" device/resource, so that another work-stream ( thread ) candidate gets its turn, after an indeterministically long or priority-driven wait in the scheduler's concurrent-scheduling policy queue.

[SERIAL] process flow is the simplest form - one step after another after another, without any stress from real-time passing along - "mañana (maˈɲana; English məˈnjɑːnə) n, adv .. b. some other and later time"

The python interpreter has been damned-[SERIAL] since forever, even though syntax constructors have brought tools for both { lightweight-THREAD-based | heavyweight-full-copy-PROCESS }-based forms of "concurrent" code-invocations.

The lightweight form is known to still rely on the python GIL-lock, which makes the actual execution re-[SERIAL]-ised again, right by temporarily lending the central interpreter's GIL-lock, in a round-robin fashion driven by a constant amount of time, to whatever big herd-of-THREADs. The result is finally [SERIAL] again, and this can be useful for "external"-latency-masking (example), but never for HPC-grade computing...
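A tiny, hedged illustration of that "external"-latency-masking niche ( the sleep below is just a stand-in for a blocking I/O call that releases the GIL while waiting ):

import time
from concurrent.futures import ThreadPoolExecutor

def fake_io_task( i ):
    time.sleep( 1.0 )                   # blocking wait releases the GIL
    return i

if __name__ == '__main__':
    start = time.perf_counter()
    with ThreadPoolExecutor( max_workers = 4 ) as pool:
        list( pool.map( fake_io_task, range( 4 ) ) )
    # four 1-second "waits" overlap and finish in roughly 1 second,
    # whereas four CPU-bound tasks would still queue up behind the GIL
    print( 'elapsed ~ %.1f [s]' % ( time.perf_counter() - start ) )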

Even the GIL-escaping attempts to pay all the costs and harness the heavyweight form of full-copy-PROCESS-based [CONCURRENT]-code execution are not free from headaches - just read carefully the warnings about crashes and hangs, and about the few, very rare resources which, once leaked, stay leaked until the next platform reboot(!):

Changed in version 3.8: On macOS, the spawn start method is now the default. The fork start method should be considered unsafe as it can lead to crashes of the subprocess. See bpo-33725.

Changed in version 3.4: spawn added on all unix platforms, and forkserver added for some unix platforms. Child processes no longer inherit all of the parents inheritable handles on Windows.

On Unix using the spawn or forkserver start methods will also start a resource tracker process which tracks the unlinked named system resources (such as named semaphores or SharedMemory objects) created by processes of the program. When all processes have exited the resource tracker unlinks any remaining tracked object. Usually there should be none, but if a process was killed by a signal there may be some “leaked” resources. (Neither leaked semaphores nor shared memory segments will be automatically unlinked until the next reboot. This is problematic for both objects because the system allows only a limited number of named semaphores, and shared memory segments occupy some space in the main memory.)
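A minimal sketch of the start-method ritual those warnings imply ( the worker function is illustrative only; note the mandatory if __name__ == '__main__': guard, without which the spawn start method would re-import the module and try to launch children recursively ):

import multiprocessing as mp

def work( i ):
    return i * i                        # placeholder payload

if __name__ == '__main__':
    mp.set_start_method( 'spawn' )      # explicit, portable choice of start method
    with mp.Pool( processes = 4 ) as pool:
        print( pool.map( work, range( 8 ) ) )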

Most of the time we will be happy with a good code-design, polished for python, augmented with some sort of smart vectorisation and a [CONCURRENT] processing organisation.

True [PARALLEL] code execution is a thing most probably no one would ever try to implement inside the deterministically GIL-interrupted python [SERIAL]-code interpreter ( as of 2019-Q3, this Game seems obvious to have already been lost a priori ).


Costs - expenses one need not see, but always has to pay :

Costs are present, always.

Smaller for THREAD-based attempts, larger for PROCESS-based attempts, and biggest when refactoring the code into distributed-memory parallelism ( using MPI inter-process communication mediating tools or another form of going distributed )

Each syntax-trick has some add-on costs, i.e. how long it takes in [TIME] and how big the add-on memory-allocations in [SPACE] are, before the "internal part" ( the useful code ) starts to work for us ( and hopefully accelerates the overall run-time ). If these add-on costs for a lump sum of ( processing-setup costs + parameters-transfer costs + coordination-and-communication costs + collection-of-results costs + processing-termination costs ) are the same as, or even higher than, the sought-for acceleration, you suddenly find yourself paying more than you receive.

When not yet having the final working code to test the hot-spot with, one may inject something like this crash-test-dummy code, so that the CPU and RAM get a stress-test workload:

##########################################################################
#-EXTERNAL-zmq.Stopwatch()'d-.start()-.stop()-clocked-EXECUTION-----------
#
import math     # numpy.math, as used originally, is just an alias of the stdlib math module
import numpy
#
def aFATpieceOfRAMallocationAndNUMPYcrunching( aRAM_size_to_allocate =  1E9,
                                               aCPU_load_to_generate = 20
                                               ):
    #-XTRN-processing-instantiation-COSTs
    #---------------------------------------------------------------------
    #-ZERO-call-params-transfer-COSTs
    #---------------------------------------------------------------------
    #-HERE---------------------------------RAM-size'd-STRESS-TEST-WORKLOAD
    _ = numpy.random.randint( -127,
                               127,
                               size  = int( aRAM_size_to_allocate ),
                               dtype = numpy.int8
                               )
    #---------------------------------------------------------------------
    #-HERE-----------------------------------CPU-work-STRESS-TEST-WORKLOAD
    # >>> aClk.start();_ = math.factorial( 2**f );aClk.stop()
    #              30 [us] for f =  8
    #             190 [us] for f = 10
    #           1 660 [us] for f = 12
    #          20 850 [us] for f = 14
    #         256 200 [us] for f = 16
    #       2 625 728 [us] for f = 18
    #      27 775 600 [us] for f = 20
    #     309 533 629 [us] for f = 22
    #  +3 ... ... ... [us] for f = 24+ & cluster-scheduler may kill job
    # +30 ... ... ... [us] for f = 26+ & cluster-manager may block you
    # ... ... ... ... [us] for f = 28+ & cluster-owner will hunt you!
    #
    return len( str( [ math.factorial( 2**f )
                                      for f in range( min( 22,
                                                           aCPU_load_to_generate
                                                           )
                                                      )
                       ][-1]
                     )
                ) #---- MAY TRY TO return( _.astype(  numpy.int64 )
                #------                  + len( str( [numpy.math.factorial(...)...] ) )
                #------                    )
                #------         TO TEST also the results-transfer COSTs *
                #------                      yet, be careful +RAM COSTs *
                #------                      get explode ~8+ times HERE *
#
#-EXTERNAL-ZERO-results-transfer-and-collection-COSTs
#########################################################################
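A possible timing harness for it ( a hedged sketch only: it assumes the dummy above was saved as a hypothetical module named crash_test.py so that a spawned worker can import it, and it uses plain time.perf_counter() instead of the zmq.Stopwatch noted above ):

import time
import multiprocessing as mp
from crash_test import aFATpieceOfRAMallocationAndNUMPYcrunching   # hypothetical module name

if __name__ == '__main__':
    clk = time.perf_counter()
    aFATpieceOfRAMallocationAndNUMPYcrunching( 1E9, 16 )            # pure in-process call
    print( 'local call   took %.3f [s]' % ( time.perf_counter() - clk ) )

    clk = time.perf_counter()
    with mp.Pool( processes = 1 ) as pool:                          # the very same work, plus
        pool.apply( aFATpieceOfRAMallocationAndNUMPYcrunching,      # process-setup, params-transfer
                    ( 1E9, 16 )                                     # and results-transfer add-on costs
                    )
    print( 'Pool.apply() took %.3f [s]' % ( time.perf_counter() - clk ) )

The difference between the two printed times is exactly the lump sum of add-on costs discussed above.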

How to avoid facing the final sarcasm of " A lousy bad deal, isn't it? "

Do a fair analysis, benchmark the hot-spots, and scale well beyond schoolbook-example sizes of data before you spend your time and budget. "Just coding" does not work here.

Why?
A single "wrong" SLOC may devastate the resulting performance into more than about +37% longer run-time, or may improve it to spend less than -57% of the baseline processing time.

Premature optimisations are awfully dangerous.

A costs/benefits analysis tells the facts before you spend your expenses. Amdahl's law may help you decide a break-even point, and it also gives one a principal limit, after which any number of free resources ( even infinitely many resources ( watch this fully interactive analysis and try to move the p-slider, for the [PARALLEL]-fraction of the processing, anywhere lower than the unrealistic 100% parallel-code, so as to smell the smoke of the real-life fire ) ) will not yield a bit of speedup for your code's processing-flow.
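For a quick feel of that limit, a small sketch of the overhead-naive Amdahl's-law formula S(N) = 1 / ( ( 1 - p ) + p / N ):

def amdahl_speedup( p, N ):
    # p .. fraction of the run-time that can be parallelised
    # N .. number of processors thrown at that fraction
    return 1.0 / ( ( 1.0 - p ) + p / N )

for p in ( 0.50, 0.90, 0.95 ):
    print( 'p = %.2f : S(8) = %5.2f,  S(oo) -> %5.1f'
           % ( p, amdahl_speedup( p, 8 ), 1.0 / ( 1.0 - p ) ) )

Even a 95% parallelisable workload is capped at a 20x speedup, no matter how many CPU-cores get thrown at it, and real add-on costs only push that cap further down.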


Hidden gems one will always like :

Smart vectorised tricks in performance-polished libraries like numpy, scipy et al. can and will internally use multiple CPU-cores, without python knowing or caring about that. Learn the vectorised-code tricks and your code will benefit a lot.
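A tiny, generic illustration of the difference ( numbers will vary by platform; the point is that the vectorised line runs inside numpy's compiled loops instead of the interpreter ):

import math
import numpy as np

x = np.linspace( 0.1, 10.0, 1_000_000 )

slow = np.empty_like( x )
for i in range( x.size ):               # python-level loop: interpreted, element by element
    slow[i] = math.sin( x[i] ) * math.exp( -x[i] )

fast = np.sin( x ) * np.exp( -x )       # one vectorised expression, same numbers

assert np.allclose( slow, fast )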

Also the numba LLVM compiler can help in cases where ultimate performance ought to be squeezed from your CPU-engine, and where the code cannot rely on the smart numpy performance tricks.
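A hedged numba sketch ( it assumes numba is installed; the summed RHS below is illustrative only, and the parallel=True + prange pair lets numba spread the loop over the available CPU-cores ):

import numpy as np
from numba import njit, prange

@njit( parallel=True )                  # compiled to machine code, GIL-free inside
def rhs_sum( theta, omega, x, n ):
    total = 0.0
    for i in prange( x.size ):          # numba distributes these iterations over cores
        total += -( 2.0 / x[i] ) * omega[i] - theta[i]**n
    return total

x     = np.linspace( 0.1, 10.0, 1_000_000 )
theta = np.ones_like( x )
omega = np.zeros_like( x )
print( rhs_sum( theta, omega, x, 1.0 ) )    # the first call also pays the JIT-compile cost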

Yet harder may be going for other {pre|jit}-compiled fashions of python code, so as to escape the trap of the GIL-lock's still-[SERIAL] stepping of the code-execution.


Wrap-up :

Having as many CPU-cores as possible is always fine. Harnessing all of them gets progressively harder: locally, inside a multiprocessor chip, it is hard; worse in a NUMA-architecture fabric; worst of all in a distributed ecosystem of separate, loosely-coupled computing nodes that are at best just connected ( MPI and other forms of message-based coordination of otherwise autonomous computing nodes ).

Yet the real costs of "getting 'em indeed work for you" could be higher than the benefit of actually doing so ( re-factoring + debugging + proof-of-correctness + the actual work + collecting the results ).

Murphy's Law is clear - if something can go wrong, it will, and at such a moment that it causes the maximum harm.

:o) so be optimistic on the way forward - it will be a wild ride, I can promise you