Reduce execution time on huge list generation

2019-05-23 04:19发布

问题:

I'm fairly new to Python, and I'm trying to write some huge lists (with random letters inside). Actually it takes me around 75 - 80 seconds on my machine for 2,000,000 lines.

import timeit
import random, string

global_tab     = []
global_nb_loop = 2000000

print("Generate %d lines" % global_nb_loop)
global_tab = []
for x in range(global_nb_loop):
    global_tab.append(("".join( [random.choice(string.ascii_letters) for i in range(15)] ), "".join( [random.choice(string.digits) for i in range(2)])))
print("%d lines generated" % len(global_tab))

And the result with linux time command:

$ time python3 DEV/PyETL/generateList.py 
Generate 2000000 lines
2000000 lines generated

real    1m16.844s
user    1m16.609s
sys 0m0.203s

I was surprised when monitoring system resources that only 1 core was at 100%, instead of 4 like on a Windows machine on which I've tested this too.

Of course I've tried to apply some threads, but I'm facing a problem: it takes more time than running on a single core. Maybe threads are not the solution or I'm probably using them wrong.

Here is the new code:

import random, string
import threading

global_tab         = []
global_nb_threads  = 4
global_nb_loop     = 2000000


threadLock         = threading.Lock()

class generateList(threading.Thread):
    def __init__(self, name):
        threading.Thread.__init__(self)
        self.name = name

    def run(self):
        global global_tab
        self.tab = []

        print("[%s] Generate %d lines" % (self.name, int(global_nb_loop/global_nb_threads)))
        # divide desirated lines with number of threads
        for x in range(int(global_nb_loop/global_nb_threads)):
            self.tab.append(("".join( [random.choice(string.ascii_letters) for i in range(15)] ), "".join( [random.choice(string.digits) for i in range(2)])))

        threadLock.acquire()
        global_tab += self.tab
        threadLock.release()
        del self.tab
        print("[%s] %d lines in list" % (self.name, len(global_tab)))


for i in range(global_nb_threads):
    # Create threads
    t = generateList("Thread-" + str(i))
    # Start
    t.start()

for i in range(global_nb_threads):
    # Wait for threads end
    t.join()

And the execution:

$ time python3 DEV/PyETL/generateListThreads.py 
[Thread-0] Generate 500000 lines
[Thread-1] Generate 500000 lines
[Thread-2] Generate 500000 lines
[Thread-3] Generate 500000 lines
[Thread-3] 500000 lines in list
[Thread-0] 1000000 lines in list
[Thread-2] 1500000 lines in list
[Thread-1] 2000000 lines in list    
real    1m40.858s
user    1m41.208s
sys 0m0.916s

32 seconds more than 1 core with 100%, but monitoring shows that the 8 cores were with 20 - 40% load at the same time.

Since all threads are working at the same time, generating fewer rows and synchronizing only for updating a global variable, shouldn't the execution time be lower than a single core?

回答1:

I am pretty sure your lock is not necessary and is slowing you down. (edit: actually, I just noticed the lock is used after the majority of the work is done, so isn't really relevant.)

global_tab += self.tab is (I think) atomic through the Python GIL. (Actually, this only claims list.extend(), so use that instead. Here's another reference: Are lists thread safe?

Alternatively, I would try multiprocessing.imap_unordered with a large chunksize. The downside is the results are sent over by stream, but your random string processing might overshadow that.

import multiprocessing
import random
import string

def randomword(x):
    return ''.join(random.choice(string.ascii_letters) for i in range(15))

pool = multiprocessing.Pool(8)
results = pool.imap_unordered(randomword, range(100))
print([r for r in results])

For 2 million strings (I changed it to print the length):

$ time python r.py                                                                 
2000000

real    0m38.305s
user    1m31.717s
sys     0m25.853s

I also tried cleaning up your version a bit and got:

$ time python rr.py 
[Thread-0] Generate 250000 lines
[Thread-1] Generate 250000 lines
[Thread-2] Generate 250000 lines
[Thread-3] Generate 250000 lines
[Thread-4] Generate 250000 lines
[Thread-5] Generate 250000 lines
[Thread-6] Generate 250000 lines
[Thread-7] Generate 250000 lines
[Thread-4] 250000 lines in list
[Thread-1] 500000 lines in list
[Thread-7] 750000 lines in list
[Thread-0] 1000000 lines in list
[Thread-6] 1250000 lines in list
[Thread-2] 1500000 lines in list
[Thread-3] 1750000 lines in list
[Thread-5] 2000000 lines in list

real    0m22.113s
user    0m24.969s
sys     0m5.537s

A couple of significant changes:

  • use xrange() on the large ranges (ah, python3 already does this.)
  • remove the threadlock
  • use extend() on the global.

(my results were about the same when just appending to the global_tab, btw, and leaving out the temporary list.)

import random, string
import threading

global_tab         = []
global_nb_threads  = 8
global_nb_loop     = 2000000

class generateList(threading.Thread):
    def __init__(self, name):
        threading.Thread.__init__(self)
        self.name = name

    def run(self):
        global global_tab
        self.tab = []

        print("[%s] Generate %d lines" % (self.name, int(global_nb_loop/global_nb_threads)))
        for x in range(int(global_nb_loop/global_nb_threads)):
            self.tab.append(("".join( [random.choice(string.ascii_letters) for i in range(15)] ), "".join( [random.choice(string.digits) for i in range(2)])))

        global_tab.extend(self.tab)
        print("[%s] %d lines in list" % (self.name, len(global_tab)))


for i in range(global_nb_threads):
    t = generateList("Thread-" + str(i))
    t.start()

for i in range(global_nb_threads):
    t.join()

...but, single threaded is still slightly faster at 16 seconds.

If I tune multiprocessing, I can get it down to 6 seconds:

size = 2000000
processes = 8
pool = multiprocessing.Pool(processes)
results = [r for r in pool.imap_unordered(randomword, range(size), chunksize=int(size/processes))]
print(len(results))

output:

$ time python r.py                                                                 
2000000

real    0m5.713s
user    0m35.594s
sys     0m0.546s

...so I think that's my final answer: Use multiprocessing.



回答2:

From python threading docs:

CPython implementation detail: In CPython, due to the Global Interpreter Lock, only one thread can execute Python code at once (even though certain performance-oriented libraries might overcome this limitation). If you want your application to make better use of the computational resources of multi-core machines, you are advised to use multiprocessing. However, threading is still an appropriate model if you want to run multiple I/O-bound tasks simultaneously.

Basically this means that threads in python don't improve performance unless the threads are mostly waiting for something to happen. Multiprocessing works pretty well in python, but because the processes don't share any objects or global state the programming model is a little different for multiprocessing. Here is an example of how you could use multiprocessing:

import multiprocessing
import random
import string

def randomData(i):
    data = ("".join(random.sample(string.ascii_letters, 15)),
            "".join(random.sample(string.digits, 2)))
    return data

global_nb_loop = 2000000
pool = multiprocessing.Pool(8)
results = pool.imap(randomData, xrange(global_nb_loop))
global_tab = list(results)
print len(global_tab)

The multiprocessing module has a lot of versions of map and apply, ie imap, map_async and so on. Look through the docs to find the one that's best for your problem.



回答3:

Since you are working with large amounts of data, I recommend a look at numpy. Generally numpy is slower than lists but more memory efficient, and really good for many vectorized operations. You can always go down the multiprocessing route, even with numpy.

Here is a version that runs 3x faster than the original problem (for reference, the original version ran in 30.3 s on my machine).

import numpy as np


def numpy_test(N=2000000):
    global_nb_loop = N 
    global_tab     = []
    asc_list = list('abcdefghijklmnopqrstuvwxyz')

    print("Generate %d lines" % global_nb_loop)
    global_tab = [(u.tostring(),str(v)) for u,v in zip( np.random.choice(asc_list, (N, 15)), np.random.randint(10, 100, N) )]
    print("%d lines generated" % len(global_tab))


In [306]: %timeit numpy_test()
Generate 2000000 lines
2000000 lines generated
Generate 2000000 lines
2000000 lines generated
Generate 2000000 lines
2000000 lines generated
Generate 2000000 lines
2000000 lines generated
1 loop, best of 3: 11.1 s per loop