I'm fairly new to Python, and I'm trying to write some huge lists (with random letters inside). Actually it takes me around 75 - 80 seconds on my machine for 2,000,000 lines.
import timeit
import random, string
global_tab = []
global_nb_loop = 2000000
print("Generate %d lines" % global_nb_loop)
global_tab = []
for x in range(global_nb_loop):
global_tab.append(("".join( [random.choice(string.ascii_letters) for i in range(15)] ), "".join( [random.choice(string.digits) for i in range(2)])))
print("%d lines generated" % len(global_tab))
And the result with linux time
command:
$ time python3 DEV/PyETL/generateList.py
Generate 2000000 lines
2000000 lines generated
real 1m16.844s
user 1m16.609s
sys 0m0.203s
I was surprised when monitoring system resources that only 1 core was at 100%, instead of 4 like on a Windows machine on which I've tested this too.
Of course I've tried to apply some threads, but I'm facing a problem: it takes more time than running on a single core. Maybe threads are not the solution or I'm probably using them wrong.
Here is the new code:
import random, string
import threading
global_tab = []
global_nb_threads = 4
global_nb_loop = 2000000
threadLock = threading.Lock()
class generateList(threading.Thread):
def __init__(self, name):
threading.Thread.__init__(self)
self.name = name
def run(self):
global global_tab
self.tab = []
print("[%s] Generate %d lines" % (self.name, int(global_nb_loop/global_nb_threads)))
# divide desirated lines with number of threads
for x in range(int(global_nb_loop/global_nb_threads)):
self.tab.append(("".join( [random.choice(string.ascii_letters) for i in range(15)] ), "".join( [random.choice(string.digits) for i in range(2)])))
threadLock.acquire()
global_tab += self.tab
threadLock.release()
del self.tab
print("[%s] %d lines in list" % (self.name, len(global_tab)))
for i in range(global_nb_threads):
# Create threads
t = generateList("Thread-" + str(i))
# Start
t.start()
for i in range(global_nb_threads):
# Wait for threads end
t.join()
And the execution:
$ time python3 DEV/PyETL/generateListThreads.py
[Thread-0] Generate 500000 lines
[Thread-1] Generate 500000 lines
[Thread-2] Generate 500000 lines
[Thread-3] Generate 500000 lines
[Thread-3] 500000 lines in list
[Thread-0] 1000000 lines in list
[Thread-2] 1500000 lines in list
[Thread-1] 2000000 lines in list
real 1m40.858s
user 1m41.208s
sys 0m0.916s
32 seconds more than 1 core with 100%, but monitoring shows that the 8 cores were with 20 - 40% load at the same time.
Since all threads are working at the same time, generating fewer rows and synchronizing only for updating a global variable, shouldn't the execution time be lower than a single core?
I am pretty sure your lock is not necessary and is slowing you down. (edit: actually, I just noticed the lock is used after the majority of the work is done, so isn't really relevant.)
global_tab += self.tab
is (I think) atomic through the Python GIL. (Actually, this only claims list.extend()
, so use that instead. Here's another reference: Are lists thread safe?
Alternatively, I would try multiprocessing.imap_unordered
with a large chunksize. The downside is the results are sent over by stream, but your random string processing might overshadow that.
import multiprocessing
import random
import string
def randomword(x):
return ''.join(random.choice(string.ascii_letters) for i in range(15))
pool = multiprocessing.Pool(8)
results = pool.imap_unordered(randomword, range(100))
print([r for r in results])
For 2 million strings (I changed it to print the length):
$ time python r.py
2000000
real 0m38.305s
user 1m31.717s
sys 0m25.853s
I also tried cleaning up your version a bit and got:
$ time python rr.py
[Thread-0] Generate 250000 lines
[Thread-1] Generate 250000 lines
[Thread-2] Generate 250000 lines
[Thread-3] Generate 250000 lines
[Thread-4] Generate 250000 lines
[Thread-5] Generate 250000 lines
[Thread-6] Generate 250000 lines
[Thread-7] Generate 250000 lines
[Thread-4] 250000 lines in list
[Thread-1] 500000 lines in list
[Thread-7] 750000 lines in list
[Thread-0] 1000000 lines in list
[Thread-6] 1250000 lines in list
[Thread-2] 1500000 lines in list
[Thread-3] 1750000 lines in list
[Thread-5] 2000000 lines in list
real 0m22.113s
user 0m24.969s
sys 0m5.537s
A couple of significant changes:
- use
xrange()
on the large ranges (ah, python3 already does this.)
- remove the threadlock
- use
extend()
on the global.
(my results were about the same when just appending to the global_tab
, btw, and leaving out the temporary list.)
import random, string
import threading
global_tab = []
global_nb_threads = 8
global_nb_loop = 2000000
class generateList(threading.Thread):
def __init__(self, name):
threading.Thread.__init__(self)
self.name = name
def run(self):
global global_tab
self.tab = []
print("[%s] Generate %d lines" % (self.name, int(global_nb_loop/global_nb_threads)))
for x in range(int(global_nb_loop/global_nb_threads)):
self.tab.append(("".join( [random.choice(string.ascii_letters) for i in range(15)] ), "".join( [random.choice(string.digits) for i in range(2)])))
global_tab.extend(self.tab)
print("[%s] %d lines in list" % (self.name, len(global_tab)))
for i in range(global_nb_threads):
t = generateList("Thread-" + str(i))
t.start()
for i in range(global_nb_threads):
t.join()
...but, single threaded is still slightly faster at 16 seconds.
If I tune multiprocessing
, I can get it down to 6 seconds:
size = 2000000
processes = 8
pool = multiprocessing.Pool(processes)
results = [r for r in pool.imap_unordered(randomword, range(size), chunksize=int(size/processes))]
print(len(results))
output:
$ time python r.py
2000000
real 0m5.713s
user 0m35.594s
sys 0m0.546s
...so I think that's my final answer: Use multiprocessing
.
From python threading docs:
CPython implementation detail: In CPython, due to the Global
Interpreter Lock, only one thread can execute Python code at once
(even though certain performance-oriented libraries might overcome
this limitation). If you want your application to make better use of
the computational resources of multi-core machines, you are advised to
use multiprocessing. However, threading is still an appropriate model
if you want to run multiple I/O-bound tasks simultaneously.
Basically this means that threads in python don't improve performance unless the threads are mostly waiting for something to happen. Multiprocessing works pretty well in python, but because the processes don't share any objects or global state the programming model is a little different for multiprocessing. Here is an example of how you could use multiprocessing:
import multiprocessing
import random
import string
def randomData(i):
data = ("".join(random.sample(string.ascii_letters, 15)),
"".join(random.sample(string.digits, 2)))
return data
global_nb_loop = 2000000
pool = multiprocessing.Pool(8)
results = pool.imap(randomData, xrange(global_nb_loop))
global_tab = list(results)
print len(global_tab)
The multiprocessing
module has a lot of versions of map
and apply
, ie imap
, map_async
and so on. Look through the docs to find the one that's best for your problem.
Since you are working with large amounts of data, I recommend a look at numpy. Generally numpy is slower than lists but more memory efficient, and really good for many vectorized operations. You can always go down the multiprocessing route, even with numpy.
Here is a version that runs 3x faster than the original problem (for reference, the original version ran in 30.3 s on my machine).
import numpy as np
def numpy_test(N=2000000):
global_nb_loop = N
global_tab = []
asc_list = list('abcdefghijklmnopqrstuvwxyz')
print("Generate %d lines" % global_nb_loop)
global_tab = [(u.tostring(),str(v)) for u,v in zip( np.random.choice(asc_list, (N, 15)), np.random.randint(10, 100, N) )]
print("%d lines generated" % len(global_tab))
In [306]: %timeit numpy_test()
Generate 2000000 lines
2000000 lines generated
Generate 2000000 lines
2000000 lines generated
Generate 2000000 lines
2000000 lines generated
Generate 2000000 lines
2000000 lines generated
1 loop, best of 3: 11.1 s per loop