I have a list, dataframe_chunk, which contains chunks of a very large pandas dataframe. I would like to write each chunk to a different CSV file, and to do so in parallel. However, I see the files being written sequentially and I'm not sure why. Here's the code:
import concurrent.futures as cfu
import time

def write_chunk_to_file(chunk, fpath):
    chunk.to_csv(fpath, sep=',', header=False, index=False)

pool = cfu.ThreadPoolExecutor(N_CORES)

futures = []
for i in range(N_CORES):
    fpath = '/path_to_files_' + str(i) + '.csv'
    futures.append(pool.submit(write_chunk_to_file(dataframe_chunk[i], fpath)))

for f in cfu.as_completed(futures):
    print("finished at ", time.time())
Any clues?
One thing that is stated in the Python 2.7.x threading docs but not in the 3.x docs is that, because of the Global Interpreter Lock (GIL), CPython cannot achieve true parallelism with the threading library: only one thread executes Python bytecode at a time.
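To see this for yourself, here is a minimal, self-contained benchmark (my own illustration, not from the original post): a pure-Python CPU-bound loop takes roughly as long split across four threads as it does on one thread, because the GIL lets only one thread run bytecode at a time.

import threading
import time

def count(n):
    # Pure-Python CPU-bound work; the GIL prevents threads from
    # running this in parallel.
    while n > 0:
        n -= 1

N = 10_000_000

# All the work on one thread.
start = time.time()
count(4 * N)
print("one thread:  ", time.time() - start)

# The same total work split across four threads: in CPython the wall
# time is roughly the same (often worse, due to lock contention).
start = time.time()
threads = [threading.Thread(target=count, args=(N,)) for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print("four threads:", time.time() - start)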
You should try using concurrent.futures with the ProcessPoolExecutor, which uses a separate process for each job and can therefore achieve true parallelism on a multi-core CPU.
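For instance, here is a minimal sketch of your loop rewritten for ProcessPoolExecutor; it assumes dataframe_chunk and N_CORES are defined as in your code. Note that submit takes the callable and its arguments separately; calling write_chunk_to_file(...) inside submit(...) would execute it in the main thread instead of in a worker.

import concurrent.futures as cfu
import time

def write_chunk_to_file(chunk, fpath):
    chunk.to_csv(fpath, sep=',', header=False, index=False)

if __name__ == '__main__':
    # dataframe_chunk and N_CORES are assumed to be defined, as in the question.
    with cfu.ProcessPoolExecutor(N_CORES) as pool:
        futures = []
        for i in range(N_CORES):
            fpath = '/path_to_files_' + str(i) + '.csv'
            # Pass the function and its arguments separately so the call
            # happens in a worker process.
            futures.append(pool.submit(write_chunk_to_file,
                                       dataframe_chunk[i], fpath))
        for f in cfu.as_completed(futures):
            print("finished at ", time.time())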
Update

Here is your program adapted to use the multiprocessing library instead:
#!/usr/bin/env python3
from multiprocessing import Process
import os
import time

N_CORES = 8

def write_chunk_to_file(chunk, fpath):
    # Stand-in workload: `chunk` is ignored here; each process just writes
    # a large stream of numbers so the parallel file growth is easy to watch.
    with open(fpath, "w") as f:
        for x in range(10000000):
            f.write(str(x))

futures = []
print("my pid:", os.getpid())
input("Hit return to start:")
start = time.time()
print("Started at:", start)

# Create one Process per output file.
for i in range(N_CORES):
    fpath = './tmp/file-' + str(i) + '.csv'
    p = Process(target=write_chunk_to_file, args=(i, fpath))
    futures.append(p)

# Start them all, then wait for them all to finish.
for p in futures:
    p.start()
print("All jobs started.")

for p in futures:
    p.join()
print("All jobs finished at", time.time())
You can monitor the jobs with this shell command in another window:
while true; do clear; pstree 12345; ls -l tmp; sleep 1; done
(Replace 12345 with the pid emitted by the script.)