Current scenario: I have 900 files in a directory called directoryA. The files are named file0.txt through file899.txt, each 15MB in size. I loop through the files sequentially in Python. I load each file as a list, do some operations on it, and write an output file to directoryB. When the loop ends I have 900 files in directoryB, named out0.csv through out899.csv.
Problem: Processing each file takes about 3 minutes, so the script runs for more than 40 hours in total. I would like to run the processing in parallel, since the files are completely independent of each other (no inter-dependencies). My machine has 12 cores.
The script below runs sequentially; please help me run it in parallel. I have looked at some of Python's parallel processing modules through related Stack Overflow questions, but they are difficult for me to understand as I don't have much exposure to Python. Thanks a billion.
Pseudo Script
from os import listdir
import csv
import re

mypath = "some/path/"
inputDir = mypath + 'dirA/'
outputDir = mypath + 'dirB/'

for fileName in listdir(inputDir):
    # load the text file as a list using the csv module
    # run a bunch of operations on that list
    # regex the int out of the filename, e.g. file1.txt gives 1 and file42.txt gives 42
    fileNum = re.search(r'\d+', fileName).group()
    # write a corresponding csv file to dirB, e.g. file99.txt is written as out99.csv
    outPath = outputDir + 'out' + fileNum + '.csv'
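For context, here is a minimal sketch of what I think a multiprocessing.Pool version might look like, based on the examples I found; I have not verified it. process_file is a hypothetical function that would hold the same per-file steps as the loop above, and the worker count of 12 simply matches my core count.

from os import listdir
from multiprocessing import Pool
import re

mypath = "some/path/"
inputDir = mypath + 'dirA/'
outputDir = mypath + 'dirB/'

def process_file(fileName):
    # hypothetical worker holding the same per-file steps as the sequential loop
    fileNum = re.search(r'\d+', fileName).group()
    inPath = inputDir + fileName
    outPath = outputDir + 'out' + fileNum + '.csv'
    # load inPath as a list with the csv module, run the operations,
    # and write the result to outPath (placeholder for the real work)

if __name__ == '__main__':
    # one worker process per core; Pool.map calls process_file on every filename in parallel
    with Pool(processes=12) as pool:
        pool.map(process_file, listdir(inputDir))

If Pool.map is not the right tool for independent per-file work like this, I would also appreciate a pointer to a better option.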