I have a function that does the following:
- Takes a file as input and does basic cleaning.
- Extracts the required items from the file and writes them into a pandas dataframe.
- The dataframe is finally converted to CSV and written into a folder.
This is the sample code:
import multiprocessing
import os

import pandas as pd


def extract_function(filename):
    with open(filename, 'r') as f:
        input_data = f.readlines()
    try:
        # some basic searching / pattern matching / extracting
        # a dataframe with 10 columns is created and the extracted
        # values are filled into the empty dataframe
        # finally df.to_csv()
        pass
    except Exception:
        pass

if __name__ == '__main__':
    input_dir = "/home/Desktop/input"
    pool_size = multiprocessing.cpu_count()
    # os.listdir() returns bare names, so join them with the folder path
    filenames = [os.path.join(input_dir, name) for name in os.listdir(input_dir)]
    pool = multiprocessing.Pool(pool_size)
    pool.map(extract_function, filenames)
    pool.close()
    pool.join()
The total number of files in the input folder is 4000. I used multiprocessing because running the program with a normal for loop was taking some time. Below are the execution times of both approaches:
- Normal CPU processing = 139.22 seconds
- Multiprocessing = 18.72 seconds
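For reference, the two numbers above were collected roughly like this (a minimal sketch; run_sequential and run_parallel are illustrative names, not the exact code I ran, and extract_function is the function shown above):

import multiprocessing
import os
import time

input_dir = "/home/Desktop/input"
filenames = [os.path.join(input_dir, name) for name in os.listdir(input_dir)]

def run_sequential():
    # plain for loop, one file after the other
    for name in filenames:
        extract_function(name)

def run_parallel():
    # same work spread over one worker process per core
    pool = multiprocessing.Pool(multiprocessing.cpu_count())
    pool.map(extract_function, filenames)
    pool.close()
    pool.join()

if __name__ == '__main__':
    for label, runner in [("Normal CPU processing", run_sequential),
                          ("Multiprocessing", run_parallel)]:
        start = time.time()
        runner()
        print("%s = %.2f seconds" % (label, time.time() - start))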
My system specifications are:
Intel i5 7th gen, 12 GB RAM, 1 TB HDD, Ubuntu 16.04
While running the program for the 4000 files, all the cores are fully used (averaging around 90% per core). So I decided to increase the input size and repeat the process. This time the number of input files was increased from 4000 to 120,000. But this time the CPU usage was erratic at the start, and after some time the utilization went down (average usage around 10% per core). RAM utilization is also low, averaging 4 GB at most (the remaining 8 GB is free).

With the 4000 files as input, the writing to CSV was fast: I could see a jump of around 1000 files or more in an instant. But with the 120,000 files as input, the file writing slowed to some 300 files per instant, and this slowdown progressed roughly linearly; after some time the writing dropped to around 50-70 files per instant. All this time the majority of the RAM is free. I restarted the machine and tried the same thing to clear any unwanted zombie processes, but the result is still the same.
What is the reason for this? How can I achieve the same multiprocessing performance for a large number of files?
Note:
* Each input file averages around 300 KB in size.
* Each output file written is around 200 bytes.
* The total number of files is 4080, so the total input size is ~1.2 GB.
* These same 4080 files were copied to get the 120,000 files.
* This program is an experiment to check multiprocessing for a large number of files.
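The copies were generated along these lines (a rough sketch; make_copies and the destination folder name are illustrative, not the exact script I used):

import os
import shutil

def make_copies(src_dir, dst_dir, target_count):
    # keep duplicating the original files, with a numbered suffix,
    # until the destination folder holds target_count files
    os.makedirs(dst_dir, exist_ok=True)
    originals = sorted(os.listdir(src_dir))
    written = 0
    round_no = 0
    while written < target_count:
        for name in originals:
            if written >= target_count:
                break
            base, ext = os.path.splitext(name)
            dst_name = "%s_copy%d%s" % (base, round_no, ext)
            shutil.copy(os.path.join(src_dir, name), os.path.join(dst_dir, dst_name))
            written += 1
        round_no += 1

if __name__ == '__main__':
    make_copies("/home/Desktop/input", "/home/Desktop/input_large", 120000)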
Update 1
I have tried the same code on a much more powerful machine:
Intel i7 8th gen 8700, 1 TB SSHD & 60 GB RAM.
The file writing was much faster than on the normal HDD. The program took:
- For 4000 files - 3.7 sec
- For 120,000 files - 2 min
At some point during the experiment I got the fastest completion time, 84 sec, and at that point it gave me consistent results in two consecutive runs. Thinking this might be because I had hit the right worker count for the pool size, I restarted and tried again. But this time it was much slower. To give some perspective, during normal runs around 3000-4000 files are written in a second or two, but this time it was writing below 600 files in a second. In this case, too, the RAM was barely being used, and even though the multiprocessing module is in use, all the cores average only around 3-7% utilization.
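To check whether the pool size explains this variance, one thing I can do is sweep the number of worker processes and time each configuration (a sketch, assuming extract_function and the filenames list from the code above; the sizes tried are just examples):

import multiprocessing
import time

def time_pool(pool_size, filenames):
    # run the same workload with a given number of worker processes
    start = time.time()
    pool = multiprocessing.Pool(pool_size)
    pool.map(extract_function, filenames)  # extract_function as defined above
    pool.close()
    pool.join()
    return time.time() - start

if __name__ == '__main__':
    for size in (2, 4, 6, 8, 12, 16):
        elapsed = time_pool(size, filenames)
        print("pool_size=%d -> %.1f sec" % (size, elapsed))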