Multiprocessing so slow

Posted 2019-08-25 14:27

Question:

I have a function that does the following:

  • Takes a file as input and does basic cleaning.
  • Extracts the required items from the file and writes them into a pandas dataframe.
  • Finally converts the dataframe to CSV and writes it into a folder.

This is the sample code:

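import multiprocessing
import os

INPUT_DIR = "/home/Desktop/input"

def extract_function(filename):
    with open(filename, 'r') as f:
        input_data = f.readlines()
    try:
        # some basic searching, pattern matching, extracting
        # dataframe creation with 10 columns; the extracted values are
        # then filled into the empty dataframe
        # finally df.to_csv()
        pass
    except Exception as exc:
        # report the failing file instead of aborting the whole pool
        print(filename, exc)

if __name__ == '__main__':
    pool_size = multiprocessing.cpu_count()
    # os.listdir() returns bare names, so join them with the directory
    filenames = [os.path.join(INPUT_DIR, name) for name in os.listdir(INPUT_DIR)]
    pool = multiprocessing.Pool(pool_size)
    pool.map(extract_function, filenames)
    pool.close()
    pool.join()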

The input folder holds 4000 files in total. I used multiprocessing because running the program normally, with a for loop, was taking some time. Below are the execution times of both approaches:

Normal CPU processing = 139.22 seconds
Multiprocessing = 18.72 seconds

My system specifications are:

Intel i5 7th gen, 12 GB RAM, 1 TB HDD, Ubuntu 16.04

While running the program on the 4000 files, all the cores are fully used (averaging around 90% each). So I decided to increase the number of input files and repeat the process, going from 4000 to 1,20,000 (that is, 120,000) files. This time the CPU usage was erratic at the start, and after some time the utilization dropped to around 10% per core on average. RAM utilization is also low, averaging 4 GB at most (the remaining 8 GB is free).

With the 4000 files as input, the CSV writing was fast: I could see jumps of around 1000 files or more in an instant. With the 1,20,000 files as input, the writing slowed to some 300 files at a time, kept slowing down linearly, and after some time dropped to around 50-70 files at a time. All this time the majority of the RAM is free. I restarted the machine to clear any unwanted zombie processes and tried again, but the result is the same.

What is the reason for this? How can I get the same multiprocessing speed-up for a large number of files?

Note:
* Each file averages around 300 KB.
* Each output file written is around 200 bytes.
* The total number of files is 4080, so the total size is ~1.2 GB.
* These same 4080 files were copied to make the 1,20,000 files.
* This program is an experiment to check multiprocessing on a large number of files.

Update 1

I have tried the same code on a much more powerful machine.

Intel i7 8th gen 8700, 1 TB SSHD & 60 GB RAM.

The file writing was much faster than on a normal HDD. The program took:

  • For 4000 files - 3.7 sec
  • For 1,20,000 files - 2 min

At some point during the experiment I got the fastest completion time, 84 sec, and it gave a consistent result on two consecutive tries. Thinking that this might be because I had set the thread factor in the pool size correctly, I restarted and tried again. But this time it was much slower. To give some perspective, during normal runs around 3000-4000 files are written in a second or two, but this time it was writing fewer than 600 files per second. In this case, too, the RAM was hardly being used. Even though the multiprocessing module is being used, all the cores average only around 3-7% utilization.

Answer 1:

Reading from and writing to disk is slow, compared to running code and data from RAM. It is extremely slow compared to running code and data from the internal cache in the CPU.

In an attempt to make this faster, several caches are used.

  1. A hard disk generally has a built-in cache. In 2012 I did some write testing on this. With the hard disk's write cache disabled, writing speed dropped from 72 MiB/s to 12 MiB/s.
  2. Most operating systems today use otherwise unoccupied RAM as a disk cache.
  3. The CPU has several levels of built-in caches as well.

(Usually there is a way to disable caches 1 and 2. If you try that you'll see read and write speed drop like a rock.)

So my guess is that once you pass a certain number of files, you exhaust one or more of the caches, and disk I/O becomes the bottleneck.

To verify, you would have to add code to extract_function to measure 3 things:

  • How long it takes to read the data from disk.
  • How long it takes to do the calculations.
  • How long it takes to write the CSV.

Have extract_function return a tuple of those three numbers, and analyse them. Instead of map, I would advise using imap_unordered, so you can start evaluating the numbers as soon as they become available.
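
The measurement described above could be sketched roughly as follows. This is only an illustration under assumptions: `timed_extract` and `measure` are hypothetical names, the parsing and the CSV writing are replaced by placeholders, and the `.out` suffix is made up for the example.

```python
import multiprocessing
import os
import sys
import time


def timed_extract(filename):
    """Return (read_s, compute_s, write_s) for one file."""
    t0 = time.perf_counter()
    with open(filename, "r") as f:
        input_data = f.readlines()
    t1 = time.perf_counter()

    # Placeholder for the real pattern matching and dataframe creation.
    result = str(len(input_data))
    t2 = time.perf_counter()

    # Placeholder for df.to_csv(); the ".out" suffix is hypothetical.
    with open(filename + ".out", "w") as f:
        f.write(result)
    t3 = time.perf_counter()
    return (t1 - t0, t2 - t1, t3 - t2)


def measure(filenames, pool_size=None):
    """Sum the three timings over all files, in completion order."""
    totals = [0.0, 0.0, 0.0]
    with multiprocessing.Pool(pool_size) as pool:
        # Results arrive as soon as any worker finishes a file.
        for timings in pool.imap_unordered(timed_extract, filenames):
            for i, seconds in enumerate(timings):
                totals[i] += seconds
    return totals


if __name__ == "__main__" and len(sys.argv) > 1:
    input_dir = sys.argv[1]  # e.g. /home/Desktop/input
    names = [os.path.join(input_dir, n) for n in os.listdir(input_dir)]
    read_s, compute_s, write_s = measure(names)
    print(f"read {read_s:.2f}s  compute {compute_s:.2f}s  write {write_s:.2f}s")
```

If the read or write totals dominate the compute total, that would point at disk I/O rather than at the CPU.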

If disk I/O turns out to be the problem, consider using an SSD.



Answer 2:

As @RolandSmith & @selbie suggested, I avoided the continuous I/O of writing individual CSV files: the extracted values are now appended to data frames instead. I think this cleared up the inconsistencies. I checked the high-performance I/O formats "feather" and "parquet", as suggested by @CoMartel, but I think they are meant for compressing large files into a smaller data frame structure, and they had no option for appending.
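
One way to get this effect, sketched under assumptions, is to have each worker return its extracted values and let the parent process write a single file at the end. The names `extract_row`, `write_rows`, `run` and the two columns are hypothetical, and the standard `csv` module stands in for pandas to keep the sketch dependency-free; with pandas, the collected rows could be passed to `pd.DataFrame(rows)` and written with one `to_csv()` call.

```python
import csv
import multiprocessing
import os
import sys

# Hypothetical column names; the question only says there are 10 columns.
COLUMNS = ["filename", "line_count"]


def extract_row(filename):
    """Parse one file and return its values as a dict (no disk write)."""
    with open(filename, "r") as f:
        input_data = f.readlines()
    # Placeholder for the real pattern matching.
    return {"filename": os.path.basename(filename),
            "line_count": len(input_data)}


def write_rows(rows, out_path):
    """Write all collected rows to a single CSV file."""
    with open(out_path, "w", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=COLUMNS)
        writer.writeheader()
        writer.writerows(rows)


def run(filenames, out_path, pool_size=None):
    """Extract in parallel, then write once in the parent process."""
    with multiprocessing.Pool(pool_size) as pool:
        # chunksize hands tasks to the workers in batches.
        rows = list(pool.imap_unordered(extract_row, filenames, chunksize=64))
    write_rows(rows, out_path)
    return len(rows)


if __name__ == "__main__" and len(sys.argv) > 2:
    input_dir, out_path = sys.argv[1], sys.argv[2]
    names = [os.path.join(input_dir, n) for n in os.listdir(input_dir)]
    print(run(names, out_path), "rows written")
```

With around 200 bytes of output per input file, even 1,20,000 rows amount to only a few tens of megabytes, so holding them in memory until the final write should be unproblematic.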

Observations:

  • The program runs slowly on the first run. Successive runs are faster. This behavior was consistent.
  • I checked for trailing Python processes still running after the program completed but couldn't find any. So some kind of caching within the CPU/RAM (presumably the operating system's disk cache mentioned in the other answer) makes the program run faster on successive runs.

For 4000 input files the program took 72 sec on the first execution and then an average of 14-15 sec on all successive runs.

  • Restarting the system clears those caches and makes the first run slow again.

  • The average fresh run time is 72 sec. But killing the program as soon as it starts and then running it again took 40 sec for the first run after termination, and an average of 14 sec for all successive runs.

  • During the fresh run, utilization is around 10-13% on all cores. On all successive runs, it is 100%.

I checked with the 1,20,000 files and it follows the same pattern. So, for now, the inconsistency is solved. If such code needs to run as a server, a dry run should be made first so that the CPU/RAM caches are warm before it starts accepting API queries; that way the results come faster.
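
A dry run of this kind could be as simple as reading every input file once and discarding the data. The sketch below assumes that the speed-up comes from the operating system keeping recently read files in RAM, which these observations suggest but do not prove, and that the files fit into free memory. `warm_up` is a hypothetical helper name.

```python
import os
import sys
import time


def warm_up(filenames, block_size=1 << 20):
    """Read every file once and discard the data; return total bytes read."""
    total = 0
    for name in filenames:
        with open(name, "rb") as f:
            while True:
                block = f.read(block_size)
                if not block:
                    break
                total += len(block)
    return total


if __name__ == "__main__" and len(sys.argv) > 1:
    input_dir = sys.argv[1]  # e.g. /home/Desktop/input
    names = [os.path.join(input_dir, n) for n in os.listdir(input_dir)]
    start = time.perf_counter()
    n_bytes = warm_up(names)
    print(f"read {n_bytes} bytes in {time.perf_counter() - start:.2f}s")
```

Running it twice in a row and comparing the two timings gives a rough check of the assumption: if the second pass is much faster, the files are being served from RAM.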