Is it a good idea to read/write files in parallel?

Posted 2019-05-24 18:51

I have a large number of data files describing the weather at a large number of weather stations. The data are hourly, split into one file per date.

For example:

20100101.csv
20100102.csv
20100103.csv
.
.
20140228.csv

I need to aggregate the data by station and then write it to disk. That is, from each of those daily files, I need to extract the data for station i and write it to that station's file.

The output:

station_001.csv
station_002.csv
.
.
station_999.csv

To speed things up, I decided to read the daily files in parallel using the foreach and doMC packages, and after aggregating by station I also write the station files to disk in parallel.

More specifically, I used foreach to read in the files and combined them with .combine="rbind" (I have enough memory to hold one huge data set). Afterwards, a second foreach loop subsets the data by station and writes each subset to disk. Doing the reads and writes in parallel gave me a very good speed boost.
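The pattern described above can be sketched roughly as follows. This is a minimal, self-contained illustration, not the asker's actual code: toy data in a temporary directory stands in for the real daily files, the column name `station` is an assumption, and doMC requires a Unix-like system (it forks workers).

```r
library(foreach)
library(doMC)  # assumption: Unix-like OS; doMC forks its workers
registerDoMC(cores = 2)

# Toy stand-ins for the daily files: two days, three stations each
daily_dir <- tempfile("daily"); out_dir <- tempfile("stations")
dir.create(daily_dir); dir.create(out_dir)
for (d in c("20100101", "20100102")) {
  write.csv(data.frame(station = 1:3, temp = rnorm(3)),
            file.path(daily_dir, paste0(d, ".csv")), row.names = FALSE)
}
files <- list.files(daily_dir, full.names = TRUE)

# Parallel read: each worker reads one daily file; rbind stacks the results
weather <- foreach(f = files, .combine = rbind) %dopar% read.csv(f)

# Parallel write: each worker owns a distinct station, so no two workers
# ever touch the same output file
invisible(foreach(s = unique(weather$station)) %dopar%
  write.csv(weather[weather$station == s, ],
            file.path(out_dir, sprintf("station_%03d.csv", s)),
            row.names = FALSE))
```

The key safety property is in the second loop: the work is partitioned by station, so each output file has exactly one writer.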

My question is: is it a good idea to read/write in parallel? I made sure that different threads never read the same data file or write to the same station file, but after some googling it seems that parallelizing I/O tasks may not be a good idea. (I found an example arguing against parallel input/output, and a post on R-bloggers demonstrating parallel reads.)

2 Answers
爱情/是我丢掉的垃圾
Answered 2019-05-24 19:27

You say you noticed a performance improvement, so it's evidently a good idea for you.

Additional ways to speed things up: check out fread in data.table, which speeds up sequential reading significantly (by a factor of 3 or more). Using rbindlist (also from data.table) to combine the results should also provide a speed-up (for example usage with foreach, see the question "R foreach with .combine=rbindlist").
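A minimal sketch of the fread/rbindlist suggestion, using the same kind of toy data as above (the temp-directory files and the `station` column name are assumptions, not the asker's real layout). With foreach, the equivalent would be passing .combine = rbindlist; here a plain lapply keeps the example sequential and dependency-free.

```r
library(data.table)

# Toy stand-ins for the daily files
daily_dir <- tempfile("daily"); dir.create(daily_dir)
for (d in c("20100101", "20100102")) {
  fwrite(data.table(station = 1:3, temp = rnorm(3)),
         file.path(daily_dir, paste0(d, ".csv")))
}
files <- list.files(daily_dir, full.names = TRUE)

# fread is a much faster drop-in for read.csv; rbindlist stacks a list of
# data.tables far more cheaply than repeated pairwise rbind calls
weather <- rbindlist(lapply(files, fread))

# Writing the per-station files then becomes a subset + fwrite per station
for (s in unique(weather$station)) {
  fwrite(weather[station == s],
         file.path(daily_dir, sprintf("station_%03d.csv", s)))
}
```

fwrite is itself multi-threaded, so the per-station writes may be fast enough without wrapping them in a parallel loop at all.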

Bombasti
Answered 2019-05-24 19:31

Performance Pro

  • Using multiple threads can increase performance on a multi-core machine

Performance Con

  • When reading from disk, CPU performance is typically not your bottleneck. Files on disk are, more often than not, written in as many sequential blocks as possible. This means the read head of a spinning disk does not have to move far to reach the next segment. If you perform the task in parallel, the head has to seek back and forth repeatedly to pick up wherever each task left off, so your overall disk throughput will ultimately be lower*.

    *Solid-state drives may not have this problem: with no moving read head, they pay little or no penalty for the interleaved access pattern.
