I have a large number of data files describing the weather at a large number of weather stations. The observations are hourly, and the data are split into one file per date.
For example:
20100101.csv
20100102.csv
20100103.csv
.
.
20140228.csv
I need to aggregate the data by station and then write it to disk. That is, for each station i, I need to extract its rows from every one of those daily files and write them out to a single file for that station.
The output:
station_001.csv
station_002.csv
.
.
station_999.csv
To speed things up, I decided to read the daily files in parallel using the foreach and doMC packages, and to write the station files to disk in parallel as well, after aggregating by station. More specifically, I used foreach to read in the files and combined them using .combine="rbind" (I have enough memory to hold one huge data set in memory). Afterwards, I have another foreach loop where I subset the data by station and write each subset to disk. Doing the read/write in parallel gave me a very good speed boost.
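A minimal sketch of what I'm doing (the station column name, core count, and toy data are illustrative, not my real setup; doMC forks, so this is Unix-only):

```r
library(foreach)
library(doMC)
registerDoMC(cores = 2)

# Toy daily files so the sketch runs end-to-end (illustrative data only).
dir <- tempfile(); dir.create(dir); setwd(dir)
for (d in c("20100101", "20100102")) {
  write.csv(data.frame(station = 1:3, temp = rnorm(3)),
            paste0(d, ".csv"), row.names = FALSE)
}

daily_files <- list.files(pattern = "^\\d{8}\\.csv$")

# Read the daily files in parallel and stack them into one big data frame.
weather <- foreach(f = daily_files, .combine = "rbind") %dopar% read.csv(f)

# Subset by station and write each station's file, also in parallel.
foreach(s = unique(weather$station)) %dopar% {
  write.csv(weather[weather$station == s, ],
            sprintf("station_%03d.csv", s), row.names = FALSE)
}
```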
My question is: is it a good idea to read/write in parallel? I made sure that different threads never read the same data file or write to the same station file, but after some googling it seems that parallelizing I/O tasks may not be a good idea (e.g., an example I found saying no to parallel input/output, and a post on R-bloggers showing parallel reads).
Performance Pro
You say you notice performance improvements, so then it's obviously a good idea for you.
Additional ways to speed things up: check out fread in data.table; that will speed up sequential reading significantly (by a factor of 3 or more). Using rbindlist (also from data.table) to combine should also provide a speedup (example usage with foreach here: R foreach with .combine=rbindlist).
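To make the fread/rbindlist suggestion concrete, here is a small sketch (file names and columns are illustrative; the foreach combine wraps its arguments in a list because rbindlist expects a list of tables):

```r
library(data.table)
library(foreach)

# Toy files standing in for the real daily CSVs.
dir <- tempfile(); dir.create(dir); setwd(dir)
for (d in c("20100101", "20100102")) {
  fwrite(data.table(station = 1:3, temp = rnorm(3)), paste0(d, ".csv"))
}

files <- list.files(pattern = "^\\d{8}\\.csv$")

# fread is much faster than read.csv, and rbindlist is much faster
# than repeated rbind for stacking many tables.
weather <- rbindlist(lapply(files, fread))

# Equivalent foreach form, combining with rbindlist:
weather2 <- foreach(f = files,
                    .combine = function(...) rbindlist(list(...)),
                    .multicombine = TRUE) %do% fread(f)
```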
Performance Con
When reading from disk, CPU performance is typically not your bottleneck. Files on disk are, more often than not, written in as many sequential blocks as possible. This means that the head of a spinning disk does not have to move far to read the next segment. If you perform the task in parallel, the head has to move repeatedly to pick up wherever each reader left off. This means that your effective disk throughput will ultimately be lower*.
*Solid-state drives may not have this problem (I don't know much about SSDs, but I imagine they aren't impacted by interleaved reads, since there is no moving read head).
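Since the answer depends on the storage hardware, one practical check is to time sequential versus parallel reads on the actual files. A minimal sketch using base R's parallel package (the file names, sizes, and core count are assumptions; mclapply forks, so this is Unix-only):

```r
library(parallel)

# Toy files for timing (illustrative; use the real daily CSVs instead).
dir <- tempfile(); dir.create(dir); setwd(dir)
for (i in 1:20) {
  write.csv(data.frame(x = rnorm(1e4)),
            sprintf("f%02d.csv", i), row.names = FALSE)
}
files <- list.files(pattern = "^f\\d{2}\\.csv$")

# Time both strategies; whichever is smaller wins on this disk/filesystem.
seq_time <- system.time(lapply(files, read.csv))["elapsed"]
par_time <- system.time(mclapply(files, read.csv, mc.cores = 2))["elapsed"]

cat(sprintf("sequential: %.2fs  parallel: %.2fs\n", seq_time, par_time))
```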