Once the CSV is loaded via read.csv, it's fairly trivial to use multicore, segue etc. to play around with the data in the CSV. Reading it in, however, is quite the time sink.

I realise it's better to use MySQL etc., but assume the use of an AWS 8xl cluster compute instance running R 2.13.
Specs as follows:
Cluster Compute Eight Extra Large specifications:
88 EC2 Compute Units (2 x eight-core Intel Xeon)
60.5 GB of memory
3370 GB of instance storage
64-bit platform
I/O Performance: Very High (10 Gigabit Ethernet)
Any thoughts / ideas much appreciated.
What you could do is use scan. Two of its input arguments could prove to be interesting: n and skip. You just open two or more connections to the file and use skip and n to select the part you want to read from it. There are some caveats, but you could give it a try and see if it gives a boost to your speed.
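A minimal sketch of that idea (the file name, column count and chunk size are all assumptions, and the CSV is assumed to be purely numeric with a header row). It uses scan's nlines argument alongside skip, since nlines counts lines rather than individual values, and multicore's mclapply to run the readers side by side:

    library(multicore)   # on R >= 2.14, library(parallel) has the same mclapply

    read.chunk <- function(chunk, chunk.size, file, n.cols) {
      what <- rep(list(numeric(0)), n.cols)        # one list element per column
      names(what) <- paste("V", seq_len(n.cols), sep = "")
      scan(file, what = what, sep = ",",
           skip   = 1 + (chunk - 1) * chunk.size,  # the +1 skips the header
           nlines = chunk.size, quiet = TRUE)
    }

    n.chunks   <- 4
    chunk.size <- 250000   # lines per worker; tune to the file

    chunks <- mclapply(seq_len(n.chunks), read.chunk,
                       chunk.size = chunk.size, file = "big.csv",
                       n.cols = 5, mc.cores = n.chunks)

    ## Stitch the chunks back into a single data.frame
    df <- do.call(rbind, lapply(chunks, as.data.frame))

Note that the final rbind copies everything one more time, so the stitching step itself costs time and memory.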
Going parallel might not be needed if you use fread in data.table. A comment to this question illustrates its power. Also, here's an example from my own experience: I was able to read in 1.04 million rows in under 10s!
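A minimal sketch of such a call (big.csv is a stand-in name; fread auto-detects the separator, header and column types by default):

    library(data.table)

    ## Wrap the call in system.time() to measure the elapsed read time
    system.time(dt <- fread("big.csv"))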
Flash or conventional HD storage? If the latter, and you don't know where the file sits on the drives and how it's split up, it's very hard to speed things up, because multiple simultaneous reads will not be faster than one streamed read. The bottleneck is the disk, not the CPU, and there's no way to parallelise around it without starting at the storage level of the file.

If it's flash storage, then a solution like Paul Hiemstra's might help, since good flash storage can have excellent random-read performance, close to sequential. Try it... but if it's not helping, you'll know why.

Also... a fast storage interface doesn't necessarily mean the drives can saturate it. Have you run performance testing on the drives to see how fast they really are?
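If not, a crude sequential-read benchmark is only a few lines of R (the path is made up; point it at a large file on the instance storage):

    ## Crude throughput test: read ~1 GB raw and divide by the elapsed time.
    ## Use a file bigger than RAM, or a freshly booted machine, so the OS
    ## page cache doesn't flatter the numbers.
    con <- file("/mnt/big.csv", "rb")
    t   <- system.time(x <- readBin(con, "raw", n = 1e9))
    close(con)
    cat(length(x) / 2^20 / t["elapsed"], "MB/s\n")

If that number comes out far below what the instance storage is rated for, parallelising the R side won't help.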