I'm trying to work with a 1909 x 139352 dataset in R. Since my computer has only 2 GB of RAM, the dataset (about 500 MB on disk) is too big for the conventional methods, so I decided to use the ff
package. However, I've been having some trouble. The function read.table.ffdf
is unable to read even the first chunk of data. It fails with the following error:
txtdata <- read.table.ffdf(file = "/directory/myfile.csv",
                           FUN = "read.table",
                           header = FALSE,
                           sep = ",",
                           colClasses = c("factor", rep("integer", 139351)),
                           first.rows = 100, next.rows = 100,
                           VERBOSE = TRUE)
read.table.ffdf 1..100 (100) csv-read=77.253sec
Error in ff(initdata = initdata, length = length, levels = levels, ordered = ordered, :
  write error
Does anyone have any idea of what is going on?
This error message indicates that you have too many open files. In ff, every column of your ffdf is backed by its own file. You can only have a limited number of files open at once - and you have hit that limit. See my reply on Any ideas on how to debug this FF error?.
So in your case, simply using read.table.ffdf won't work because you have 139352 columns. It is possible to import the data with ff, but you need to be careful about how many columns are open while getting data into RAM, to avoid this issue.
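As a rough sketch of what "being careful" could look like (the block size of 1000 is an assumption, and the file path is taken from your question; this is untested): import the columns in blocks, closing each block's ffdf so its column files are released before the next block is opened. That way only a fraction of the 139352 column files is ever open at the same time.

```r
library(ff)

n_col  <- 139352
block  <- 1000                        # columns per block (assumption)
starts <- seq(2, n_col, by = block)   # column 1 is the factor key

blocks <- lapply(starts, function(s) {
  e   <- min(s + block - 1, n_col)
  cls <- rep("NULL", n_col)           # colClasses = "NULL" makes read.table skip a column
  cls[s:e] <- "integer"
  df  <- read.table("/directory/myfile.csv", sep = ",",
                    header = FALSE, colClasses = cls)
  fdf <- as.ffdf(df)                  # one backing file per column on disk
  close(fdf)                          # release the file handles
  fdf                                 # reopen later with open(fdf)
})
```

Re-reading the whole file once per block is slow, but it bounds the number of simultaneously open files by the block size rather than by 139352.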
Your data set really isn't that big.
It might help if you said something about what you're trying to do with it.
This might help: Increasing Available memory in R.
Or,
if that doesn't work, the data.table package is very fast and doesn't hog memory when manipulating data.tables with the := operator.
And,
as far as read.table.ffdf goes, check out this read.table.ffdf tutorial; if you read it carefully, it gives hints and details about optimizing your memory usage with commands like gc() and more.
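For example, a minimal sketch of := (the column names and values here are made up for illustration):

```r
library(data.table)

dt <- data.table(id = 1:5, x = c(10L, 20L, 30L, 40L, 50L))
dt[, y := x * 2L]      # adds column y by reference: no copy of dt is made
dt[x > 20L, y := 0L]   # updates a subset in place, again without copying
```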
I recently encountered this problem with a data frame that had ~3,000 columns. The easiest way around it is to raise the maximum number of open files allowed for your user account. A typical system default is ~1024, which is a very conservative limit. Do note that it is set that low to prevent resource exhaustion on the server.
On Linux:
Add the following to your /etc/security/limits.conf file:
youruserid hard nofile 200000 # you may enter whatever number you wish here
youruserid soft nofile 200000 # whatever you want the default to be for each shell or process you have running
On OS X:
Add or edit the following in your /etc/sysctl.conf file:
kern.maxfilesperproc=200000
kern.maxfiles=200000
You'll need to log out and log back in, but then the original poster will be able to use ffdf to open his 139352-column data frame.
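After logging back in, you can confirm the new limits from within R before retrying the import (a sketch; the sysctl keys apply to OS X only):

```r
# Linux / OS X: soft per-process open-file limit seen by the current shell
system("ulimit -n")

# OS X only: the kernel-wide ceilings set above
system("sysctl kern.maxfilesperproc kern.maxfiles")
```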
I've posted more about my run-in with this limit here.