I have a very large matrix I'm trying to run through glmnet on a server with plenty of memory. It works fine even on very large data sets up to a certain point, after which I get the following error:
Error in elnet(x, ...) : long vectors (argument 5) are not supported in .C
If I understand correctly this is caused by a limitation in R which cannot have any vector with length longer than INT_MAX. Is that correct? Are there any available solutions to this that don't require a complete rewrite of glmnet? Do any of the alternative R interpreters (Riposte, etc) address this limitation?
Thanks!
R has supported long vectors since version 3.0.0. A long vector is indexed by double. A long vector can be the base for a matrix or a higher-dimensional array as long as each dimension is small enough to be indexable by an integer. Long vectors cannot be passed to native code via .C or .Fortran. The error message you are getting appears because a long vector is being passed via .C.
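To make the size condition concrete, here is a minimal sketch in C that checks whether a matrix with integer-sized dimensions would still need a long vector underneath. The function names are hypothetical and are not part of R or glmnet; the sketch assumes a platform where INT_MAX is 2^31 - 1, which is the usual case.

```c
#include <assert.h>
#include <limits.h>
#include <stdint.h>

/* Total number of elements of an nrow x ncol matrix, computed in 64 bits
 * so that the product cannot overflow a 32-bit int.
 * (Hypothetical helper, for illustration only.) */
int64_t matrix_length(int nrow, int ncol)
{
    assert(nrow >= 0 && ncol >= 0);
    return (int64_t)nrow * (int64_t)ncol;
}

/* Returns 1 if the vector underlying the matrix would have more than
 * INT_MAX elements (i.e. would be a long vector), 0 otherwise. */
int needs_long_vector(int nrow, int ncol)
{
    return matrix_length(nrow, ncol) > (int64_t)INT_MAX;
}
```

For example, a 50000 x 50000 matrix has 2.5e9 elements: each dimension fits in an integer, but the underlying vector is long, so it cannot go through .C or .Fortran.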
Long vectors can be passed via .Call. So, as long as the native code of glmnet supports long vectors (64-bit indexes), or can be modified and recompiled to support them, one would only have to modify the interface between R and the native code of glmnet. You can do this manually in C, and there is also a new package named dotCall64 for this task. Part of modifying the interface is deciding when to copy arguments: .C/.Fortran copy their arguments defensively, but you don't want to do this unnecessarily with large data structures.
I think the difficulty of changing the native code of glmnet to support 64-bit indexes depends on the actual code (which I have only looked at, never worked with). It is easy to switch all integers in Fortran code (whether explicitly or implicitly 32-bit) to 64-bit. The trouble comes when some integers have to stay 32-bit, and this will happen, for example, for integer vectors passed to or from R code, because R uses 32-bit integers (even as the elements of long vectors). glmnet does pass such integer vectors. How hard the modification is then depends on how clean the original Fortran code is (e.g. whether it uses separate integer variables for indexing and for holding values of integer arrays).
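The split between 64-bit indexes and 32-bit integer values can be sketched in C as follows. This is only an illustration of the idea, not glmnet's code: the function names and the 0-based, column-major layout are assumptions made for the example.

```c
#include <assert.h>
#include <stdint.h>

/* Offset of element (i, j), both 0-based, in a column-major matrix with
 * nrow rows. i, j and nrow stay 32-bit, as R would pass them; only the
 * offset is widened to 64 bits, and the widening happens before the
 * multiplication so the product cannot overflow. */
int64_t colmajor_offset(int32_t i, int32_t j, int32_t nrow)
{
    assert(i >= 0 && j >= 0 && i < nrow);
    return (int64_t)j * (int64_t)nrow + (int64_t)i;
}

/* Sum the entries of column j of x at the rows listed in idx.
 * idx is a 32-bit integer vector (0-based row numbers), the kind of
 * array that must keep its 32-bit element type even when x is long. */
double sum_rows_in_column(const double *x, int32_t nrow, int32_t j,
                          const int32_t *idx, int32_t n_idx)
{
    double total = 0.0;
    for (int32_t k = 0; k < n_idx; k++)
        total += x[colmajor_offset(idx[k], j, nrow)];
    return total;
}
```

The point of keeping the offset computation in its own function is that the 64-bit arithmetic lives in one place, while the values read from integer arrays remain 32-bit. Code that reuses one integer variable for both roles is harder to convert.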
Experimental implementations of subsets of R, like Riposte, will not help.
There is a note in ?"long vector" which states:
However, compiled code typically needs quite extensive changes. Note
that the .C and .Fortran interfaces do not accept long vectors, so
.Call (or similar) has to be used.
elnet makes .Fortran calls. You would have to modify the function to use .Call, perhaps via a C wrapper that calls the FORTRAN code, and possibly rewrite and compile the relevant FORTRAN code to deal with long vectors.