RStudio was crashing when I tried to reshape a particular data frame using dcast (from the reshape2 package). I discovered that the crash was actually happening in R itself, so I ran my casting code in R.app and got the type of error that gives this site its name: Error: segfault from C stack overflow. With the help of Google and SO, I learned that this is a memory access error.
Okay, I got that far, but I don't know where to go from here. I can't provide a true reproducible example, because my data frame has about 558,000 rows and the problem doesn't occur on small toy examples. For example, even if I take, say, a 50,000-row subset of the data, dcast works just fine. Could there be a particular row of data that's causing a problem? If so, can anyone suggest what feature(s) to look for that could be causing the type of error I'm getting?
Here is a subset of the data frame I'm casting from (with fake values for some variables), followed by the casting function I'm using. I've also included this small snippet of data as dput output below, in case it would be helpful to play around with it. The real data set has about 700 values of prog, 15 values of prog1, and 5 values of fa.type.
  id        term   yr    nslds acad.lev    prog            prog1 fa.type amount
1  1   Fall 2009 2010 Graduate Graduate  loan 1      Other Loans    Loan   5000
2  1 Spring 2010 2010 Graduate Graduate  loan 1      Other Loans    Loan   5000
3  2   Fall 2009 2010 Graduate Graduate  loan 2    Stafford Loan    Loan   8781
4  2 Spring 2010 2010 Graduate Graduate  loan 2    Stafford Loan    Loan   8781
5  3   Fall 2007 2008 Graduate Graduate  loan 3    Stafford Loan    Loan   4250
6  3   Fall 2007 2008 Graduate Graduate grant 1 University Grant   Grant   1707
fa.wide = dcast(id + term + yr + nslds + acad.lev ~ prog1 + fa.type , data=fa, value.var="amount", fun.aggregate=sum)
fa = structure(list(id = c(1, 1, 2, 2, 3, 3), term = structure(c(7L,
8L, 7L, 8L, 1L, 1L), .Label = c("Fall 2007", "Spring 2008", "Summer 2008",
"Fall 2008", "Spring 2009", "Summer 2009", "Fall 2009", "Spring 2010",
"Summer 2010", "Fall 2010", "Spring 2011", "Summer 2011", "Fall 2011",
"Spring 2012", "Summer 2012", "Fall 2012", "Spring 2013"), class = c("ordered",
"factor")), yr = c(2010L, 2010L, 2010L, 2010L, 2008L, 2008L),
nslds = structure(c(7L, 7L, 7L, 7L, 7L, 7L), .Label = c("1st Year, Never Attended",
"1st Year, Previously Attended", "2nd Year", "3rd Year",
"4th Year", "5th Year+", "Graduate"), class = c("ordered",
"factor")), acad.lev = structure(c(6L, 6L, 6L, 6L, 6L, 6L
), .Label = c("Freshman", "Sophomore", "Junior", "Senior",
"PB Undergrad", "Graduate"), class = c("ordered", "factor"
)), prog = c("loan 1", "loan 1", "loan 2", "loan 2", "loan 3",
"grant 1"), prog1 = c("Other Loans", "Other Loans", "Stafford Loan",
"Stafford Loan", "Stafford Loan", "University Grant"), fa.type = structure(c(3L,
3L, 3L, 3L, 3L, 2L), .Label = c("Athletic", "Grant", "Loan",
"Scholarship", "Waiver", "Work/Study"), class = "factor"),
amount = c(5000, 5000, 8781, 8781, 4250, 1707)), .Names = c("id",
"term", "yr", "nslds", "acad.lev", "prog", "prog1", "fa.type",
"amount"), row.names = c(NA, 6L), class = "data.frame")
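Since the crash only shows up on the full data, it may help to gauge how large the cast would be before running it. The following is a minimal base-R sketch, not part of the original post: it counts the distinct combinations on each side of the dcast formula. The data frame fa_small is a simplified stand-in for the fa sample (plain character columns instead of factors, prog omitted), and the names row_vars, col_vars, n_row_groups and n_col_groups are illustrative only.

```r
# Simplified stand-in for the `fa` sample (assumption: plain character
# columns instead of the ordered factors in the dput output).
fa_small <- data.frame(
  id       = c(1, 1, 2, 2, 3, 3),
  term     = c("Fall 2009", "Spring 2010", "Fall 2009",
               "Spring 2010", "Fall 2007", "Fall 2007"),
  yr       = c(2010L, 2010L, 2010L, 2010L, 2008L, 2008L),
  nslds    = "Graduate",
  acad.lev = "Graduate",
  prog1    = c("Other Loans", "Other Loans", "Stafford Loan",
               "Stafford Loan", "Stafford Loan", "University Grant"),
  fa.type  = c("Loan", "Loan", "Loan", "Loan", "Loan", "Grant"),
  amount   = c(5000, 5000, 8781, 8781, 4250, 1707),
  stringsAsFactors = FALSE
)

# Variables on the left- and right-hand side of the dcast formula.
row_vars <- c("id", "term", "yr", "nslds", "acad.lev")
col_vars <- c("prog1", "fa.type")

# Distinct key combinations: these become the rows and columns of the cast.
n_row_groups <- nrow(unique(fa_small[, row_vars]))
n_col_groups <- nrow(unique(fa_small[, col_vars]))

cat("row groups:", n_row_groups, "\n")
cat("column groups:", n_col_groups, "\n")
cat("cells in the wide result:", n_row_groups * n_col_groups, "\n")
```

Running the same counts on the full 558,000-row data frame and on a 50,000-row subset that works would show whether the failure tracks the number of groups rather than any single row.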
This isn't an answer, but a simple (nonsensical) reproducible example that wouldn't fit in the comments. You can recreate this error with this simple example (on my MacBook Pro).
The error occurs at the boundary n = 1448, i.e. it doesn't occur when n = 1447 and below. It seems that the error is coming from split_indices in split-numeric.c from the package plyr. It could have to do with the fact that the number of grouping levels is assigned to an (unsigned?) integer value, and if the number of groups goes over 32767 it causes a memory access error, but TBH I'm clutching at straws now.
My sessionInfo(), in case anyone can't recreate this error, is:
Interestingly, if I run the df2 <- command again after getting the first error, R crashes out completely and I get an OS-generated error report. I include the relevant portion of the crash log here:
I'm having the same problem pivoting a long table to a wide one using dcast in the reshape2 package. I found a solution in the post "plyr split_indices function crashes for long vectors". Specifically, you can download split_numeric.c and loop-apply.c from https://github.com/hadley/plyr/tree/master/src, uninstall the plyr package from the R console, and finally reinstall the package locally: install.packages('/path/to/source', repos=NULL, type='source').
This solved my problem; I hope it helps.
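The uninstall-and-reinstall steps can be sketched in R as below. This is only an illustration under assumptions: reinstall_plyr_from_source is a hypothetical helper, not part of plyr or reshape2, and it assumes the downloaded C files have already been placed in the src directory of a local copy of the plyr source. It defaults to a dry run that only returns the commands, so nothing is removed by accident. Installing from source also needs a working compiler toolchain.

```r
# Hypothetical helper (not part of any package): reinstall plyr from a
# local source directory that already contains the updated C files.
reinstall_plyr_from_source <- function(src_dir, dry_run = TRUE) {
  stopifnot(is.character(src_dir), length(src_dir) == 1L)

  # The two commands described in the answer, as text.
  steps <- c(
    'remove.packages("plyr")',
    sprintf('install.packages(%s, repos = NULL, type = "source")',
            shQuote(src_dir))
  )

  # Dry run: report what would be executed and change nothing.
  if (dry_run) {
    return(steps)
  }

  if (!dir.exists(src_dir)) {
    stop("source directory not found: ", src_dir)
  }

  remove.packages("plyr")                                 # uninstall current plyr
  install.packages(src_dir, repos = NULL, type = "source") # build local copy
  invisible(steps)
}

# Dry run with the placeholder path from the answer.
steps <- reinstall_plyr_from_source("/path/to/source")
print(steps)
```

Calling the helper with dry_run = FALSE performs the actual reinstall; restart R afterwards so the rebuilt plyr is the one that gets loaded.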