I am using the randomForest function from the randomForest package to find the most important variables.
My data frame is called urban and my response variable is revenue, which is numeric.
urban.random.forest <- randomForest(revenue ~ .,y=urban$revenue, data = urban, ntree=500, keep.forest=FALSE,importance=TRUE,na.action = na.omit)
I get the following error:
Error in randomForest.default(m, y, ...) : data (x) has 0 rows
In the source code, the error is tied to the variable x:
n <- nrow(x)
p <- ncol(x)
if (n == 0)
stop("data (x) has 0 rows")
but I cannot understand what x is.
I solved it. Some of my columns held only NA values or a single repeated value. I dropped them and the call worked. My column classes were character, numeric and factor.
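# Collect the indices of numeric columns with a single unique value (this includes all-NA columns)
candidatesnodata.index <- c()
for (j in seq_len(ncol(dataframe))) {
  if (is.numeric(dataframe[, j]) && length(unique(dataframe[, j])) == 1) {
    candidatesnodata.index <- append(candidatesnodata.index, j)
  }
}
# Drop only when something was found: negative indexing with an empty vector would remove every column
if (length(candidatesnodata.index) > 0) {
  dataframe <- dataframe[, -candidatesnodata.index, drop = FALSE]
}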
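A likely reason an all-NA column triggers the "0 rows" error is that na.action = na.omit discards every row containing at least one NA, so a single empty column leaves nothing to fit. Below is a minimal base R sketch of that effect. The data frame toy and its column names are hypothetical and stand in for urban; nothing here calls randomForest itself.

```r
# Toy data frame (hypothetical, not the original `urban` data)
toy <- data.frame(
  revenue = c(10, 20, 30),
  size    = c(1.5, 2.5, 3.5),
  empty   = c(NA_real_, NA_real_, NA_real_)  # column with no data at all
)

# na.omit() drops every row that has at least one NA,
# so one all-NA column removes all rows
rows_left <- nrow(na.omit(toy))

# Flag columns that are entirely NA, regardless of their class
all_na <- vapply(toy, function(col) all(is.na(col)), logical(1))

# Drop the flagged columns first, then omit incomplete rows
toy_clean <- na.omit(toy[, !all_na, drop = FALSE])
rows_left_fixed <- nrow(toy_clean)
```

Because the check uses is.na() rather than is.numeric(), it also catches empty character and factor columns, which the loop above skips.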
I had a similar problem. It stemmed from passing a string version of the call
y ~ x1 + .... xn
to the formula argument of the randomForest call. The simple fix was to convert the string with as.formula().
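As a rough illustration of that fix, the sketch below builds a formula string and converts it. It assumes the base R function as.formula() from the stats package is the conversion meant here; the predictor names are hypothetical placeholders, and the randomForest call is left out so the snippet runs without the package.

```r
# Hypothetical predictor names, used only for illustration
predictors <- c("x1", "x2", "x3")

# Build the formula as a character string
formula_string <- paste("y ~", paste(predictors, collapse = " + "))

# Convert the string into a real formula object
model_formula <- as.formula(formula_string)

# A plain string does not have class "formula"; the converted object does
string_is_formula    <- inherits(formula_string, "formula")
converted_is_formula <- inherits(model_formula, "formula")
```

The converted object can then be passed as the first argument, for example randomForest(model_formula, data = ...), so that the formula interface is used rather than the default one.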
I hope this saves anyone some time!