Running the Random Forest example from http://www.kaggle.com/c/icdar2013-gender-prediction-from-handwriting/data, the following line:
forest_model <- randomForest(as.factor(male) ~ ., data=train, ntree=10000)
takes hours (I am not sure whether it will ever finish, but the process does seem to be working).
The data set has 1128 rows and ~7000 variables.
Is it possible to estimate when the Random Forest training will finish? Can I profile R somehow to get more information?
Found the problem: using the formula interface in randomForest caused a tremendous performance degradation.
More on this, and on how to estimate random forest running time, can be found at https://stats.stackexchange.com/questions/37370/random-forest-computing-time-in-r and at http://www.gregorypark.org/?p=286
Here is the final code:
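The original listing did not survive here, so the following is only a minimal sketch of the idea, not the code from the answer. It passes predictors and response separately through the x/y arguments of randomForest instead of a formula, and times a small forest with system.time to get a rough feel for the cost of a larger one. The synthetic train data frame, its 50 columns, and ntree = 50 are assumptions chosen so the snippet runs quickly; the column name male is taken from the question.

```r
library(randomForest)

# Synthetic stand-in for the Kaggle training set (assumption: the real
# `train` has a 0/1 column `male` plus numeric feature columns).
set.seed(42)
train <- data.frame(male = rbinom(200, 1, 0.5),
                    matrix(rnorm(200 * 50), nrow = 200))

# Split predictors and response instead of passing a formula.
x <- train[, setdiff(names(train), "male")]
y <- as.factor(train$male)

# Time a small forest first; training cost grows roughly in proportion
# to ntree, so this gives a rough basis for extrapolating to a larger run.
timing <- system.time(
  forest_model <- randomForest(x = x, y = y, ntree = 50)
)
print(timing["elapsed"])
```

On the real data, the same call with a small ntree should indicate whether 10000 trees is feasible before committing to the full run.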
One idea for monitoring convergence is to set the do.trace argument, which turns on verbose output while the forest is being built.
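As a rough illustration of do.trace, the sketch below trains on small synthetic data (an assumption, not the Kaggle set). Setting do.trace to an integer prints a progress line every that many trees, including the running out-of-bag error, which makes it possible to see both how fast trees are being added and whether the error has levelled off.

```r
library(randomForest)

# Small synthetic data set, for illustration only.
set.seed(1)
x <- data.frame(matrix(rnorm(100 * 10), nrow = 100))
y <- factor(rbinom(100, 1, 0.5))

# do.trace = 10 prints progress (with OOB error) every 10 trees;
# do.trace = TRUE would print more verbose output instead.
model <- randomForest(x = x, y = y, ntree = 100, do.trace = 10)
```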