Following up on my answered question: R or Python - loop the test data - Prediction validation next 24 hours (96 values each day)
I want to predict the next day using the H2O package. You can find a detailed explanation of my dataset at the link above.
The data structure in H2O is different.
So, after making the prediction, I want to calculate the MAPE.
I have to convert the training and testing data to H2O format:
train_h2o <- as.h2o(train_data)
test_h2o <- as.h2o(test_data)
mape_calc <- function(sub_df) {
pred <- predict.glm(glm_model, sub_df)
actual <- sub_df$Ptot
mape <- 100 * mean(abs((actual - pred)/actual))
new_df <- data.frame(date = sub_df$date[[1]], mape = mape)
return(new_df)
}
# LIST OF ONE-ROW DATAFRAMES
df_list <- by(test_data, test_data$date, mape_calc)
# FINAL DATAFRAME
final_df <- do.call(rbind, df_list)
The code above works well for non-H2O prediction validation for the day ahead, and it calculates the MAPE for every day.
I tried to convert the H2O model to a normal R format, but according to https://stackoverflow.com/a/39221269/9341589 that is not possible.
To make a prediction in H2O:
For instance, let's say we want to create a random forest model:
y <- "RealPtot" #target
x <- names(train_h2o) %>% setdiff(y) #features
rforest.model <- h2o.randomForest(y=y, x=x, training_frame = train_h2o, ntrees = 2000, mtries = 3, max_depth = 4, seed = 1122)
Then we can get the prediction for the complete dataset as shown below.
predict.rforest <- as.data.frame(h2o.predict(rforest.model, test_h2o))
But in my case I am trying to get a one-day prediction using mape_calc.
NOTE: Any thoughts in R or Python will be appreciated.
UPDATE 2 (reproducible example): Following @Darren Cook's steps:
I have provided a simpler example using the Boston housing dataset.
library(tidyverse)
library(h2o)
h2o.init(ip="localhost",port=54322,max_mem_size = "128g")
data(Boston, package = "MASS")
names(Boston)
[1] "crim" "zn" "indus" "chas" "nox" "rm" "age" "dis" "rad" "tax" "ptratio"
[12] "black" "lstat" "medv"
set.seed(4984)
#Added 15 minute Time and date interval
Boston$date<- seq(as.POSIXct("01-09-2017 03:00", format = "%d-%m-%Y %H:%M",tz=""), by = "15 min", length = 506)
#select first 333 values to be trained and the rest to be test data
train = Boston[1:333,]
test = Boston[334:506,]
#Dropped the date and time
train_data_finialized <- subset(train, select=-c(date))
test_data_finialized <- test
#Converted the dataset to h2o object.
train_h2o<- as.h2o(train_data_finialized)
#test_h2o<- as.h2o(test)
#Select the target and feature variables for h2o model
y <- "medv" #target
x <- names(train_data_finialized) %>% setdiff(y) #feature variables
# Number of CV folds (to generate level-one data for stacking)
nfolds <- 5
# Replaced the RF model with a GBM because GBM runs faster
# Train & Cross-validate a GBM
my_gbm <- h2o.gbm(x = x,
y = y,
training_frame = train_h2o,
nfolds = nfolds,
fold_assignment = "Modulo",
keep_cross_validation_predictions = TRUE,
seed = 1)
mape_calc <- function(sub_df) {
p <- h2o.predict(my_gbm, as.h2o(sub_df))
pred <- as.vector(p)
actual <- sub_df$medv
mape <- 100 * mean(abs((actual - pred)/actual))
new_df <- data.frame(date = sub_df$date[[1]], mape = mape)
return(new_df)
}
# LIST OF ONE-ROW DATAFRAMES
df_list <- by(test_data_finialized, test_data_finialized$date, mape_calc)
final_df <- do.call(rbind, df_list)
This is the error I am getting now:
Error in .h2o.doSafeREST(h2oRestApiVersion = h2oRestApiVersion, urlSuffix = page, :
ERROR MESSAGE:
Provided column type POSIXct is unknown. Cannot proceed with parse due to invalid argument.
H2O is running in a separate process to R (whether H2O is on the local server or in a distant data centre). The H2O data and the H2O models are kept in that H2O process, and cannot be seen by R.
What dH <- as.h2o(dR) does is copy an R data frame, dR, into H2O's memory space. The dH is then an R variable that describes the H2O data frame. I.e. it is a pointer, or a handle; it is not the data itself.
What dR <- as.data.frame(dH) does is copy the data from the H2O process's memory into the R process's memory. (as.vector(dH) does the same for when dH describes a single column.)
So, the simplest way to modify your mape_calc(), assuming that sub_df is an R data frame, is to change the first two lines as follows: upload sub_df to H2O, and give that to h2o.predict(). Then use as.vector() to download the prediction that was made.
This was relative to your original code. So keep the original version of your by() call on the R data frame test_data, i.e. don't use by() directly on test_h2o.
UPDATE based on edited question:
I made two changes to your example code. First, I removed the date column from sub_df. That was what was causing the error message.
The second change was just to simplify the return type; not important, but you ended up with the date column duplicated, before.
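The two changes described above might look roughly like the following. This is a sketch under assumptions, not the exact listing: the names h2o_predict_vec and predict_fn are hypothetical, the target column is assumed to be medv as in the question's Boston example, and the H2O call is wrapped in a helper that is only defined here (calling it needs the h2o package and a running cluster). A stand-in predictor is used so the sketch runs on its own.

```r
# Hypothetical helper: upload an R data frame to H2O, predict, and download
# the predictions as a plain R vector. Only defined here, not called,
# because it needs the h2o package and a running H2O cluster.
h2o_predict_vec <- function(model, df) {
  as.vector(h2o::h2o.predict(model, h2o::as.h2o(df)))
}

# mape_calc with the two changes: the date column is dropped before the
# data is handed to the predictor, and only the MAPE value is returned.
mape_calc <- function(sub_df, predict_fn) {
  features <- sub_df[, setdiff(names(sub_df), "date"), drop = FALSE]
  pred <- predict_fn(features)
  actual <- sub_df$medv
  100 * mean(abs((actual - pred) / actual))
}

# Stand-in data and predictor so the sketch runs without H2O.
toy <- data.frame(
  date = as.POSIXct("2017-09-01 03:00", tz = "UTC") + c(0, 900),
  medv = c(10, 20),
  rm   = c(6.1, 5.9)
)
fake_predict <- function(df) c(11, 18)  # pretend model output
print(mape_calc(toy, fake_predict))
```

With the question's objects, the by() call would pass the real predictor through its extra arguments, e.g. by(test_data_finialized, test_data_finialized$date, mape_calc, predict_fn = function(d) h2o_predict_vec(my_gbm, d)).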
ASIDE: h2o.predict() is most efficient when working on a batch of data to make predictions on. Putting h2o.predict() inside a loop is a code smell. You would be better to call h2o.predict(rforest.model, test_h2o) once, outside the loop, then download the predictions into R, and cbind them to test_data, and then use by on that combined data.
UPDATE: Here is your example changed to work that way: (I've added the prediction as an extra column to the test data; there are other ways to do it, of course)
You should notice that it runs much quicker.
ADDITIONAL UPDATE: by() works by grouping same values of your 2nd argument, and processing them together. As all your timestamps are different, you are processing one row at a time.
Look into the xts library, and e.g. apply.daily() to group timestamps. But for the simple case of wanting to process by date, there is a simple hack. Change the second argument of your by() line to as.Date(test_data_finialized$date). Using as.Date() will strip off the times. Therefore all the rows on the same day now look the same and get processed together.
ASIDE 2: You would get better responses if you make the infamous minimal example. Then people can run your code, and they can test their answers. It is also often better to use a simple data set everyone has, e.g. iris, rather than your own data. (You can do regression on any of the first 4 fields; using iris does not have to always be about predicting the species.)
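Returning to the as.Date() grouping from the ADDITIONAL UPDATE: the small base-R sketch below shows the difference between grouping on raw timestamps and grouping on dates. The data is made up for illustration, and the UTC time zone is an assumption; depending on the R version, as.Date() on a POSIXct value may default to UTC rather than your local zone, so passing tz explicitly avoids surprises around midnight.

```r
# Eight 15-minute readings that straddle midnight (made-up data; UTC is an
# assumption so that the day boundary is unambiguous).
stamps <- seq(as.POSIXct("2017-09-01 23:00", tz = "UTC"),
              by = "15 min", length.out = 8)
df <- data.frame(date = stamps, value = 1:8)

# Grouping on the raw timestamps: every row ends up in its own group.
per_stamp <- by(df, df$date, nrow)

# Grouping on the calendar date: rows from the same day are processed together.
per_day <- by(df, as.Date(df$date, tz = "UTC"), nrow)

print(length(per_stamp))  # number of groups when keyed by timestamp
print(per_day)            # rows per calendar day
```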
ASIDE 3: You can do MAPE completely inside H2O, as the abs() and mean() functions will work directly on H2O data frames (as do lots of other things - see the H2O manual): https://stackoverflow.com/a/43103229/841830 (I'm not marking this as a duplicate, as your question was how to adapt by() for use with H2O data frames, not how to calculate MAPE efficiently!)
It looks like you are mixing up R and H2O data types. Remember H2O's R is simply an R API and is not the same as native R. This means that you can't apply an R function that expects an R dataframe to an H2OFrame. And likewise you can't apply an H2O function to an R dataframe when it expects an H2OFrame.
As you can see from the R docs on by, it's a function that expects "an R object, normally a data frame, possibly a matrix", so you can't pass in an H2O frame.
Similarly you are passing date = H2OFrame to data.frame().
However you can use as.data.frame() to convert an H2OFrame to an R dataframe and then go about your calculations entirely in R.