I've read other postings regarding named variables and tried implementing the answers, but predict() still returns too many values when I run my existing model on new data. Here is working example code:
set.seed(123)
mydata <- data.frame("y"=rnorm(100,mean=0, sd = 1),"x"=c(1:100))
mylm <- lm(y ~ x, data=mydata)
# ok, so mylm is a model on 100 points - let's look at it and the data
par(mfrow=c(2,2))
plot(mylm)
par(mfrow=c(1,1))
predvals <- predict(mylm, data=mydata)
plot(mydata$x,mydata$y)
lines(predvals)
No surprises here - a straight line through generated points - both 100 observations in length. Now I generate 20 points of new data with the exact same names and when I run the new data through predict() I expect to get 20 points and instead I get 100. What am I missing! Driving me crazy....
newdata <- data.frame("y"=rnorm(20,mean=0, sd = 1), "x"=c(1:20))
predvals <- predict(mylm, data=newdata)
length(newdata$y)
length(predvals)
# quick -not elegant - way to look at it:
plot(predvals)
lines(newdata$x,newdata$y)
Do I need to tell predict() to only use 20 points or something like that?
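Your issue is in
predvals <- predict(mylm, data=newdata)
The correct call is
predict(mylm, newdata=newdata)
The predict() function in R takes a named argument newdata, not data. Because data is not an argument of predict.lm(), it is silently absorbed by ... and ignored, so newdata is treated as missing and predict() returns the fitted values for the original 100 observations.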
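To see the difference side by side, here is a minimal sketch that rebuilds the model from the question and compares the two calls. The variable names wrong and right are just illustrative, and newdata here carries only the x column, on the assumption that predict() needs only the predictors used in the formula:

```r
set.seed(123)
mydata <- data.frame(y = rnorm(100), x = 1:100)
mylm <- lm(y ~ x, data = mydata)

# New data: only the predictor column x is needed for prediction
newdata <- data.frame(x = 1:20)

# 'data' is not a formal argument of predict.lm(), so it falls into '...'
# and is ignored; the result is the fitted values for the training data
wrong <- predict(mylm, data = newdata)
length(wrong)   # 100

# 'newdata' is the argument predict.lm() actually uses
right <- predict(mylm, newdata = newdata)
length(right)   # 20
```

With the corrected call, plot(newdata$x, right) lines up the 20 predictions against the 20 new x values.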