I loaded the built-in R dataset 'women', a table of average heights and corresponding weights of American women. The table has 15 rows. Using this data, I am trying to predict the weight for specific values of height. I fitted a linear model first and then gave it new values to predict, but R still returns the 15 values from the original data.
I am a beginner in regression, so please tell me if I am doing anything wrong here.
> data()
> women<-data.frame(women)
> names(women)
[1] "height" "weight"
> plot(women$weight~women$height)
> model<-lm(women$weight~women$height,data=women)
> new<-data.frame(height=c(82,83,84,85))
> wgt.prediction<-predict(model,new)
Warning message:
'newdata' had 4 rows but variables found have 15 rows
> wgt.prediction
       1        2        3        4        5        6        7        8        9       10       11       12       13
112.5833 116.0333 119.4833 122.9333 126.3833 129.8333 133.2833 136.7333 140.1833 143.6333 147.0833 150.5333 153.9833
      14       15
157.4333 160.8833
Note that extrapolating predictions outside the range of the original data can give poor answers; setting that aside, try the following.
First, it is not necessary to use data() or data.frame. The women dataset will be available to you anyway, and it is already a data frame.
Also, the model's independent variable was specified in the question as women$height, but the new data passed to predict names it height. predict does not know that women$height and height are the same.
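One way to see the mismatch is to compare the term labels stored in each fitted model with the column names of the new data. This is only an illustrative sketch; the names bad, good, bad_term, good_term and new are made up for the example.

```r
# Fit the model both ways: with $ in the formula, and with plain column names.
bad  <- lm(women$weight ~ women$height)     # formula as written in the question
good <- lm(weight ~ height, data = women)   # column names only

# The term label is the literal text of the predictor in the formula.
bad_term  <- attr(terms(bad),  "term.labels")
good_term <- attr(terms(good), "term.labels")
print(bad_term)   # "women$height"
print(good_term)  # "height"

# predict() can only use new data for a term whose name it finds there.
new <- data.frame(height = c(82, 83, 84, 85))
print(good_term %in% names(new))  # TRUE
print(bad_term  %in% names(new))  # FALSE
```

Because no column of new is called women$height, the first model falls back to the original 15 heights, which is what the warning in the question reports.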
Replace all your code with this:
fo <- weight ~ height
model <- lm(fo, women)
heights <- c(82, 83, 84, 85)
weights <- predict(model, data.frame(height = heights))
giving:
> weights
       1        2        3        4
195.3833 198.8333 202.2833 205.7333
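Regarding the extrapolation caveat at the start of this answer, a rough way to gauge it is to request prediction intervals and compare their widths inside and outside the observed heights. This is a sketch, not part of the original code; the names new, pred and width are illustrative, and the intervals assume the straight-line model is still valid beyond the data, which is precisely what extrapolation cannot guarantee.

```r
model <- lm(weight ~ height, data = women)

# Observed heights run from 58 to 72, so 65 is inside the data
# while 82 and 85 are well outside it.
print(range(women$height))
new <- data.frame(height = c(65, 82, 85))

# interval = "prediction" returns columns fit, lwr and upr.
pred  <- predict(model, new, interval = "prediction")
width <- pred[, "upr"] - pred[, "lwr"]
print(cbind(new, pred, width))
```

The interval widens as the new height moves away from the centre of the observed data.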
To plot the data with the predictions (i.e. with weights) and the regression line defined by model (continued after graph):
plot(fo, women, xlim = range(c(height, heights)), ylim = range(c(weight, weights)))
points(weights ~ heights, col = "red", pch = 20)
abline(model)
Although normally one uses predict, given the problem introduced by using $ in the formula, an alternative using your original formulation would be to calculate the predictions like this:
model0 <- lm(women$weight ~ women$height)
cbind(1, 82:85) %*% coef(model0)
giving:
         [,1]
[1,] 195.3833
[2,] 198.8333
[3,] 202.2833
[4,] 205.7333
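As a sanity check, the manual calculation can be compared with predict on a model fitted without $. The sketch below uses the illustrative names by_hand and by_predict; the matrix product is simply intercept + slope * height for each new height.

```r
model0 <- lm(women$weight ~ women$height)     # original formulation
model  <- lm(weight ~ height, data = women)   # formulation that works with predict

heights <- 82:85

# Column of 1s multiplies the intercept; the heights multiply the slope.
by_hand    <- as.vector(cbind(1, heights) %*% coef(model0))
by_predict <- unname(predict(model, data.frame(height = heights)))

print(all.equal(by_hand, by_predict))  # TRUE
```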
# example dataset
dt = data.frame(mtcars)
# build 2 models
m1 = lm(mpg ~ wt, data = dt)
m2 = lm(dt$mpg ~ dt$wt, data = dt)
# new data (to predict)
dt_new = data.frame(wt = c(3.1, 3.5, 4.2))
# check if predictions work
predict(m1, dt_new)
predict(m2, dt_new)
The first predict will work because the model's independent variable is wt and the new data have a variable named wt as well.
The second predict will not give predictions for the new data, because the model's independent variable is dt$wt, so every time the model will go back to dt to get the variable wt. In fact, no matter what your new dataset looks like, the model will try to predict using dt$wt.
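A quick way to confirm this is to compare the lengths of the two results. This sketch reuses the objects defined above; p1 and p2 are illustrative names, and suppressWarnings is used only to hide the "'newdata' had 3 rows but variables found have 32 rows" warning.

```r
dt     <- data.frame(mtcars)
m1     <- lm(mpg ~ wt, data = dt)
m2     <- lm(dt$mpg ~ dt$wt, data = dt)
dt_new <- data.frame(wt = c(3.1, 3.5, 4.2))

p1 <- predict(m1, dt_new)
p2 <- suppressWarnings(predict(m2, dt_new))

print(length(p1))  # 3, one prediction per row of dt_new
print(length(p2))  # 32, one per row of the original dt

# The second result is just the fitted values for the original data.
print(all.equal(unname(p2), unname(fitted(m2))))  # TRUE
```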