Trouble with predict function in R [duplicate]

Posted 2019-09-13 23:29

I loaded the built-in R dataset 'women', a table of average heights and corresponding weights for American women. The table has 15 rows. Using this data, I am trying to predict the weight for specific values of height. I first fitted a linear model and then passed new values to predict. But R still returns the 15 values from the original data.

I am a beginner in regression so please tell me if I am doing anything wrong here.

> data()
> women<-data.frame(women)
> names(women)
[1] "height" "weight"
> plot(women$weight~women$height)
> model<-lm(women$weight~women$height,data=women)
> new<-data.frame(height=c(82,83,84,85))
> wgt.prediction<-predict(model,new)
Warning message:
'newdata' had 4 rows but variables found have 15 rows 
> wgt.prediction
       1        2        3        4        5        6        7        8 
112.5833 116.0333 119.4833 122.9333 126.3833 129.8333 133.2833 136.7333 
       9       10       11       12       13       14       15 
140.1833 143.6333 147.0833 150.5333 153.9833 157.4333 160.8833 

2 Answers
来,给爷笑一个
#2 · 2019-09-14 00:07
# example dataset
dt = data.frame(mtcars)

# build 2 models
m1 = lm(mpg ~ wt, data = dt)
m2 = lm(dt$mpg ~ dt$wt, data = dt)

# new data (to predict)
dt_new = data.frame(wt = c(3.1, 3.5, 4.2))

# check if predictions work
predict(m1, dt_new)
predict(m2, dt_new)

The first predict works because the model's predictor (independent variable) is wt, and the new data also contains a column named wt.

The second predict does not work as intended because the model's predictor is dt$wt, so every time the model goes back to dt to get the variable wt. In fact, no matter what your new dataset looks like, the model will predict using dt$wt, returning one value per row of dt along with a warning.
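One way to check this for yourself is sketched below. It reuses the names from the code above (m1, m2, dt_new); p1 and p2 are names introduced here only for illustration. The warning from the second call is suppressed so the lengths can be compared quietly.

```r
# Sketch: predict() ignores newdata when the formula was written with `$`.
dt <- data.frame(mtcars)
m1 <- lm(mpg ~ wt, data = dt)
m2 <- lm(dt$mpg ~ dt$wt, data = dt)
dt_new <- data.frame(wt = c(3.1, 3.5, 4.2))

p1 <- predict(m1, dt_new)                    # uses the wt column of dt_new
p2 <- suppressWarnings(predict(m2, dt_new))  # falls back to dt$wt

length(p1)  # 3, one prediction per row of dt_new
length(p2)  # 32, one value per row of dt

# The "predictions" from m2 are just the fitted values on the original data.
all.equal(unname(p2), unname(fitted(m2)))
```

If the last line returns TRUE, the second call never looked at dt_new at all.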

迷人小祖宗
#3 · 2019-09-14 00:16

Note that extrapolating outside the range of the original data can give poor predictions. Setting that aside, try the following.

First, it is not necessary to call data() or data.frame(). women is available to you anyway, and it is already a data frame.

Also, the model's independent variable was specified in the question as women$height, but the new data specified it as height. predict does not know that women$height and height are the same.
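The mismatch can be inspected directly. The following is a minimal sketch, with model_dollar, model_plain, labels_dollar and labels_plain being names chosen here for illustration; it prints the predictor name each fitted model stores, which is the name predict looks for in the new data.

```r
# Sketch: compare the predictor names stored by the two formulations.
model_dollar <- lm(women$weight ~ women$height)    # formula from the question
model_plain  <- lm(weight ~ height, data = women)  # formula without `$`

labels_dollar <- attr(terms(model_dollar), "term.labels")
labels_plain  <- attr(terms(model_plain), "term.labels")

labels_dollar  # "women$height"
labels_plain   # "height"
```

A new data frame with a column called height matches only the second model.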

Replace all your code with this:

fo <- weight ~ height
model <- lm(fo, women)
heights <- c(82, 83, 84, 85)
weights <- predict(model, data.frame(height = heights))

giving:

> weights
       1        2        3        4 
195.3833 198.8333 202.2833 205.7333 
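To make the extrapolation caveat from the start of this answer concrete, here is a small sketch. The names fit, new_heights, outside and pred_int are introduced only for this example. It checks whether the requested heights fall outside the observed range and asks predict for prediction intervals alongside the point estimates.

```r
# Sketch: flag extrapolation and attach prediction intervals.
fit <- lm(weight ~ height, data = women)
new_heights <- data.frame(height = c(82, 83, 84, 85))

# Observed range of the predictor in the training data
range(women$height)

# TRUE for every new height that lies outside that range
outside <- new_heights$height < min(women$height) |
  new_heights$height > max(women$height)
outside

# Matrix with columns fit, lwr, upr (95% prediction interval by default)
pred_int <- predict(fit, new_heights, interval = "prediction")
pred_int
```

All four heights lie above the tallest observation, so the intervals only reflect uncertainty under the assumption that the straight-line relationship keeps holding that far out.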

To plot the data with the predictions (i.e. with weights) and the regression line defined by model (continued after graph):

plot(fo, women, xlim = range(c(height, heights)), ylim = range(c(weight, weights)))
points(weights ~ heights, col = "red", pch = 20)
abline(model)

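[Figure: scatter plot of weight against height for the women data, with the regression line and the four predicted points in red]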

Although normally one uses predict, given the problem introduced by using $ in the formula, an alternative using your original formulation would be to calculate the predictions like this:

model0 <- lm(women$weight ~ women$height)
cbind(1, 82:85) %*% coef(model0)

giving:

         [,1]
[1,] 195.3833
[2,] 198.8333
[3,] 202.2833
[4,] 205.7333
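As a quick sanity check, the hand calculation can be compared with predict on the model fitted without $. This is only a sketch; by_hand and by_predict are names made up for the comparison.

```r
# Sketch: the matrix product and predict() should agree.
model0 <- lm(women$weight ~ women$height)    # original formulation
model  <- lm(weight ~ height, data = women)  # formulation without `$`

# Intercept column of 1s next to the new heights, times the coefficients
by_hand <- drop(cbind(1, 82:85) %*% coef(model0))

by_predict <- unname(predict(model, data.frame(height = 82:85)))

all.equal(by_hand, by_predict)
```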