Predict.lm in R fails to recognize newdata

Posted 2019-01-20 19:47

Question:

I'm running a linear regression in which the predictor's slope depends on a categorical variable, and I'm having trouble generating predicted responses for newdata.

First, I generate some random values for the predictor and the error terms. I then construct the response. Note that the predictor's coefficient depends on the value of a categorical variable. I compose a design matrix based on the predictor and its category.

set.seed(1)

category = c(rep("red", 5), rep("blue",5))
x1 = rnorm(10, mean = 1, sd = 1)
err = rnorm(10, mean = 0, sd = 1)

y = ifelse(category == "red", x1 * 2, x1 * 3)
y = y + err

df = data.frame(x1 = x1, category = category)

dm = as.data.frame(model.matrix(~ category + 0, data = df))
dm = dm * df$x1

fit = lm(y ~ as.matrix(dm) + 0, data = df)

# This line will not produce a warning
predictOne = predict.lm(fit, newdata = dm)

# This line WILL produce a warning
predictTwo = predict.lm(fit, newdata = dm[1:5,])

The warning is:

'newdata' had 5 rows but variable(s) found have 10 rows

Unless I'm very much mistaken, I shouldn't have any issues with the variable names. (There are one or two discussions on this board which suggest that issue.) Note that the first prediction runs fine, but the second does not. The only change is that the second prediction uses only the first five rows of the design matrix.

Thoughts?

Answer 1:

I'm not 100% sure what you're trying to do, but I think a short walk-through of how formulas work will clear things up for you.

The basic idea is very simple: you pass lm two things, a formula and a data frame. The terms in the formula should all be names of variables in that data frame.

Now, you can get lm to work without following that guideline exactly, but you're just asking for things to go wrong. So stop and look at your model specifications and think about where R is looking for things.

When you call lm, basically none of the names in your formula are actually found in the data frame df, so I suspect that df isn't being used at all.
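One way to check this claim is to compare the variables the formula refers to against the column names of df. The snippet below is a minimal sketch that rebuilds the objects from the question; the names formula_vars and found_in_df are introduced here purely for illustration.

```r
# Rebuild the objects from the question
set.seed(1)
category <- c(rep("red", 5), rep("blue", 5))
x1 <- rnorm(10, mean = 1, sd = 1)
err <- rnorm(10, mean = 0, sd = 1)
y <- ifelse(category == "red", x1 * 2, x1 * 3) + err
df <- data.frame(x1 = x1, category = category)
dm <- as.data.frame(model.matrix(~ category + 0, data = df)) * df$x1
fit <- lm(y ~ as.matrix(dm) + 0, data = df)

# Variables the formula actually refers to
formula_vars <- all.vars(formula(fit))
formula_vars        # "y" "dm"

# Columns available in the data frame passed as data =
names(df)           # "x1" "category"

# Neither formula variable is a column of df
found_in_df <- formula_vars %in% names(df)
found_in_df         # FALSE FALSE
```

Since neither y nor dm is a column of df, both have to be looked up somewhere else, which here means the workspace where they were created.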

Then if you call model.frame(fit) you'll see what R thinks your variables should be called. Notice anything strange?

model.frame(fit)
            y as.matrix(dm).categoryblue as.matrix(dm).categoryred
1   2.2588735                  0.0000000                 0.3735462
2   2.7571299                  0.0000000                 1.1836433
3  -0.2924978                  0.0000000                 0.1643714
4   2.9758617                  0.0000000                 2.5952808
5   3.7839465                  0.0000000                 1.3295078
6   0.4936612                  0.1795316                 0.0000000
7   4.4460969                  1.4874291                 0.0000000
8   6.1588103                  1.7383247                 0.0000000
9   5.5485653                  1.5757814                 0.0000000
10  2.6777362                  0.6946116                 0.0000000

Is there anything called as.matrix(dm).categoryblue in dm? Yeah, I didn't think so.
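To see what predict does with the five-row newdata, the sketch below repeats the question's setup and inspects the result. As far as I can tell, the term as.matrix(dm) needs a variable called dm; newdata has no such column, so the lookup falls back to the full ten-row dm in the enclosing environment. The names pred_five, n_pred and same_as_fitted are illustrative only.

```r
# Rebuild the objects from the question
set.seed(1)
category <- c(rep("red", 5), rep("blue", 5))
x1 <- rnorm(10, mean = 1, sd = 1)
err <- rnorm(10, mean = 0, sd = 1)
y <- ifelse(category == "red", x1 * 2, x1 * 3) + err
df <- data.frame(x1 = x1, category = category)
dm <- as.data.frame(model.matrix(~ category + 0, data = df)) * df$x1
fit <- lm(y ~ as.matrix(dm) + 0, data = df)

# newdata offers categoryblue and categoryred, but nothing named dm
names(dm[1:5, ])

# The warning is suppressed here only to keep the output quiet
pred_five <- suppressWarnings(predict(fit, newdata = dm[1:5, ]))

# Ten predictions come back, not five
n_pred <- length(pred_five)
n_pred

# They match the fitted values for the original ten rows
same_as_fitted <- isTRUE(all.equal(unname(pred_five), unname(fitted(fit))))
same_as_fitted
```

If that reading is right, the first call in the question (newdata = dm) only looked correct because the ignored newdata happened to have the same number of rows as the original dm.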

I suspect (but am not sure) that you meant to do something more like this:

df$y <- y
fit <- lm(y ~ category - 1, data = df)


Answer 2:

Joran is on the right track. The issue relates to column names. What I had wanted to do was create my own design matrix, which, as it happens, I didn't need to do. If I run the model with the following line of code, it's smooth sailing:

fit = lm(y ~ x1:category + 0, data = df)

That formula specification replaces the manual construction of the design matrix.
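As a quick check, here is a minimal sketch of the interaction-formula fit with predictions on both the full data and the first five rows. Putting y inside df is my own tidy-up, following the suggestion in Answer 1, and the names pred_all and pred_five are illustrative.

```r
# Same simulated data as in the question, with y stored in the data frame
set.seed(1)
category <- c(rep("red", 5), rep("blue", 5))
x1 <- rnorm(10, mean = 1, sd = 1)
err <- rnorm(10, mean = 0, sd = 1)
y <- ifelse(category == "red", x1 * 2, x1 * 3) + err
df <- data.frame(y = y, x1 = x1, category = category)

# One slope per category, no intercept
fit <- lm(y ~ x1:category + 0, data = df)

# Every term in the formula is a column of df, so newdata is honoured
pred_all <- predict(fit, newdata = df)
pred_five <- predict(fit, newdata = df[1:5, ])

length(pred_five)   # 5
all.equal(unname(pred_five), unname(pred_all[1:5]))
```

Because x1 and category are looked up in newdata, the five-row call returns exactly five predictions and raises no warning.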

Using my own design matrix is something I had done in the past and the fit parameters and diagnostics were just as they ought to have been. I'd not used the predict function, so had never known that R was discarding the "data = " parameter. A warning would have been cool. R is a harsh mistress.
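The sketch below illustrates why the problem stays hidden until predict is called: the hand-built design matrix and the interaction formula should give the same coefficients, and dropping data = from the manual fit changes nothing. The names fit_manual, fit_no_data and fit_formula are made up for this comparison.

```r
# Same simulated data as in the question
set.seed(1)
category <- c(rep("red", 5), rep("blue", 5))
x1 <- rnorm(10, mean = 1, sd = 1)
err <- rnorm(10, mean = 0, sd = 1)
y <- ifelse(category == "red", x1 * 2, x1 * 3) + err
df <- data.frame(x1 = x1, category = category)
dm <- as.data.frame(model.matrix(~ category + 0, data = df)) * df$x1

# Manual design matrix, with and without data =
fit_manual <- lm(y ~ as.matrix(dm) + 0, data = df)
fit_no_data <- lm(y ~ as.matrix(dm) + 0)

# Interaction formula from this answer
fit_formula <- lm(y ~ x1:category + 0, data = df)

# data = makes no difference to the manual fit
same_without_data <- isTRUE(all.equal(coef(fit_manual), coef(fit_no_data)))

# Coefficient names differ between the two approaches, so compare values only
same_as_formula <- isTRUE(all.equal(unname(coef(fit_manual)),
                                    unname(coef(fit_formula))))
c(same_without_data, same_as_formula)
```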



Tags: r lm predict