I'm running a linear regression where the predictor is categorized by another value and am having trouble generating modeled responses for newdata.
First, I generate some random values for the predictor and the error terms. I then construct the response. Note that the predictor's coefficient depends on the value of a categorical variable. I compose a design matrix based on the predictor and its category.
set.seed(1)
category = c(rep("red", 5), rep("blue",5))
x1 = rnorm(10, mean = 1, sd = 1)
err = rnorm(10, mean = 0, sd = 1)
y = ifelse(category == "red", x1 * 2, x1 * 3)
y = y + err
df = data.frame(x1 = x1, category = category)
dm = as.data.frame(model.matrix(~ category + 0, data = df))
dm = dm * df$x1
fit = lm(y ~ as.matrix(dm) + 0, data = df)
# This line will not produce a warning
predictOne = predict.lm(fit, newdata = dm)
# This line WILL produce a warning
predictTwo = predict.lm(fit, newdata = dm[1:5,])
The warning is:
'newdata' had 5 rows but variable(s) found have 10 rows
Unless I'm very much mistaken, I shouldn't have any issues with the variable names. (There are one or two discussions on this board which suggest that issue.) Note that the first prediction runs fine, but the second does not. The only change is that the second prediction uses only the first five rows of the design matrix.
Thoughts?
I'm not 100% sure what you're trying to do, but I think a short walk-through of how formulas work will clear things up for you.
The basic idea is very simple: you pass two things, a formula and a data frame. The terms in the formula should all be names of variables in your data frame.
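As a minimal sketch of that guideline (the names toy, x and y below are hypothetical and not from the question), here is a fit where every formula term is a column of the data frame, so predict can find the same names in newdata:

```r
# Hypothetical toy data; the names toy, x and y are illustrative only.
set.seed(42)
toy = data.frame(x = 1:10)
toy$y = 2 * toy$x + rnorm(10)

# Every term in the formula (y and x) is a column of toy.
toy_fit = lm(y ~ x, data = toy)

# predict() looks up x in newdata, so three new rows give three predictions.
toy_pred = predict(toy_fit, newdata = data.frame(x = c(11, 12, 13)))
length(toy_pred)
```

Because the formula only mentions x, any data frame with a column named x can serve as newdata, whatever its number of rows.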
Now, you can get lm to work without following that guideline exactly, but you're just asking for things to go wrong. So stop and look at your model specifications and think about where R is looking for things.
When you call lm, basically none of the names in your formula are actually found in the data frame df. So I suspect that df isn't being used at all.
Then if you call model.frame(fit) you'll see what R thinks your variables should be called. Notice anything strange?
Is there anything called as.matrix(dm).categoryblue in dm? Yeah, I didn't think so.
I suspect (but am not sure) that you meant to do something more like this:
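The code that originally followed is not preserved here. One possible reading, offered as an assumption rather than as the answerer's exact code, is to keep the hand-built design matrix but name its columns directly in the formula. The names fit_old, fit_new, dm_y and pred_five are introduced only for this sketch:

```r
# Rebuild the data from the question.
set.seed(1)
category = c(rep("red", 5), rep("blue", 5))
x1 = rnorm(10, mean = 1, sd = 1)
err = rnorm(10, mean = 0, sd = 1)
y = ifelse(category == "red", x1 * 2, x1 * 3) + err
df = data.frame(x1 = x1, category = category)
dm = as.data.frame(model.matrix(~ category + 0, data = df))
dm = dm * df$x1

# The original call: the only predictor R records is the whole matrix
# expression, which is not a column name of dm or df.
fit_old = lm(y ~ as.matrix(dm) + 0, data = df)
names(model.frame(fit_old))
names(dm)

# Assumed fix: put the response next to the design columns and refer to
# those columns by name, so predict() can find them in newdata.
dm_y = cbind(y = y, dm)
fit_new = lm(y ~ categoryblue + categoryred + 0, data = dm_y)

# Predicting on a five-row subset now returns five values.
pred_five = predict(fit_new, newdata = dm[1:5, ])
length(pred_five)
```

Both fits use the same design columns, so the estimated coefficients should agree; only the names R stores for the predictors differ.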
Joran is on the right track. The issue relates to column names. What I had wanted to do was create my own design matrix, something which, as it happens, I didn't need to do. If I run the model with the following line of code, it's smooth sailing:
That formula designation will replace the manual construction of the design matrix.
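The exact line is not preserved in this thread. A plausible sketch, assuming the formula used an interaction between x1 and category with no intercept, would look like this (fit_int and pred_int are hypothetical names):

```r
# Rebuild the data from the question, keeping y inside the data frame.
set.seed(1)
category = c(rep("red", 5), rep("blue", 5))
x1 = rnorm(10, mean = 1, sd = 1)
err = rnorm(10, mean = 0, sd = 1)
y = ifelse(category == "red", x1 * 2, x1 * 3) + err
df = data.frame(y = y, x1 = x1, category = category)

# Assumed formula: one slope for x1 per category, no intercept.
# R builds the per-category columns itself from x1 and category.
fit_int = lm(y ~ x1:category + 0, data = df)
coef(fit_int)

# newdata only needs columns named x1 and category.
pred_int = predict(fit_int, newdata = df[1:5, ])
length(pred_int)
```

Since every term in the formula is a column of df, predict accepts any subset of rows without the row-count warning.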
Using my own design matrix is something I had done in the past and the fit parameters and diagnostics were just as they ought to have been. I'd not used the predict function, so had never known that R was discarding the "data = " parameter. A warning would have been cool. R is a harsh mistress.
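A quick way to see why no warning appeared at fit time, sketched under the assumption that the objects are built as in the question: none of the variables in the formula are columns of df, so R falls back to the environment of the formula, and the fit is the same whether or not data = df is supplied. The names f, vars_needed, fit_with and fit_without are introduced only for this check:

```r
# Rebuild the objects from the question.
set.seed(1)
category = c(rep("red", 5), rep("blue", 5))
x1 = rnorm(10, mean = 1, sd = 1)
err = rnorm(10, mean = 0, sd = 1)
y = ifelse(category == "red", x1 * 2, x1 * 3) + err
df = data.frame(x1 = x1, category = category)
dm = as.data.frame(model.matrix(~ category + 0, data = df))
dm = dm * df$x1

# Which variables does the formula actually need, and are they in df?
f = y ~ as.matrix(dm) + 0
vars_needed = all.vars(f)
vars_needed %in% names(df)

# With or without data = df, lm() finds y and dm in the formula's
# environment, so the two fits have identical coefficients.
fit_with = lm(f, data = df)
fit_without = lm(f)
all.equal(coef(fit_with), coef(fit_without))
```

Running a check like all.vars on the formula before fitting is one way to catch this kind of mismatch early.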