Interpreting estimates of categorical predictors i

Posted 2019-06-08 19:49

Question:

This question already has an answer here:

  • Interpreting interactions in a regression model 2 answers

I'm new to linear regression and I'm trying to figure out how to interpret the summary results. I'm having difficulty interpreting the estimates of categorical predictors. Consider the following example. I added the columns age and length to include a numeric predictor and numeric target.

library(MASS)
data <- as.data.frame(HairEyeColor)

data$length <- c(155, 173, 172, 176, 186, 188, 160, 154, 192, 192, 185, 150, 181, 195, 161, 194,
173, 185, 185, 195, 168, 158, 151, 170, 163, 156, 186, 173, 167, 172, 164, 182)
data$age <- c(48, 44, 8, 23, 23, 63, 64, 26, 8, 56, 40, 11, 17, 12, 60, 10, 9, 21, 46, 7, 12, 9, 32, 37, 52, 64, 36, 31, 41, 24)

summary(lm(length ~ Hair + Eye + Sex + age, data))

Output:

             Estimate Std. Error t value Pr(>|t|)    
(Intercept) 182.72906    8.22026  22.229   <2e-16 ***
HairBrown     6.22998    7.45423   0.836    0.412    
HairRed      -0.38261    7.50570  -0.051    0.960    
HairBlond    -0.25860    7.36012  -0.035    0.972    
EyeBlue      -8.44369    7.36646  -1.146    0.263    
EyeHazel      0.06968    7.49589   0.009    0.993    
EyeGreen     -0.15554    7.27704  -0.021    0.983    
SexFemale    -4.92415    5.18308  -0.950    0.352    
age          -0.19084    0.15910  -1.200    0.243

Most of these aren't significant, but let's ignore that for now.

  1. What is there to say about (Intercept)? Intuitively, I'd say this is the value for length when the baseline values for the categorical predictors (Hair = Black, Eye = Brown, Sex = Male) apply, and when age = 0. Is this correct?

  2. The mean value of length in the dataset is 173.8125, yet the intercept estimate is 182.72906. Does that imply that for the baseline situation, the estimated length is actually higher than the average length?

  3. A similar question to question 2: let's say Eye = Blue, and all other values remain at the baseline. The estimate then becomes 174.285 (182.72906 - 8.44369). Can I infer from this that the expected average length is then 174.285 and thus still higher than the overall average (173.8125)?

  4. How can I discover which predictor/value has a positive or negative effect on length? Simply taking the direction of the estimate won't work: A negative estimate only means it has a negative impact when compared to the baseline. Does this mean I can only infer that for example Eye = Blue has a negative impact when compared to Eye = Brown, rather than to infer that it has a negative impact in general?

  5. How come (Intercept) is significant while all other rows aren't? What does the significance of the intercept stand for?

  6. When running the model with only Hair as a predictor, the direction of Hair = Blond becomes positive (see below), while it is negative in the previous model. Is it then wiser to run the model separately for each predictor so that I can capture the true size and direction of an individual predictor?

        summary(lm(length ~ Hair, data))

                    Estimate Std. Error t value Pr(>|t|)
        (Intercept)  173.125      5.107  33.900   <2e-16 ***
        HairBrown      4.250      7.222   0.588    0.561
        HairRed       -2.625      7.222  -0.363    0.719
        HairBlond      1.125      7.222   0.156    0.877
    

Thank you for your help.

Answer 1:

Taking these pointwise:

1) Yes, your interpretation is correct. Likewise, HairBrown = 6.23 means that the estimated length is about 6.2 units greater for brown-haired individuals than for the baseline category, with the other predictors held fixed. In this case the baseline is black-haired, but it's worth noting that the choice of baseline is arbitrary for categorical variables.

2) I would not really interpret the intercept value by itself in this manner, because: A) remember that you also have a continuous predictor (age) in there which you are not incorporating into this notion; there is nobody at age = 0, so you are estimating a value for an individual that does not (or cannot, rather) occur in your dataset. B) you have several explanatory variables and so 'baseline situation' is lumping things together which need not be lumped. You have the information about what each variable is doing and can combine them to predict the value for any particular combination of age, eye colour, sex and hair colour.
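As a rough sketch of combining the estimates for one particular case, the snippet below copies the coefficients from the summary output in the question and adds them up by hand. The person described (brown hair, blue eyes, female, age 30) is a hypothetical example chosen for illustration, not a row from the data.

```r
# Coefficients copied from the summary output in the question
b <- c("(Intercept)" = 182.72906,
       HairBrown     = 6.22998,
       EyeBlue       = -8.44369,
       SexFemale     = -4.92415,
       age           = -0.19084)

# Hypothetical person: brown hair, blue eyes, female, age 30.
# Each dummy contributes its coefficient once; age contributes slope * value.
pred <- b[["(Intercept)"]] + b[["HairBrown"]] + b[["EyeBlue"]] +
  b[["SexFemale"]] + b[["age"]] * 30

round(pred, 2)
```

With a fitted model object, `predict()` with a `newdata` data frame would do the same arithmetic and could also return an interval, which matters here given how uncertain the estimates are.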

3) You could in some cases, but you're talking about somebody with age = 0 in your example. Even otherwise, I don't see why you are trying to compare against an average situation (for reasons explained in the previous case). Additionally, ignoring the continuous predictor for the moment, differences in sample size between groups can strongly affect the overall average. It is almost always more meaningful to compare groups with each other than to compare individual groups against an overall average. Also note that this ignores uncertainty in the parameter estimates.

4) 'Has a negative impact in general' is not very meaningful. An effect is by necessity a comparison, i.e. negative relative to something. What you can do is make pairwise comparisons between any two categories (not just against the baseline) using the estimated coefficients, because the relationships are transitive. E.g. both EyeBlue and EyeGreen are negative relative to the baseline, but EyeBlue is much more negative. So blue-eyed individuals have a shorter estimated length than green-eyed individuals (ignoring the fact that the variables are not significant).
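To illustrate the transitivity, here is a small sketch using the Hair-only model from the question, since all the length values are given there. It compares Blond against Brown in two ways: by subtracting the two coefficients from the Black-baseline fit, and by refitting with Brown as the baseline via `relevel()`. The column name `Hair2` is just a name made up for this example.

```r
data <- as.data.frame(HairEyeColor)
data$length <- c(155, 173, 172, 176, 186, 188, 160, 154, 192, 192, 185, 150,
                 181, 195, 161, 194, 173, 185, 185, 195, 168, 158, 151, 170,
                 163, 156, 186, 173, 167, 172, 164, 182)

# Default fit: Black is the baseline level of Hair
fit_black <- lm(length ~ Hair, data)

# Blond vs Brown, obtained by subtracting the two coefficients
diff_blond_brown <- coef(fit_black)[["HairBlond"]] - coef(fit_black)[["HairBrown"]]

# Same comparison read directly from a fit with Brown as the baseline
data$Hair2 <- relevel(data$Hair, ref = "Brown")
fit_brown <- lm(length ~ Hair2, data)

diff_blond_brown
coef(fit_brown)[["Hair2Blond"]]
```

Both numbers should agree, which is why the choice of baseline changes the coefficients shown but not the comparisons you can draw from them.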

5) A significant intercept just means that the estimated length for your baseline case is not equal to 0. In most cases this is not very informative, especially because (again!) this assumes an age of 0. This is a problem of extrapolation.

6) No, but this is not a very simple topic (look up model selection if you're interested in knowing more). In this case, none of your variables are significant, which (loosely speaking) means that you can't really say whether any variable has either a positive or negative effect. So it's not surprising that changes in model structure flip the sign. Look at the confidence intervals to see how broad the parameter estimates are; they will range from negative to positive. Basically, your variables probably don't explain much, assuming you have a decent sample size.
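A quick way to look at those intervals is `confint()`. The sketch below uses the Hair-only model from the question, because all the length values are listed there; the same call works on the larger model.

```r
data <- as.data.frame(HairEyeColor)
data$length <- c(155, 173, 172, 176, 186, 188, 160, 154, 192, 192, 185, 150,
                 181, 195, 161, 194, 173, 185, 185, 195, 168, 158, 151, 170,
                 163, 156, 186, 173, 167, 172, 164, 182)

fit <- lm(length ~ Hair, data)

# 95% confidence intervals for every coefficient
ci <- confint(fit, level = 0.95)
ci

# TRUE where the interval contains zero, i.e. the sign is not pinned down
contains_zero <- ci[, 1] < 0 & ci[, 2] > 0
contains_zero
```

For the hair coefficients the intervals span zero, so a sign flip between models is within what the data allow.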

All comparisons here are much easier to think about with a figure (made using your parameter values above, and coloured by hair colour):



Answer 2:

  1. Yes. The dummy variables are created by treatment (dummy) coding, R's default for unordered factors, so your intercept is indeed the prediction for the base levels.

  2. Yes, as stated in point 1.

  3. Yes, you can conclude that, but the difference is small. You should check whether the overall average falls within the confidence interval. If it does, the difference between the average and the value for Blue isn't significant for practical purposes.

  4. Since these are all dummy variables, a positive estimate indicates a positive effect relative to the baseline level, and a negative estimate a negative one. To be more precise, look at the confidence intervals: only if both the lower and upper bounds are positive can you say with confidence that the variable has a positive effect. Otherwise the direction is uncertain.

  5. Since your data doesn't give the model any information about what happens when all variables are zero, the model has fewer observations with which to make a meaningful prediction about the intercept. Your predictors will never all be zero at the same time.

  6. Yes you can do that, but it will mostly give you only the direction, provided the confidence intervals don't include zero between them.
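The dummy coding mentioned in point 1 can be inspected directly. As a minimal sketch, `model.matrix()` shows the 0/1 columns R builds from the three factors; the first level of each factor gets no column of its own and is absorbed into the intercept.

```r
data <- as.data.frame(HairEyeColor)

# The first level of each factor is the baseline under R's default coding
sapply(data[c("Hair", "Eye", "Sex")], function(f) levels(f)[1])

# Design matrix for the categorical predictors only
X <- model.matrix(~ Hair + Eye + Sex, data)
colnames(X)

# First few rows: each row has a 1 in the column of its non-baseline levels
head(X, 4)
```

The column names match the row labels of the coefficient table in the question, which is why each estimate reads as a difference from the baseline level of its own factor.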

If I were you, I'd choose a different model, such as regression trees, which are known to work well with categorical variables.