R - Linear Regression - Control for a variable

Posted 2020-08-09 04:49

Question:

I have a computer science background and I am trying to teach myself data science by solving problems available on the internet.

I have a smallish data set which has 3 variables - race, gender and annual income. There are about 10,000 sample observations. I am trying to predict income from race & gender.

I have divided the data into 2 parts, one for each gender, and now I am trying to create 2 regression models. Is this possible in R? Can someone provide example syntax?

Answer 1:

You don't specify how your data are stored or how the variable race is recorded (is it a factor?)

[If you're just fitting income against race for males, say, and you had the male income and race in income.m and race.m and if the second was a factor in R, then lm(income.m~race.m) will fit the line for males (use summary on the resulting object to get information about it). You could do something similar for females. But most people won't fit the models this way.]
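As a minimal sketch of the separate-fits approach: the data below are simulated, and the names incdata, fit.m and fit.f as well as the factor levels are assumptions for illustration, not the asker's actual data. Rather than physically splitting the data frame, the subset argument of lm() can select one gender at a time.

```r
# Simulated stand-in for the asker's data (column names and levels are assumed)
set.seed(1)
n <- 200
incdata <- data.frame(
  race   = factor(sample(c("A", "B", "C"), n, replace = TRUE)),
  gender = factor(sample(c("F", "M"), n, replace = TRUE))
)
incdata$income <- 30000 + 5000 * (incdata$gender == "M") +
  2000 * as.integer(incdata$race) + rnorm(n, sd = 3000)

# One model per gender, selected with lm()'s subset argument
fit.m <- lm(income ~ race, data = incdata, subset = gender == "M")
fit.f <- lm(income ~ race, data = incdata, subset = gender == "F")

summary(fit.m)  # coefficients, standard errors, R-squared for males
summary(fit.f)  # the same for females
```

Each fit gets its own residual variance estimate, which is the main difference from the single-model approach.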

If you're prepared to assume that the variation about the lines is the same for both genders, you can fit both lines with one model.

This has several advantages over analyzing the lines separately, though that can also be done.

If gender is either a factor or a numeric variable recorded as (0/1), and race is a factor and you have the data in a data frame (called, for example, incdata), then you'd fit both lines at once with:

lm(income~race*gender, data=incdata)

which is R shorthand for

lm(income~race+gender+race:gender, data=incdata)

where race:gender is an interaction term.
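To check that the two formulas really describe the same model, here is a small sketch on simulated data (the data frame, its levels and the effect sizes are made up for illustration):

```r
# Simulated data; names and effect sizes are assumptions for illustration only
set.seed(2)
n <- 200
incdata <- data.frame(
  race   = factor(sample(c("A", "B", "C"), n, replace = TRUE)),
  gender = factor(sample(c("F", "M"), n, replace = TRUE))
)
incdata$income <- 30000 + 5000 * (incdata$gender == "M") +
  2000 * as.integer(incdata$race) + rnorm(n, sd = 3000)

fit.star <- lm(income ~ race * gender, data = incdata)
fit.long <- lm(income ~ race + gender + race:gender, data = incdata)

# Both formulas expand to the same design matrix, so the estimates agree
all.equal(coef(fit.star), coef(fit.long))
names(coef(fit.star))  # main effects plus the race:gender interaction terms
```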

If you further assume that the effect of race is the same for both sexes, then the smaller model:

lm(income~race+gender, data=incdata)

would be used instead. This would often be the model people would fit if asked to 'control for gender', though many would consider the interaction model I mentioned before instead.
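One common way to choose between the two models is an F-test on the nested fits. The sketch below again uses simulated data with assumed names; it is one possible check, not the only one.

```r
# Simulated data; names and effect sizes are assumptions for illustration only
set.seed(3)
n <- 200
incdata <- data.frame(
  race   = factor(sample(c("A", "B", "C"), n, replace = TRUE)),
  gender = factor(sample(c("F", "M"), n, replace = TRUE))
)
incdata$income <- 30000 + 5000 * (incdata$gender == "M") +
  2000 * as.integer(incdata$race) + rnorm(n, sd = 3000)

fit.add <- lm(income ~ race + gender, data = incdata)  # same race effect for both genders
fit.int <- lm(income ~ race * gender, data = incdata)  # race effect may differ by gender

# F-test of the additive model against the interaction model
cmp <- anova(fit.add, fit.int)
cmp
```

A large p-value in the second row suggests the interaction terms add little, in which case the smaller model is a reasonable choice.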

I'd strongly advise working on simpler regression problems first, with a textbook or set of notes suitable for guiding you through the ideas.


If you haven't already fitted a regression in R, I'd start with a smaller data set that only has a single predictor just to get used to the basic mechanics.

R comes with many data sets already built in. See, for example, library(help=datasets) which has about 80 data sets; some of the packages that come with R have more (MASS has over 80, for example). Many R packages on CRAN are packed with data sets, many suitable for regression.

For example, the cars data set (see ?cars in R) records the stopping distance of cars, given their speed. You don't need to read the data in, it's already there.

A simple linear regression (not necessarily the best model given some understanding of physics, but just about adequate for the data) would be:

lm(dist~speed, cars)

Again, use summary to examine it, e.g. (I suggest you type these one at a time):

carsfit <- lm(dist~speed, cars)
summary(carsfit)
plot(dist~speed, cars)
abline(carsfit, col=2)

The examples in the help on the cars data set (?cars) give several other models and plots. You might try those one at a time also.

The car package (CAR is short for "Companion to Applied Regression") has many small data sets specifically for regression.



Answer 2:

It is very simple.

fit1 <- lm(income~gender+race,data=Dataframe1)
summary(fit1)

I would not recommend using two dataframes unless you are using more advanced statistical methods that require them. Just use your gender variable.

Also, check this site out: http://www.statmethods.net/stats/regression.html



Answer 3:

You could indeed do so, Abhi, but I believe your question is very broad.

(1) You could predict income from race and gender. This can be done in various ways, but the most common would perhaps be "regression analysis". I suggest you do some searches on the internet on that topic. What kind of regression to use and how to perform it depends on the situation. You would probably find out yourself after reading about regression.

(2) R can do that. But I suggest you do some reading about regression before you get into R.

(3) If I were to analyze whether race and gender can predict income, I would simply do a linear regression where income would be the dependent variable and race and gender would be the independent variables (predictors). This can be done with the "lm" function in R.

Or did I misunderstand something here?

Regards



Answer 4:

You need to do some reading on linear/multiple regression techniques. I'm not sure why you divided the data into 2 groups based on gender. Randomly split the data into training and test sets, so that you can fit the model on the training set and validate it on the test set.
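A random train/test split can be sketched as follows. This uses the built-in cars data set rather than the asker's data, and the 80/20 ratio and the use of RMSE as the validation measure are assumptions chosen for illustration.

```r
# Random 80/20 split of the built-in cars data (the ratio is an arbitrary choice)
set.seed(4)
n <- nrow(cars)
train.idx <- sample(n, size = round(0.8 * n))
train <- cars[train.idx, ]
test  <- cars[-train.idx, ]

# Fit on the training rows only
fit <- lm(dist ~ speed, data = train)

# Validate on the held-out rows: root mean squared prediction error
pred <- predict(fit, newdata = test)
rmse <- sqrt(mean((test$dist - pred)^2))
rmse
```

With the income data, the same pattern would apply with a formula such as income ~ race + gender in place of dist ~ speed.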