I have a computer science background & I am trying to teach myself data science by solving the problems available on the internet
I have a smallish data set which has 3 variables - race, gender and annual income. There are about 10,000 sample observations. I am trying to predict income from race & gender.
I have divided the data into 2 parts - one for each gender & now I am trying to create 2 regression models. Is this possible in R? Can some one provide example syntax.
You don't specify how your data are stored or how the variable race is recorded (is it a factor?)
[If you're just fitting income against race for males, say, and you had the male income and race in income.m
and race.m
and if the second was a factor in R, then lm(income.m~race.m)
will fit the line for males (use summary
on the resulting object to get information about it). You could do something similar for females. But most people won't fit the models this way.]
If you're prepared to assume that the variation about the lines is the same for both genders, you can fit both lines with one model.
This has several advantages over analyzing the lines separately, though that can also be done.
If gender is either a factor or a numeric variable recorded as (0/1), and race is a factor and you have the data in a data frame (called, for example, incdata
), then you'd fit both lines at once with:
lm(income~race*gender, data=incdata)
which is R shorthand for
lm(income~race+gender+race:gender, data=incdata)
where race:gender
is an interaction term.
If you further assume that the effect of race is the same for both sexes, then the smaller model:
lm(income~race+gender, data=incdata)
would be used instead. This would often be the model people would fit if asked to 'control for gender', though many would consider the interaction model I mentioned before instead.
I'd strongly advise working on more simple regression problems first, with a textbook or set of notes suitable for guiding you through the ideas.
If you haven't already fitted a regression in R, I'd start with a smaller data set that only has a single predictor just to get used to the basic mechanics.
R comes with many data sets already built in. See, for example, library(help=datasets)
which has about 80 data sets; some of the packages that come with R have more (MASS
has over 80, for example). Many R packages on CRAN are packed with data sets, many suitable for regression.
For example, the cars
data set (see ?cars
in R) records the stopping distance of cars, given their speed. You don't need to read the data in, it's already there.
A simple linear regression (not necessarily the best model given some understanding of physics, but just about adequate for the data) would be:
lm(dist~speed, cars)
Again, you use summary
to examine it. e.g. (I suggest you type these one at a time):
carsfit <- lm(dist~speed, cars)
summary(carsfit)
plot(dist~speed, cars)
abline(carsfit, col=2)
The examples in the help on the cars data set (?cars
) gives several other models and plots. You might try those one at a time also.
The car
package (CAR is short for "Companion to Applied Regression") has many small data sets specifically for regression.
It is very simple.
fit1 <- lm(income~gender+race,data=Dataframe1)
summary(fit1)
I would not recommend using two dataframes. Unless you are using more advanced statistical methods that require using two dataframes. Just use your gender variable.
Also, check this site out: http://www.statmethods.net/stats/regression.html
You could indeed do so Abhi but I believe your question is very broad.
(1) you could predict income from race and gender. This can be done in various ways but the most common would perhaps be "regression analysis". I suggest you do some searches on the internet on that topic. Answering what kind of regression and how to perform it is a matter of situation. You would probably find out yourself after reading about regression.
(2) R can do that. But i suggest you do some reading about regression before you get into R.
(3) If I were to analyze if race and gender can predict income I would simply do a linear regression where income would be the dependent variable and race and sex would be independent (predictors). This can be done by the "lm" function in R.
Or did I misunderstand something here?
Regards
You need to do some reading on Linear/Multiple Regression techniques. Not sure why you divide data into 2 groups based on gender. Random split the data into Train and Test, so that you can model on Train and Validate on test.