I'm relatively new at R, so please bear with me.
I'm using the Ames dataset (full description of dataset here; link to dataset download here).
I'm trying to create a subset data frame that will allow me to run a linear regression analysis, and I'm trying to remove the outliers using the boxplot.stats function. I created a frame that will include my samples using the following code:
regressionFrame <- data.frame(subset(ames_housing_data[,c('SalePrice','GrLivArea','LotArea')] , BldgType == '1Fam'))
My next objective was to remove the outliers, so I tried to subset using the which() function:
regressionFrame <- regressionFrame[which(regressionFrame$GrLivArea != boxplot.stats(regressionFrame$GrLivArea)$out),]
Unfortunately, that produced the "longer object length is not a multiple of shorter object length" error. Does anyone know a better way to approach this, ideally using the which() subsetting function? I'm assuming it would include some form of lapply(), but for the life of me I can't figure out how. (I figure I can always learn fancier methods later, but this is the one I'm going for right now since I already understand it.)
Nice use of boxplot.stats. You cannot test SAFELY using != if boxplot.stats returns more than one outlier in $out. An analogy here is 1:5 != 1:3. You probably want to try !(1:5 %in% 1:3).
What I mean by SAFELY is that 1:5 != 1:3 gives a wrong result with a warning, but 1:6 != 1:3 gives a wrong result without a warning. The warning is related to the recycling rule. In the latter case, 1:3 can be recycled to have the same length as 1:6 (that is, the length of 1:6 is a multiple of the length of 1:3), so you will be testing with 1:6 != c(1:3, 1:3).
A simple example.
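To make the recycling problem concrete, here is a minimal sketch on made-up numbers (not the Ames data). The vector x, the data frame df and the column names are assumptions chosen purely for illustration; 120 and 100 are deliberately extreme so that boxplot.stats flags both. Because length(x) is a multiple of length(out), the != version misbehaves silently.

```r
# Toy vector (hypothetical data): 120 and 100 are the intended outliers.
x <- c(10, 120, 11, 13, 12, 100, 14, 10, 13, 12, 11, 12)

# Outliers as flagged by boxplot.stats, in order of appearance in x.
out <- boxplot.stats(x)$out

# Unsafe: `out` is recycled along x, so each element of x is compared
# with only ONE of the outliers. No warning here, since 12 is a multiple of 2.
kept_wrong <- x[x != out]

# Safe: %in% checks each element of x against ALL values in `out`.
kept_right <- x[!(x %in% out)]

# The same idea with which() on a data frame (df is a hypothetical example).
df <- data.frame(x = x, y = seq_along(x))
clean <- df[which(!(df$x %in% out)), ]

print(kept_wrong)   # still contains 120
print(kept_right)   # both outliers removed
print(nrow(clean))  # 10 rows remain
```

Applied to the question, the same pattern would presumably read regressionFrame[which(!(regressionFrame$GrLivArea %in% boxplot.stats(regressionFrame$GrLivArea)$out)), ], with no lapply() needed.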