Removing dataframe outliers in R with `boxplot.sta

Posted 2019-08-05 17:03

Question:

I'm relatively new at R, so please bear with me.

I'm using the Ames dataset (full description of dataset here; link to dataset download here).

I'm trying to create a subset data frame that will allow me to run a linear regression analysis, and I'm trying to remove the outliers using the boxplot.stats function. I created a frame that will include my samples using the following code:

regressionFrame <- data.frame(subset(ames_housing_data[,c('SalePrice','GrLivArea','LotArea')] , BldgType == '1Fam'))

My next objective was to remove the outliers, so I tried to subset using a which() function:

regressionFrame <- regressionFrame[which(regressionFrame$GrLivArea != boxplot.stats(regressionFrame$GrLivArea)$out),]

Unfortunately, that produced the

longer object length is not a multiple of shorter object length

error. Does anyone know a better way to approach this, ideally using the which() subsetting function? I'm assuming it would include some form of lapply(), but for the life of me I can't figure out how. (I figure I can always learn fancier methods later, but this is the one I'm going for right now since I already understand it.)

Answer 1:

Nice use of boxplot.stats.

You cannot SAFELY test with != if boxplot.stats returns more than one outlier in $out. An analogy here is 1:5 != 1:3. You probably want !(1:5 %in% 1:3) instead.

regressionFrame <- subset(regressionFrame,
                          subset = !(GrLivArea %in% boxplot.stats(GrLivArea)$out))

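What I mean by SAFELY is that != compares element by element after recycling the shorter vector; it never asks whether a value appears anywhere in the other vector. 1:5 != 1:3 does this with a warning, because 5 is not a multiple of 3, but 1:6 != 1:3 does it without any warning. The warning comes from the recycling rule. In the latter case, 1:3 can be recycled to the same length as 1:6 (that is, the length of 1:6 is a multiple of the length of 1:3), so you will be testing 1:6 != c(1:3, 1:3). If the result happens to match what %in% gives, that is only a coincidence of how the values are ordered.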
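
To make the silent case concrete, here is a small sketch with invented values, where 1, 2 and 3 stand in for the "outliers". The variable names are just for illustration. Because the values in x are not in the same order as in out, the elementwise test keeps everything, and R raises no warning because 6 is a multiple of 3.

```r
x <- c(3, 1, 2, 4, 5, 6)   # invented values; 1, 2, 3 play the role of outliers
out <- c(1, 2, 3)

# length(x) is a multiple of length(out), so recycling raises no warning:
# x is compared position by position with c(1, 2, 3, 1, 2, 3)
elementwise <- x != out

# %in% asks the intended question: does each x appear anywhere in out?
membership <- !(x %in% out)

x[elementwise]   # 3 1 2 4 5 6  (nothing is dropped)
x[membership]    # 4 5 6
```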


A simple example.

x <- c(1:10/10, 101, 102, 103)  ## has three outliers: 101, 102 and 103
out <- boxplot.stats(x)$out  ## `boxplot.stats` has picked them out
x[x != out]  ## this gives a warning and a wrong result: nothing is removed
x[!(x %in% out)]  ## this removes them from x
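
Since the question asks for a which()-based version, the same idea can probably be written that way as well. This is only a sketch: toy is a made-up stand-in for regressionFrame with invented values, not the Ames data.

```r
# Toy stand-in for regressionFrame; the values are invented for illustration
toy <- data.frame(
  SalePrice = c(100, 120, 130, 150, 160, 170, 180, 900, 190),
  GrLivArea = c(10, 11, 12, 13, 14, 15, 16, 200, 17)
)

out <- boxplot.stats(toy$GrLivArea)$out   # values flagged as outliers (here: 200)

# which() converts the logical test into the row positions to keep
keep <- which(!(toy$GrLivArea %in% out))
cleaned <- toy[keep, ]

nrow(toy)       # 9
nrow(cleaned)   # 8
```

Indexing with the rows to keep, rather than with -which() of the rows to drop, is deliberate: when no outliers are found, -which(...) is an empty index and would return a data frame with zero rows.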