geom_smooth on a subset of data

2020-02-13 09:44发布

Here is some data and a plot:

set.seed(18)
data = data.frame(y=c(rep(0:1,3),rnorm(18,mean=0.5,sd=0.1)),colour=rep(1:2,12),x=rep(1:4,each=6))

ggplot(data,aes(x=x,y=y,colour=factor(colour)))+geom_point()+ geom_smooth(method='lm',formula=y~x,se=F)

enter image description here

As you can see the linear regression is highly influenced by the values where x=1. Can I get linear regressions calculated for x >= 2 but display the values for x=1 (y equals either 0 or 1). The resulting graph would be exactly the same except for the linear regressions. They would not "suffer" from the influence of the values on abscisse = 1

2条回答
Evening l夕情丶
2楼-- · 2020-02-13 10:35

It's as simple as geom_smooth(data=subset(data, x >= 2), ...). It's not important if this plot is just for yourself, but realize that something like this would be misleading to others if you don't include a mention of how the regression was performed. I'd recommend changing transparency of the points excluded.

ggplot(data,aes(x=x,y=y,colour=factor(colour)))+
geom_point(data=subset(data, x >= 2)) + geom_point(data=subset(data, x < 2), alpha=.2) +
geom_smooth(data=subset(data, x >= 2), method='lm',formula=y~x,se=F)

enter image description here

查看更多
Bombasti
3楼-- · 2020-02-13 10:44

The regular lm function has a weights argument which you can use to assign a weight to a particular observation. In this way you can plain with the influence which the observation has on the outcome. I think this is a general way of dealing with the problem in stead of subsetting the data. Of course, assigning weights ad hoc does not bode well for the statistical soundness of the analysis. It is always best to have a rationale behind the weights, e.g. low weight observations have a higher uncertainty.

I think under the hood ggplot2 uses the lm function so you should be able to pass the weights argument. You can add the weights through the aesthetic (aes), assuming that the weight is stored in a vector:

ggplot(data,aes(x=x,y=y,colour=factor(colour))) + 
    geom_point()+ stat_smooth(aes(weight = runif(nrow(data))), method='lm')

you could also put weight in a column in the dataset:

ggplot(data,aes(x=x,y=y,colour=factor(colour))) + 
    geom_point()+ stat_smooth(aes(weight = weight), method='lm')

where the column is called weight.

查看更多
登录 后发表回答