The spearman rho and trendline do not match

2019-08-09 21:33发布

I have an issue where the trendline for the second grid looks negative whereas the spearman correlation is weak positive (0.1). I would appreciate if someone can clarify whether the difference of direction is due to incorrect formula or weak correlation.

I also realized that similar issue occurs at rho=-0.3 where the trendline is positive.

Thanks.

sc_df
    OTU_166911 Body weight EXPT         Group
68   41.132985        36.5 ABX2 S T2 HFHS+amp
69   15.589949        34.8 ABX2 S T2 HFHS+amp
70   15.504802        30.5 ABX2 S T2 HFHS+amp
71    5.339616        35.8 ABX2 S T2 HFHS+amp
72   40.697005        33.9 ABX2 S T2 HFHS+amp
188   2.893428        33.4 ABX3 S T2 HFHS+amp
189  20.891697        37.6 ABX3 S T2 HFHS+amp
190   3.195469        40.5 ABX3 S T2 HFHS+amp
191   2.689137        34.2 ABX3 S T2 HFHS+amp
192  13.997269        30.0 ABX3 S T2 HFHS+amp

df4
       Group EXPT value
1 S T2 HFHS+amp ABX2  0.30
2 S T2 HFHS+amp ABX3  0.10


ggplot(sc_df, aes(x = sc_df[,partner1], y = sc_df[,partner2])) + 
            geom_point(shape=1, color="blue", size = 3) +    
            geom_smooth(method="lm", se=FALSE) + 
            facet_wrap(~EXPT, scales = "free") +    
            geom_text(data=df4, aes(x=Inf, y=Inf,hjust=2,vjust=2, label=paste("rho==",value,sep="")), parse=T, family = "Arial", size=4) +
            xlab(partner1) + 
            ylab(partner2) + 
            theme(plot.title = element_text(hjust = 0.5),text=element_text(family="Arial", size=10)) +
            ggtitle(g)

Scatter plot

标签: r ggplot2
1条回答
ゆ 、 Hurt°
2楼-- · 2019-08-09 22:11

The discrepancy is due to using Spearman's rho, while the trendline is based on a linear model, i.e. Pearson's r.

Consider the relevant text from ?cor:

For cor(), if method is "kendall" or "spearman", Kendall's tau or Spearman's rho statistic is used to estimate a rank-based measure of association. These are more robust and have been recommended if the data do not necessarily come from a bivariate normal distribution. ... Note that "spearman" basically computes cor(R(x), R(y)) ... where R(u) := rank(u, na.last = "keep").

I renamed your variables for simplicity:

dput(temp)
structure(list(x = c(41.132985, 15.589949, 15.504802, 5.339616, 
40.697005, 2.893428, 20.891697, 3.195469, 2.689137, 13.997269
), y = c(36.5, 34.8, 30.5, 35.8, 33.9, 33.4, 37.6, 40.5, 34.2, 
30), z = structure(c(1L, 1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L, 2L), .Label = c("ABX2", 
"ABX3"), class = "factor")), class = "data.frame", row.names = c(NA, 
-10L), .Names = c("x", "y", "z"))

First, we'll prove that the definition of Spearman's rho is correct and that it differs from Pearson's r.

library(dplyr)

temp %>% 
  group_by(z) %>% 
  mutate(RX = rank(x), RY = rank(y)) %>% 
  summarise(rho1 = cor(x, y, method = "spearman"),
            rho2 = cor(RX, RY, method = "pearson"),
            r = cor(x, y, method = "pearson"))
       z  rho1  rho2           r
  <fctr> <dbl> <dbl>       <dbl>
1   ABX2   0.3   0.3  0.20366115
2   ABX3   0.1   0.1 -0.08183435

Notice that the two values for rho are the same, but that they differ in sign and magnitude from r.

Reasons for this include yes, poor correlation, but also that ranking removes any information about how far apart each observation is. Even if two observations are infinitesimally close together, they'll still be 1 ranking apart. Likewise, two observations could be hugely different, but if there are none between them, they'll only be 1 ranking apart.

Have a look:

temp %>% 
  group_by(z) %>% 
  mutate(RX = rank(x), RY = rank(y)) %>% 
  ggplot(aes(x, y)) + 
  geom_point() +
  geom_text(aes(label = paste0("RX=", RX, "\nRY=", RY))) +
  facet_grid(~z)

enter image description here

Notice the two left-most points in the right panel. Even though they're very close together, their rankings are only 1 unit apart in each direction. As far as rho is concerned, they share as much information in the y-direction as the top two points, which are much farther apart.

To illustrate how much this can change the values, let's rescale the ranks to the scale of the original values. The original computation of rank gives you 1 to 5, let's instead make those evenly spaced across, for example, 5.3 to 41.1 for the first group in the X direction.

library(scales)

temp %>% 
  group_by(z) %>% 
  mutate(RX = rank(x), RY = rank(y),
         scaledRX = scales::rescale(RX, to = range(x)),
         scaledRY = scales::rescale(RY, to = range(y)))
           x     y      z    RX    RY  scaledRX scaledRY
       <dbl> <dbl> <fctr> <dbl> <dbl>     <dbl>    <dbl>
 1 41.132985  36.5   ABX2     5     5 41.132985   36.500
 2 15.589949  34.8   ABX2     3     3 23.236300   33.500
 3 15.504802  30.5   ABX2     2     1 14.287958   30.500
 4  5.339616  35.8   ABX2     1     4  5.339616   35.000
 5 40.697005  33.9   ABX2     4     2 32.184643   32.000
 6  2.893428  33.4   ABX3     2     2  7.239777   32.625
 7 20.891697  37.6   ABX3     5     4 20.891697   37.875
 8  3.195469  40.5   ABX3     3     5 11.790417   40.500
 9  2.689137  34.2   ABX3     1     3  2.689137   35.250
10 13.997269  30.0   ABX3     4     1 16.341057   30.000

Visually, this looks like:

temp %>% 
  group_by(z) %>% 
  mutate(RX = rank(x), RY = rank(y),
         scaledRX = scales::rescale(RX, to = range(x)),
         scaledRY = scales::rescale(RY, to = range(y))) %>% 
  ggplot(aes(x, y)) + 
  geom_point(aes(shape = "original")) + 
  geom_point(aes(scaledRX, scaledRY, shape = "ranked")) +
  geom_segment(aes(xend = scaledRX, yend = scaledRY)) +
  geom_smooth(method = "lm", se = F, aes(color = "original")) +
  geom_smooth(method = "lm", se = F, aes(scaledRX, scaledRY, color = "ranked")) +
  facet_grid(~z) +
  scale_shape_manual(values = c(1,16))

enter image description here

You can see that some points barely move, while some move a lot. Those discrepancies are enough to change the magnitude, and sometimes sign, of the correlation coefficient.

查看更多
登录 后发表回答