R - Inconsistent p-value in running Spearman correlation

Posted 2019-08-13 10:08

Question:

My problem is that when I compute a running correlation, for some odd reason I do not get the same p-value for identical estimates (correlation values).

My goal is to calculate a running Spearman correlation on two vectors in the same data.frame (subject1 and subject2 in the example below). My window (the length of each sub-vector) and stride (the step between consecutive windows) are constant. So, looking at the formula below (from wiki), I should get the same t value, and hence the same p-value, for the same Spearman correlation. This is because n stays the same (the window size is fixed) and r is the same. However, the p-value I end up with is different.
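For reference, the t approximation commonly given for Spearman's correlation (assumed here to be the formula meant above) is shown below, with r the Spearman coefficient and n the window size. Under the null hypothesis, t approximately follows a Student's t distribution with n - 2 degrees of freedom, so a fixed n and a fixed r always give the same t.

```latex
t = r \sqrt{\frac{n - 2}{1 - r^{2}}}
```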

# Needed packages (broom must also be installed; it is called via broom::tidy below)
library(tidyverse)
library(pspearman)
library(gtools)

#Sample data
set.seed(528)
subject1 <- rnorm(40, mean = 85, sd = 5)

set.seed(528)
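# dplyr::lag is called explicitly: it shifts the vector by one position,
# whereas stats::lag (used if dplyr is not attached) does not shift it
subject2 <- c(
  dplyr::lag(subject1[1:21]) - 10, 
  rnorm(n = 6, mean = 85, sd = 5), 
  dplyr::lag(subject1[length(subject1):28]) - 10)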

df <- data.frame(subject1 = subject1, 
                 subject2 = subject2) %>% 
  rowid_to_column(var = "Time") 

df[is.na(df)] <- subject1[1] - 10

rm(subject1, subject2)

#Function for Spearman
psSpearman <- function(x, y) 
{
  out <- pspearman::spearman.test(x, y,
                                  alternative = "two.sided", 
                                  approximation = "t-distribution") %>% 
    broom::tidy()
  return(data.frame(estimate = out$estimate,
                    statistic = out$statistic,
                    p.value = out$p.value))
}

#Running correlation along the subjects
dfRunningCor <- running(df$subject1, df$subject2, 
                        fun = psSpearman,
                        width = 20,
                        allow.fewer = FALSE, 
                        by = 1,
                        pad = FALSE, 
                        align = "right") %>% 
  t() %>% 
  as.data.frame() 

#Arranging the Results into easy to handle data.frame 
Results <- do.call(rbind.data.frame, dfRunningCor) %>% 
  t() %>%
  as.data.frame() %>%
  rownames_to_column(var = "Win") %>% 
  gather(CorValue, Value, -Win) %>% 
  separate(Win, c("fromIndex", "toIndex")) %>%
  mutate(fromIndex = as.numeric(substring(fromIndex, 2)),
         toIndex = as.numeric(toIndex)) %>%
  spread(CorValue, Value) %>% 
  arrange(fromIndex) %>% 
  select(fromIndex, toIndex, estimate, statistic, p.value)

My problem is that when I plot the Results with the estimate (Spearman's rho) against the window number (fromIndex) and color by p-value, I should get a "tunnel"/"path" of the same color at the same height. I don't. For example, in the picture below, points at the same height inside the red circle should have the same color, but they don't.

Code for the graph:

Results %>% 
  ggplot(aes(fromIndex, estimate, color = p.value)) + 
  geom_line()

What I have found so far is that it might be due to:
1. Functions like Hmisc::rcorr() tend not to give the same p.value for small samples or many ties. This is why I use pspearman::spearman.test, which from what I read here is supposed to solve this problem.
2. Small sample size: I tried a bigger sample size and still get the same problem.
3. Rounding: I tried rounding my p-values and still get the same problem.

Thank you for your help!

Edit.

Could it be "pseudo" coloring by ggplot? Could it be that ggplot just interpolates the "last" color until the next point? That would explain why I get "light blue" from point 5 to 6 but "dark blue" from point 7 to 8.
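One way to test this hypothesis is to draw the points on top of the line. The sketch below uses a small made-up data frame (toy, not the real Results) with p-values computed from the t approximation for a window of n = 20. As far as I know, geom_line() draws each segment in a single color taken from one of its end points rather than as a gradient, while geom_point() colors every point by its own p.value, so points at the same height should match even when the segments between them do not.

```r
library(ggplot2)

# Toy stand-in for Results: the estimates are illustrative values only
toy <- data.frame(
  fromIndex = 1:6,
  estimate  = c(0.5, 0.8, 0.5, 0.8, 0.5, 0.8)
)

# Two-sided p-value from the t approximation, assuming a window of n = 20
n <- 20
t_stat <- toy$estimate * sqrt((n - 2) / (1 - toy$estimate^2))
toy$p.value <- 2 * pt(-abs(t_stat), df = n - 2)

# Line plus points: compare the point colors at equal heights
p_check <- ggplot(toy, aes(fromIndex, estimate, color = p.value)) +
  geom_line() +
  geom_point(size = 3)

print(p_check)
```

If the points at equal heights share a color but the line segments do not, the mismatch comes from how the line is drawn, not from the p-values.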

Answer 1:

The values you obtain for p.value are consistent with the estimate values. You can check this as follows:

# Ranking by decreasing |estimate| should match ranking by increasing p.value
Results$orderestimate <- order(-abs(Results$estimate))
Results$orderp.value <- order(abs(Results$p.value))
identical(Results$orderestimate, Results$orderp.value)
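The same point can be sketched directly from the t approximation. The helper spearman_t_pvalue below is a hypothetical name introduced for illustration (it is not part of pspearman), and the rho values are made up; n = 20 matches the window width used in the question. It shows that, for a fixed n, the two-sided p-value depends only on the absolute value of rho.

```r
# Hypothetical helper: two-sided p-value from the t approximation
# t = rho * sqrt((n - 2) / (1 - rho^2)), with n - 2 degrees of freedom
spearman_t_pvalue <- function(rho, n) {
  t_stat <- rho * sqrt((n - 2) / (1 - rho^2))
  2 * pt(-abs(t_stat), df = n - 2)
}

n <- 20                           # window width from the question
rho <- c(-0.8, -0.5, 0.5, 0.8)    # illustrative values only
p <- spearman_t_pvalue(rho, n)

# Equal |rho| gives equal p, and larger |rho| gives smaller p
data.frame(rho = rho, p.value = p)
```

Comparing the output of this helper with the p.value column of Results for a few windows is a quick way to confirm that equal estimates really do carry equal p-values.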

I don't think you should map the p.value to colour in the graph: it is an unnecessary visual distraction and it is hard to interpret.

If I were you, I would plot only the p.value and perhaps add points to indicate the sign of the estimate variable.

p <- Results %>% 
  ggplot(aes(fromIndex, p.value)) + 
  geom_line()

# If you want to display the sign of the estimate
Results$estimate.sign <- as.factor(sign(Results$estimate))
p + geom_point(aes(color = estimate.sign))