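Aligning sequences with missing values

I'm using R, but you don't necessarily need to know R to answer this question.

Question: I have one sequence that can be considered the ground truth, and a second sequence that is a shifted version of the first with some values missing. I'd like to know how to align the two.

Setup

I have a sequence, ground.truth, which is basically a set of times: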

ground.truth <- rep( seq(1,by=4,length.out=10), 5 ) +
                rep( seq(0,length.out=5,by=4*10+30), each=10 )

Think of ground.truth as the times at which I did the following:

{take a sample every 4 seconds for 10 times, then wait 30 seconds} x 5

I have a second sequence, observations, which is a shifted copy of ground.truth with 20% of the values missing:

nSamples <- length(ground.truth)
idx_to_keep <- sort(sample( 1:nSamples, .8*nSamples ))
theLag <- runif(1)*100
observations <- ground.truth[idx_to_keep] + theLag
nObs     <- length(observations)

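If I plot these vectors, this is what it looks like (remember, think of these as times):

[Figure: plot of ground.truth and observations; image not included]

What I've tried

I want to:

1. calculate the shift (theLag in my example above)
2. calculate a vector idx such that ground.truth[idx] == observations - theLag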

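First, assume we know theLag. Note that ground.truth[1] is not necessarily observations[1] - theLag. In fact, since observations is a subset of ground.truth, we have ground.truth[1 + lagI] == observations[1] - theLag for some lagI >= 0.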

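To calculate this, I thought I'd use cross-correlation (the ccf function).

However, whenever I do this I get the maximum cross-correlation at a lag of 0, which would mean ground.truth[1] == observations[1] - theLag. But I've tried this on examples where I've explicitly made sure that observations[1] - theLag is not ground.truth[1] (i.e. by modifying idx_to_keep so that it does not contain 1).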

The shift theLag shouldn't affect the cross-correlation (isn't ccf(x,y) == ccf(x,y-constant)?), so I was going to work it out later.
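
The invariance claimed above can be checked with a small sketch. The data, the seed, and the constant 50 are assumptions made for illustration only; x is truncated so both series have the same length, and ccf() is left at its default of removing the mean before correlating.

```r
set.seed(1)                           # assumption: fixed seed for reproducibility
x <- cumsum(rexp(20, 0.1))            # made-up increasing "timestamps"
y <- x[-c(3, 7)]                      # drop two values, as in an incomplete copy
x_short <- x[seq_along(y)]            # same length as y

cc_plain   <- ccf(x_short, y,      plot = FALSE)
cc_shifted <- ccf(x_short, y - 50, plot = FALSE)

# Subtracting a constant leaves the estimated cross-correlation unchanged
same_ccf <- isTRUE(all.equal(as.vector(cc_plain$acf), as.vector(cc_shifted$acf)))
same_ccf
```

This only confirms that a constant shift in the values is invisible to ccf(); it says nothing about whether ccf() can recover an index offset.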

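Perhaps I'm misunderstanding something, though, since observations doesn't have as many values in it as ground.truth? Even in the simpler case where I set theLag==0, the cross-correlation function still fails to identify the correct lag, which leads me to believe I'm thinking about this the wrong way.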
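
One possible way to see the difficulty, sketched under the setup above (the seed and the name offset are assumptions): because values are dropped throughout the sequence, the gap between an observation's position and the position of its matching ground-truth value is not a single number. It grows by one each time a value is dropped, so there may be no single index lag for a cross-correlation to find.

```r
set.seed(1)                                   # assumption: fixed seed
nSamples    <- 50
idx_to_keep <- sort(sample(seq_len(nSamples), 0.8 * nSamples))

# Position in ground.truth minus position in observations, per observation
offset <- idx_to_keep - seq_along(idx_to_keep)

# The offset never decreases and ends at most at the number of dropped values
range(offset)
table(offset)
```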

Does anyone have a general approach I could take here, or know of some R functions or packages that could help?

Thanks a lot.

Answer 1:

For the lag, you can compute all the pairwise differences (distances) between your two sets of points:

diffs <- outer(observations, ground.truth, '-')

Your lag should be the value that appears length(observations) times:

which(table(diffs) == length(observations))
# 55.715382960625 
#              86

Double check:

theLag
# [1] 55.71538
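
For readers who want to run the pairwise-difference idea end to end, here is a self-contained sketch. The fixed seed, the rounding to 6 decimals, and the name estimatedLag are assumptions added for illustration; rounding guards against tiny floating-point differences, and which.max takes the most frequent difference instead of requiring an exact count.

```r
set.seed(1)  # assumption: fixed seed for reproducibility

# Data as in the question
ground.truth <- rep(seq(1, by = 4, length.out = 10), 5) +
                rep(seq(0, length.out = 5, by = 4 * 10 + 30), each = 10)
nSamples     <- length(ground.truth)
idx_to_keep  <- sort(sample(seq_len(nSamples), 0.8 * nSamples))
theLag       <- runif(1) * 100
observations <- ground.truth[idx_to_keep] + theLag

# All pairwise differences, rounded so equal shifts land in the same bin
diffs  <- round(outer(observations, ground.truth, "-"), 6)
counts <- table(diffs)

# The true shift occurs once per observation, so it should be the mode
estimatedLag <- as.numeric(names(counts)[which.max(counts)])
c(estimated = estimatedLag, true = theLag)
```

The outer() call builds a length(observations) by length(ground.truth) matrix, so memory use grows with the product of the two lengths.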

Once you have found theLag, the second part of your question is easy:

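# Round both sides: (g + theLag) - theLag is not always exactly g in floating point
idx <- which(round(ground.truth, 6) %in% round(observations - theLag, 6))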
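
If the timestamps carry a little noise, exact matching may miss values. A minimal sketch of nearest-value matching with a tolerance follows; match_with_tolerance and its tol argument are hypothetical names introduced here, not part of any package.

```r
# For each shifted value, return the index of the closest truth value,
# or NA when nothing lies within `tol` of it.
match_with_tolerance <- function(truth, shifted, lag, tol = 1e-6) {
  target <- shifted - lag
  idx <- vapply(target, function(v) which.min(abs(truth - v)), integer(1))
  idx[abs(truth[idx] - target) > tol] <- NA_integer_
  idx
}

truth <- c(1, 5, 9, 13, 17)
obs   <- truth[c(2, 3, 5)] + 0.1            # shifted, incomplete copy

matched   <- match_with_tolerance(truth, obs, lag = 0.1)
unmatched <- match_with_tolerance(truth, c(5.1, 7.6), lag = 0.1)
```

Here matched holds the positions 2, 3 and 5, while the second element of unmatched is NA because 7.5 is not close to any truth value.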

Answer 2:

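If your time series are not too long, the following should work.

You have two vectors of timestamps, the second being a shifted and incomplete copy of the first, and you want to find by how much it was shifted.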

# Sample data
n <- 10
x <- cumsum(rexp(n,.1))
theLag <- rnorm(1)
y <- theLag + x[sort(sample(1:n, floor(.8*n)))]

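We can try all possible lags and, for each one, compute how bad the alignment is by matching each observed timestamp with the closest "truth" timestamp.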

# Loss function
library(sqldf)
f <- function(u) {
  # Put all the values in a data.frame
  d1 <- data.frame(g="truth",    value=x)
  d2 <- data.frame(g="observed", value=y+u)
  d <- rbind(d1,d2)
  # For each observed value, find the next truth value
  # (we could take the nearest, on either side, 
  # but it would be more complicated)
  d <- sqldf("
    SELECT A.g, A.value, 
           ( SELECT MIN(B.value) 
             FROM   d AS B 
             WHERE  B.g='truth' 
             AND    B.value >= A.value
           ) AS next
    FROM   d AS A
    WHERE  A.g = 'observed'
  ")
  # If u overshoots -theLag, an observation can fall after the last
  # truth value, which leaves an NA. If u undershoots, the differences
  # shrink as u gets close to -theLag.
  if(any(is.na(d))) {
    return(Inf)
  } else {
    return( sum(d$`next` - d$value, na.rm=TRUE) )
  }
}
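
The same "next truth value" loss can probably be written without sqldf. The sketch below uses only base R; the name next_truth_loss and its explicit truth/observed arguments are assumptions made for illustration.

```r
# Sum of gaps between each shifted observation and the first truth
# value at or after it; Inf when an observation has no such value.
next_truth_loss <- function(u, truth, observed) {
  truth   <- sort(truth)
  shifted <- observed + u
  # findInterval(..., left.open = TRUE) counts truth values strictly
  # below each shifted value, so +1 indexes the first one >= it
  pos <- findInterval(shifted, truth, left.open = TRUE) + 1
  if (any(pos > length(truth))) return(Inf)
  sum(truth[pos] - shifted)
}

truth    <- c(1, 5, 9)
observed <- c(0.5, 4.5)                      # truth[1:2] shifted by -0.5

loss_aligned   <- next_truth_loss(0.5, truth, observed)
loss_short     <- next_truth_loss(0,   truth, observed)
loss_overshoot <- next_truth_loss(5,   truth, observed)
```

In this toy case the loss is 0 when the shift is undone exactly, 1 when u falls half a unit short for both points, and Inf when an observation is pushed past the last truth value.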

Now we can look for the best lag.

# Look at the loss function
sapply( seq(-2,2,by=.1), f )

# Minimize the loss function.
# Change the interval if it does not converge, 
# i.e., if it seems in contradiction with the values above
# or if the minimum is Inf
(r <- optimize(f, c(-3,3)))
-r$minimum
theLag # Same value, most of the time
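
The comment in the loss function mentions that one could take the nearest truth value on either side. A hedged sketch of that variant follows, using a plain grid search instead of optimize(). The seed, the grid bounds and step, and the names nearest_loss, bestShift and estimatedLag are all assumptions chosen for illustration.

```r
set.seed(42)                                 # assumption: fixed seed

# Sample data, as above
n      <- 10
x      <- cumsum(rexp(n, 0.1))
theLag <- rnorm(1)
y      <- theLag + x[sort(sample(1:n, floor(0.8 * n)))]

# Total distance from each shifted observation to its closest truth value
nearest_loss <- function(u) {
  shifted <- y + u
  sum(vapply(shifted, function(z) min(abs(x - z)), numeric(1)))
}

# Evaluate the loss on a fine grid and keep the best shift
grid      <- seq(-5, 5, by = 0.001)
values    <- vapply(grid, nearest_loss, numeric(1))
bestShift <- grid[which.min(values)]

estimatedLag <- -bestShift
c(estimated = estimatedLag, true = theLag)
```

This loss never becomes Inf, and the grid avoids relying on the loss having a single minimum. The price is that the resolution is limited by the grid step and the true lag has to lie inside the grid.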
