How to remove inconsistencies from a data frame (time series)

Posted 2019-09-28 04:58

Say we have this data frame:

x<- as.data.frame(cbind(c("A","A","A","B","B","B","B","B","C","C","C","C","C","D","D","D","D","D"),
                        c(1,2,3,1,2,3,2,3,1,2,3,4,5,1,2,3,4,5),
                        c(10,12.5,15,2,3.4,5.7,8,9.5,1,5.6,8.9,10,11,2,3.4,6,8,10.5),
                        c(1,3,4,1,2,3,4,3,2,2,3,5,2,3,5,4,5,5)))
colnames(x)<- c("ID", "Visit", "Time", "State")

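ID is the subject ID.

Visit is the sequence number of the visit.

Time is the time that elapsed until a given State was reached.

State is the severity of a disease, where 5 means death. This means you can fluctuate from a worse state to a better one, but you can never improve from state 5, because you are already dead.

I want to identify only those subjects who improve from state 5 to a better state, because these are errors in the data frame (i.e., rows 13 and 16).

In addition, I want to remove the rows where an individual appears to have died more than once (i.e., row 18).

I asked a similar question before, but it was too general: it implied removing every fluctuation to a better state from the data set, which is not what I actually wanted.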

Answer 1:

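Answer to the modified question

The OP has modified the question and is now essentially asking that every row appearing after the first occurrence of state 5 (death) be treated as an error. This covers false recoveries (as in rows 13 and 16) as well as "duplicate deaths" (as in rows 17 and 18).

This requires a completely different approach. One possibility is to use a non-equi join: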

library(data.table)
setDT(x)[x[, first(Visit[State == 5]), by = ID], on = .(ID, Visit > V1), error := TRUE][]
    ID Visit Time State error
 1:  A     1 10.0     1    NA
 2:  A     2 12.5     3    NA
 3:  A     3 15.0     4    NA
 4:  B     1  2.0     1    NA
 5:  B     2  3.4     2    NA
 6:  B     3  5.7     3    NA
 7:  B     2  8.0     4    NA
 8:  B     3  9.5     3    NA
 9:  C     1  1.0     2    NA
10:  C     2  5.6     2    NA
11:  C     3  8.9     3    NA
12:  C     4 10.0     5    NA
13:  C     5 11.0     2  TRUE
14:  D     1  2.0     3    NA
15:  D     2  3.4     5    NA
16:  D     3  6.0     4  TRUE
17:  D     4  8.0     5  TRUE
18:  D     5 10.5     5  TRUE

The visit number of the first state 5 occurrence for each ID is returned by

x[, first(Visit[State == 5]), by = ID]
   ID V1
1:  C  4
2:  D  2

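The subsequent non-equi join then marks only those rows that occur after the first state 5 event of the same ID. Subjects A and B never reach state 5, so they have no entry in the lookup table and none of their rows are marked.

Data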
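The same flag can also be derived without a join. The following is only a sketch of an alternative technique (a running count of deaths per ID), not the method used in this answer, and it assumes the rows are already in visit order within each ID. The name `clean` is made up for illustration.

```r
library(data.table)

x <- data.table(
  ID = c("A","A","A","B","B","B","B","B","C","C","C","C","C","D","D","D","D","D"),
  Visit = c(1,2,3,1,2,3,2,3,1,2,3,4,5,1,2,3,4,5),
  Time = c(10,12.5,15,2,3.4,5.7,8,9.5,1,5.6,8.9,10,11,2,3.4,6,8,10.5),
  State = c(1,3,4,1,2,3,4,3,2,2,3,5,2,3,5,4,5,5))

# cumsum(State == 5) > 0 is TRUE from the first death onwards;
# lagging it by one row marks only the rows AFTER that first death.
x[, error := shift(cumsum(State == 5) > 0, fill = FALSE), by = ID]

# Row numbers flagged as errors
which(x$error)

# Assumed follow-up step: drop the flagged rows
clean <- x[error == FALSE]
```

Unlike the join, this version yields FALSE rather than NA for consistent rows, which makes the subsequent filtering step slightly simpler.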

x <- data.frame(
  ID = c("A","A","A","B","B","B","B","B","C","C","C","C","C","D","D","D","D","D"),
  Visit = c(1,2,3,1,2,3,2,3,1,2,3,4,5,1,2,3,4,5),
  Time = c(10,12.5,15,2,3.4,5.7,8,9.5,1,5.6,8.9,10,11,2,3.4,6,8,10.5),
  State = c(1,3,4,1,2,3,4,3,2,2,3,5,2,3,5,4,5,5))


Answer 2:

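Answer to the original question

The OP has asked to identify errors in the data frame where, for a given ID, state 5 is followed by any state < 5. In the sample data set, rows 13 and 16 should be marked.

Hardik gupta's answer points in the right direction but does not return the expected result: rows 12 and 15 are marked instead of rows 13 and 16. In addition, there is a false alarm for row 17.

Three basic changes are needed: (1) use lag instead of lead, and (2) supply a fill value to shift(). The third concerns the sample data and is described under "Data" below.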

library(data.table)
setDT(x)[, error := State < 5 & shift(State, fill = 0) == 5, by = ID][]
    ID Visit Time State error
 1:  A     1 10.0     1 FALSE
 2:  A     2 12.5     3 FALSE
 3:  A     3 15.0     4 FALSE
 4:  B     1  2.0     1 FALSE
 5:  B     2  3.4     2 FALSE
 6:  B     3  5.7     3 FALSE
 7:  B     2  8.0     4 FALSE
 8:  B     3  9.5     3 FALSE
 9:  C     1  1.0     2 FALSE
10:  C     2  5.6     2 FALSE
11:  C     3  8.9     3 FALSE
12:  C     4 10.0     5 FALSE
13:  C     5 11.0     2  TRUE
14:  D     1  2.0     3 FALSE
15:  D     2  3.4     5 FALSE
16:  D     3  6.0     4  TRUE
17:  D     4  8.0     5 FALSE
18:  D     5 10.5     5 FALSE
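To see why lag rather than lead is needed, here is a small sketch on the five states of subject D alone. The vector name `s` and the result names are made up for illustration.

```r
library(data.table)

s <- c(3, 5, 4, 5, 5)             # states of subject D, in visit order

prev_state <- shift(s, fill = 0)  # lag (default type): value of the previous row
next_state <- shift(s, type = "lead")  # lead: value of the following row

# A false recovery is a row that is below 5 while the PREVIOUS row was 5.
# fill = 0 keeps the first row from comparing against NA.
recovered <- s < 5 & prev_state == 5
recovered
```

Only the third element (row 16 in the full data set) comes out TRUE. Without `fill`, the first comparison would be made against NA.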

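Data

The third change concerns how the sample data set is created.

cbind() returns a matrix, which coerces all columns to a single type, character in this case. as.data.frame() then turns them into factors (the default in R versions before 4.0.0), so the numeric columns are no longer numeric. To avoid this, the sample data set needs to be defined as: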

x <- data.frame(
  ID = c("A","A","A","B","B","B","B","B","C","C","C","C","C","D","D","D","D","D"),
  Visit = c(1,2,3,1,2,3,2,3,1,2,3,4,5,1,2,3,4,5),
  Time = c(10,12.5,15,2,3.4,5.7,8,9.5,1,5.6,8.9,10,11,2,3.4,6,8,10.5),
  State = c(1,3,4,1,2,3,4,3,2,2,3,5,2,3,5,4,5,5))
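The coercion is easy to check on a two-row toy example. This is a minimal sketch; the object names `m`, `df_bad`, `df_ok` and `repaired` are made up for illustration.

```r
# Mixing character and numeric vectors in cbind() yields a character matrix
m <- cbind(c("A", "B"), c(1, 2))
typeof(m)

# The data frame built from that matrix has no numeric columns
# (factor before R 4.0.0, character from 4.0.0 onwards)
df_bad <- as.data.frame(m)
sapply(df_bad, is.numeric)

# Building the data frame column by column keeps each column's own type
df_ok <- data.frame(ID = c("A", "B"), Visit = c(1, 2))
sapply(df_ok, is.numeric)

# Repairing an already coerced column; going through as.character()
# works for both the factor and the character case
repaired <- as.numeric(as.character(df_bad$V2))
```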


Answer 3:

You can use shift from data.table like this:

library(data.table)
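# Note: the third positional argument of shift() is `fill`, not `type`,
# so "lead" is taken as the fill value here and the default type = "lag" applies.
# The output below matches lag behaviour (state 5 whose previous state is not 5).
setDT(x)[, status := ((State == 5) & (shift(State,1,"lead") != 5)), by = ID]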
x
    ID Visit Time State status
 1:  A     1   10     1  FALSE
 2:  A     2 12.5     3  FALSE
 3:  A     3   15     4  FALSE
 4:  B     1    2     1  FALSE
 5:  B     2  3.4     2  FALSE
 6:  B     3  5.7     3  FALSE
 7:  B     2    8     4  FALSE
 8:  B     3  9.5     3  FALSE
 9:  C     1    1     2  FALSE
10:  C     2  5.6     2  FALSE
11:  C     3  8.9     3  FALSE
12:  C     4   10     5   TRUE
13:  C     5   11     2  FALSE
14:  D     1    2     3  FALSE
15:  D     2  3.4     5   TRUE
16:  D     3    6     4  FALSE
17:  D     4    8     5   TRUE
18:  D     5 10.5     5  FALSE


Answer 4:

It's still not clear to me what you're trying to do. Aren't rows 12, 15, and 17 the erroneous ones that should be removed?

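# `tmp` is not defined in this answer; presumably it is x split into one
# data frame per ID (and, judging by the output, ordered by Visit within ID).
do.call(rbind.data.frame, lapply(tmp, function(w) {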
    idx <- diff(w$State) <= 0 & w$State[-length(w$State)] == 5;
    w[!idx, ];
}))
#     ID Visit Time State
#A.1   A     1   10     1
#A.2   A     2 12.5     3
#A.3   A     3   15     4
#B.4   B     1    2     1
#B.5   B     2  3.4     2
#B.7   B     2    8     4
#B.6   B     3  5.7     3
#B.8   B     3  9.5     3
#C.9   C     1    1     2
#C.10  C     2  5.6     2
#C.11  C     3  8.9     3
#C.13  C     5   11     2
#D.14  D     1    2     3
#D.16  D     3    6     4
#D.18  D     5 10.5     5
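The snippet above depends on an object `tmp` that the answer does not define. Here is a runnable sketch under the assumption that `tmp` is simply `x` split by ID, using the numeric version of the sample data. Unlike the output above, it does not reorder the rows by Visit, and the index is padded to full length instead of relying on recycling. The name `res` is made up for illustration.

```r
x <- data.frame(
  ID = c("A","A","A","B","B","B","B","B","C","C","C","C","C","D","D","D","D","D"),
  Visit = c(1,2,3,1,2,3,2,3,1,2,3,4,5,1,2,3,4,5),
  Time = c(10,12.5,15,2,3.4,5.7,8,9.5,1,5.6,8.9,10,11,2,3.4,6,8,10.5),
  State = c(1,3,4,1,2,3,4,3,2,2,3,5,2,3,5,4,5,5))

# Assumption: one data frame per subject
tmp <- split(x, x$ID)

res <- do.call(rbind.data.frame, lapply(tmp, function(w) {
  # TRUE for a state-5 row whose next row does not have a higher state;
  # the last row of each subject has no successor, hence the trailing FALSE
  idx <- c(diff(w$State) <= 0 & w$State[-nrow(w)] == 5, FALSE)
  w[!idx, ]
}))
res
```

This drops the state 5 rows themselves (rows 12, 15 and 17) rather than the rows that follow them, which is the reading of the question this answer proposes.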


Source: How to remove inconsistencies from dataframe (time series)