I'm trying to group several consecutive rows (assigning them the same value) while leaving some rows empty (when a certain condition is not fulfilled).
My data are locations (xy coordinates), the date/time at which they were measured, and the time span between measurements. Somewhat simplified, they look like this:
ID X Y Time Span
1 3445 7671 0:00 -
2 3312 7677 4:00 4
3 3309 7680 12:00 8
4 3299 7681 16:00 4
5 3243 7655 20:00 4
6 3222 7612 4:00 8
7 3260 7633 0:00 4
8 3254 7641 8:00 8
9 3230 7612 0:00 16
10 3203 7656 4:00 4
11 3202 7678 8:00 4
12 3159 7609 20:00 12
...
I'd like to assign a value to every sequence of locations that are measured within a time span of 4 hours, and make my data look like this:
ID X Y Time Span Sequence
1 3445 7671 0:00 - -
2 3312 7677 4:00 4 1
3 3309 7680 12:00 8 NA
4 3299 7681 16:00 4 2
5 3243 7655 20:00 4 2
6 3222 7612 4:00 8 NA
7 3260 7633 0:00 4 3
8 3254 7641 8:00 8 NA
9 3230 7612 0:00 16 NA
10 3203 7656 4:00 4 4
11 3202 7678 8:00 4 4
12 3159 7609 20:00 12 NA
I've tried several approaches using a for loop plus an ifelse condition, like:
Sequence <- for (i in 1:max(ID)) {
ifelse (Span <= 4, i+1, "NA")
}
without any luck. I know my attempt is incorrect, but my programming skills are really basic and I haven't found a similar problem on the web.
Any ideas would be much appreciated!
Here is a longish one-liner:
ifelse(x <- DF$Span == 4, cumsum(c(head(x, 1), tail(x, -1) - head(x, -1) == 1)), NA)
# [1] NA 1 NA 2 2 NA 3 NA NA 4 4 NA
Explanation:
- x is a vector of TRUE/FALSE showing where Span is 4.
- tail(x, -1) is a safe way of writing x[2:length(x)].
- head(x, -1) is a safe way of writing x[1:(length(x)-1)].
- tail(x, -1) - head(x, -1) == 1 is a vector of TRUE/FALSE showing where we went from Span != 4 to Span == 4.
- Since the vector above is one element shorter than x, I prepended head(x, 1) in front of it. head(x, 1) is a safe way of writing x[1].
- Then I take the cumsum so it converts the vector of TRUE/FALSE into a vector of increasing integers: where Span jumps from != 4 to == 4 it increases by 1, otherwise it stays constant.
- Everything is wrapped into an ifelse so you only see numbers where x is TRUE, i.e., where Span == 4.
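To make the steps above concrete, here is a minimal sketch that rebuilds the Span column from the question and keeps each intermediate vector. The names starts, run_id, and result are only illustrative, and Span is stored as character here (an assumption) because its first value is "-".

```r
# Span column from the question; character because of the "-" (assumption)
DF <- data.frame(
  Span = c("-", "4", "8", "4", "4", "8", "4", "8", "16", "4", "4", "12"),
  stringsAsFactors = FALSE
)

x <- DF$Span == 4   # TRUE where Span is 4

# TRUE where a run of 4s begins: the first element, or a FALSE -> TRUE step
starts <- c(head(x, 1), tail(x, -1) - head(x, -1) == 1)

run_id <- cumsum(starts)          # number of runs started so far
result <- ifelse(x, run_id, NA)   # keep the counter only where Span is 4

print(starts)
print(run_id)
print(result)
# [1] NA  1 NA  2  2 NA  3 NA NA  4  4 NA
```

Printing starts and run_id side by side shows why the final ifelse is needed: run_id keeps its last value across the rows where Span is not 4, and the ifelse blanks those rows out.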
Here's another alternative using rle and rep. We'll assume that your data.frame is named "test".
First, initialize your "Sequence" column, filling it with NA.
test$Sequence <- NA
Second, specify the condition that you are matching, in this case, test$Span == 4.
x <- test$Span == 4
Third, use the combination of rle's output (lengths and values) to get how many rows each run of matching values covers.
spanSeq <- rle(x)$lengths[rle(x)$values == TRUE]
Finally, use rep with the times argument set to the result obtained in step 3. Subset the required values of test$Sequence according to the index matched by test$Span == 4, and replace them with your new sequence.
test$Sequence[x] <- rep(seq_along(spanSeq), times = spanSeq)
test
# ID X Y Time Span Sequence
# 1 1 3445 7671 0:00 - NA
# 2 2 3312 7677 4:00 4 1
# 3 3 3309 7680 12:00 8 NA
# 4 4 3299 7681 16:00 4 2
# 5 5 3243 7655 20:00 4 2
# 6 6 3222 7612 4:00 8 NA
# 7 7 3260 7633 0:00 4 3
# 8 8 3254 7641 8:00 8 NA
# 9 9 3230 7612 0:00 16 NA
# 10 10 3203 7656 4:00 4 4
# 11 11 3202 7678 8:00 4 4
# 12 12 3159 7609 20:00 12 NA
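For a closer look at steps 3 and 4, the sketch below shows what rle returns for the condition vector derived from the question's Span values. The vector is typed in by hand here, and the names r and run_labels are only illustrative.

```r
# Condition vector for the question's data: TRUE where Span == 4
x <- c(FALSE, TRUE, FALSE, TRUE, TRUE, FALSE,
       TRUE, FALSE, FALSE, TRUE, TRUE, FALSE)

r <- rle(x)        # run-length encoding: consecutive equal values become runs
print(r$values)    # the value of each run
print(r$lengths)   # how many elements each run covers

# Keep only the lengths of the TRUE runs, as in step 3
spanSeq <- r$lengths[r$values]
print(spanSeq)
# [1] 1 2 1 2

# One label per TRUE run, repeated for the length of that run (step 4)
run_labels <- rep(seq_along(spanSeq), times = spanSeq)
print(run_labels)
# [1] 1 2 2 3 4 4
```

Storing the rle result in a variable also avoids calling rle twice on the same vector.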
Once you understand the steps involved, you can also do this directly with within(). The following would give you the same result:
within(test, {
  Sequence <- NA
  spanSeq <- rle(Span == 4)$lengths[rle(Span == 4)$values == TRUE]
  Sequence[Span == 4] <- rep(seq_along(spanSeq), times = spanSeq)
  rm(spanSeq)
})
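span <- suppressWarnings(as.numeric(as.character(test$Span)))  # assumes the data frame is named "test"; "-" becomes NA
test$Sequence <- NA
count <- 0
for (i in which(span <= 4)) {
  # Start a new sequence unless the previous row already belongs to one
  if (i == 1 || is.na(test$Sequence[i - 1])) count <- count + 1
  test$Sequence[i] <- count
}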