I'm trying to group several consecutive rows (assigning them the same value) while leaving some of the rows empty (when a certain condition is not fulfilled).
My data are locations (xy coordinates), the date/time at which they were measured, and the time span between measurements. Somewhat simplified, they look like this:
ID X Y Time Span
1 3445 7671 0:00 -
2 3312 7677 4:00 4
3 3309 7680 12:00 8
4 3299 7681 16:00 4
5 3243 7655 20:00 4
6 3222 7612 4:00 8
7 3260 7633 0:00 4
8 3254 7641 8:00 8
9 3230 7612 0:00 16
10 3203 7656 4:00 4
11 3202 7678 8:00 4
12 3159 7609 20:00 12
...
I'd like to assign a value to every sequence of locations that are measured within a time span of 4 hours, and make my data look like this:
ID X Y Time Span Sequence
1 3445 7671 0:00 - -
2 3312 7677 4:00 4 1
3 3309 7680 12:00 8 NA
4 3299 7681 16:00 4 2
5 3243 7655 20:00 4 2
6 3222 7612 4:00 8 NA
7 3260 7633 0:00 4 3
8 3254 7641 8:00 8 NA
9 3230 7612 0:00 16 NA
10 3203 7656 4:00 4 4
11 3202 7678 8:00 4 4
12 3159 7609 20:00 12 NA
I've tried several approaches using a "for" loop plus an "ifelse" condition, like:
Sequence <- for (i in 1:max(ID)) {
ifelse (Span <= 4, i+1, "NA")
}
without any luck. I know my attempt is incorrect, but my programming skills are really basic and I haven't found any similar problem on the web.
Any ideas would be much appreciated!
Here is a longish one liner:
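The one-liner itself is not reproduced here. As a sketch consistent with the explanation that follows, it might look like the code below. The data.frame name test, the NA used for the first Span, and the use of %in% (so that the missing first Span counts as FALSE instead of propagating NA through cumsum) are assumptions made for illustration, and x is assigned on its own line for readability.

```r
# Hypothetical data mirroring the question; the first Span is unknown, so NA.
test <- data.frame(ID = 1:12,
                   Span = c(NA, 4, 8, 4, 4, 8, 4, 8, 16, 4, 4, 12))

# TRUE where Span is 4; %in% treats NA as FALSE.
x <- test$Span %in% 4

# Count each FALSE -> TRUE jump with cumsum, then keep the count only where x is TRUE.
test$Sequence <- ifelse(x,
                        cumsum(c(head(x, 1), tail(x, -1) - head(x, -1) == 1)),
                        NA)

test$Sequence
# NA  1 NA  2  2 NA  3 NA NA  4  4 NA
```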
Explanation:
x is a vector of TRUE/FALSE showing where Span is 4.
tail(x, -1) is a safe way of writing x[2:length(x)].
head(x, -1) is a safe way of writing x[1:(length(x)-1)].
tail(x, -1) - head(x, -1) == 1 is a vector of TRUE/FALSE showing where we went from Span != 4 to Span == 4.
Because that vector is one element shorter than x, I prepended head(x, 1) in front of it. head(x, 1) is a safe way of writing x[1].
Then I take the cumsum, so it converts the vector of TRUE/FALSE into a vector of increasing integers: where Span jumps from != 4 to == 4 it increases by 1, otherwise it stays constant.
Finally, everything is wrapped in ifelse so you only see numbers where x is TRUE, i.e., where Span == 4.

Here's another alternative using
rle and rep. We'll assume that your data.frame is named "test".
First, initialize your "Sequence" column, filling it with NA.
Second, specify the condition that you are matching, in this case, test$Span == 4.
Third, use the combination of rle's output (lengths and values) to get how many times each new run in the sequence occurs.
Finally, use rep with the times argument set to the result obtained in step 3. Subset the required values of test$Sequence according to the index matched by test$Span == 4, and replace them with your new sequence.
Once you understand the steps involved, you can also do this directly with within(). The following would give you the same result:
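The code for this answer is not reproduced here. A minimal sketch of the four steps, followed by a within() equivalent, might look like the code below. The example data, the helper names cond, runs and run_lengths, and the use of %in% instead of == (so that an NA in the first Span is treated as FALSE) are assumptions for illustration, not the answer's original code.

```r
# Hypothetical data mirroring the question; the first Span is unknown, so NA.
test <- data.frame(ID = 1:12,
                   Span = c(NA, 4, 8, 4, 4, 8, 4, 8, 16, 4, 4, 12))

# Step 1: initialize the Sequence column with NA.
test$Sequence <- NA

# Step 2: the condition being matched (NA counts as FALSE with %in%).
cond <- test$Span %in% 4

# Step 3: run-length encode the condition and keep the lengths of the TRUE runs.
runs <- rle(cond)
run_lengths <- runs$lengths[runs$values]   # 1 2 1 2

# Step 4: repeat each run's number as many times as the run is long,
# and write the result into the rows where the condition holds.
test$Sequence[cond] <- rep(seq_along(run_lengths), times = run_lengths)

# The same logic inside within(); temporary objects are removed with rm()
# so that they do not end up as columns.
test2 <- within(test[c("ID", "Span")], {
  Sequence <- rep(NA_integer_, length(Span))
  cond <- Span %in% 4
  runs <- rle(cond)
  Sequence[cond] <- rep(seq_len(sum(runs$values)),
                        times = runs$lengths[runs$values])
  rm(cond, runs)
})

test$Sequence
# NA  1 NA  2  2 NA  3 NA NA  4  4 NA
```

Both test$Sequence and test2$Sequence should hold the same values; the within() form simply avoids repeating test$ and leaves no helper objects in the workspace.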