I have this data
date signal
1 2009-01-13 09:55:00 4645.00 4838.931 5358.883 Buy2
2 2009-01-14 09:55:00 4767.50 4718.254 5336.703 Buy1
3 2009-01-15 09:55:00 4485.00 4653.316 5274.384 Buy2
4 2009-01-16 09:55:00 4580.00 4537.693 5141.435 Buy1
5 2009-01-19 09:55:00 4532.00 4548.088 4891.041 Buy2
6 2009-01-27 09:55:00 4190.00 4183.503 4548.497 Buy1
7 2009-01-30 09:55:00 4436.00 4155.236 4377.907 Sell1
8 2009-02-02 09:55:00 4217.00 4152.626 4390.802 Sell2
9 2009-02-09 09:55:00 4469.00 4203.437 4376.277 Sell1
10 2009-02-12 09:55:00 4469.90 4220.845 4503.798 Sell2
11 2009-02-13 09:55:00 4553.00 4261.980 4529.777 Sell1
12 2009-02-16 09:55:00 4347.20 4319.656 4564.387 Sell2
13 2009-02-17 09:55:00 4161.05 4371.474 4548.912 Buy2
14 2009-02-27 09:55:00 3875.55 3862.085 4101.929 Buy1
15 2009-03-02 09:55:00 3636.00 3846.423 4036.020 Buy2
16 2009-03-12 09:55:00 3420.00 3372.665 3734.949 Buy1
17 2009-03-13 09:55:00 3656.00 3372.100 3605.357 Sell1
18 2009-03-17 09:55:00 3650.00 3360.421 3663.322 Sell2
19 2009-03-18 09:55:00 3721.00 3363.735 3682.293 Sell1
20 2009-03-20 09:55:00 3687.00 3440.651 3784.778 Sell2
and have to arrange it in this form
2 2009-01-14 09:55:00 4767.50 4718.254 5336.703 Buy1
7 2009-01-30 09:55:00 4436.00 4155.236 4377.907 Sell1
8 2009-02-02 09:55:00 4217.00 4152.626 4390.802 Sell2
13 2009-02-17 09:55:00 4161.05 4371.474 4548.912 Buy2
14 2009-02-27 09:55:00 3875.55 3862.085 4101.929 Buy1
17 2009-03-13 09:55:00 3656.00 3372.100 3605.357 Sell1
18 2009-03-17 09:55:00 3650.00 3360.421 3663.322 Sell2
So that data is arranged in order of Buy1 Sell1 Sell2 Buy2 and eliminating the middle observations.
I have tried several dplyr:filter commands but none is giving the desired output.
If I have well understood your problem, the following code should solve it. It is adapted from this discussion.
The idea is to define your sequence as a pattern:
pattern <- c("Buy1", "Sell1", "Sell2", "Buy2")
Then find the position of this pattern in your column:
library(zoo)
pos <- which(rollapply(data$signal, 4, identical, pattern, fill = FALSE, align = "left"))
and extract the rows following the position of your patterns:
rows <- unlist(lapply(pos, function(x, n) seq(x, x+n-1), 4))
data_filtered <- data[rows,]
Voilà.
EDIT
Since I had misunderstood your problem, here is a new solution.
You want to retrieve the sequence "Buy1", "Sell1", "Sell2", "Buy2" in your column, and eliminate the observations that do not fit in this sequence. I do not see a trivial vectorised solution, so here is a loop to solve that. Depending on the size of your data, you may want to implement a similar algorithm in RCPP or vectorise it in some ways.
sequence <- c("Buy1", "Sell1", "Sell2", "Buy2")
keep <- logical(length(data$signal))
s <- 0
for (i in seq(1, length(data$signal))){
if (sequence[s +1] == data$signal[i]){
keep[i] <- T
s <- (s + 1) %% 4
} else {
keep[i] <- F
}
}
data_filtered <- data[keep,]
Tell me if this work better.
If anyone has a vectorised solution, I would be curious to see it.
You can coerce the column data$signal into a factor and define the levels.
data$signal <- as.factor(data.$signal, levels = c("Buy1","Sell1","Buy2","Sell2")
Then you can sort it
sorted.data <- data[order(signal),]
Here is a great answer that talks about what you want to do:
Sort data frame column by factor
Here is a Rcpp
solution:
library(Rcpp)
cppFunction('LogicalVector FindHit(const CharacterVector x, const CharacterVector y) {
LogicalVector res(x.size());
int k = 0;
for(int i = 0; i < x.size(); i++){
if(x[i] == y[k]){
res[i] = true;
k = (k + 1) % y.size();
}
}
return res;
}')
dtt[FindHit(dtt$V6, c('Buy1', 'Sell1', 'Sell2', 'Buy2')),]
# V1 V2 V3 V4 V5 V6
# 2 2009-01-14 09:55:00 4767.50 4718.254 5336.703 Buy1
# 7 2009-01-30 09:55:00 4436.00 4155.236 4377.907 Sell1
# 8 2009-02-02 09:55:00 4217.00 4152.626 4390.802 Sell2
# 13 2009-02-17 09:55:00 4161.05 4371.474 4548.912 Buy2
# 14 2009-02-27 09:55:00 3875.55 3862.085 4101.929 Buy1
# 17 2009-03-13 09:55:00 3656.00 3372.100 3605.357 Sell1
# 18 2009-03-17 09:55:00 3650.00 3360.421 3663.322 Sell2
Here is the dtt
:
> dput(dtt)
structure(list(V1 = c("2009-01-13", "2009-01-14", "2009-01-15",
"2009-01-16", "2009-01-19", "2009-01-27", "2009-01-30", "2009-02-02",
"2009-02-09", "2009-02-12", "2009-02-13", "2009-02-16", "2009-02-17",
"2009-02-27", "2009-03-02", "2009-03-12", "2009-03-13", "2009-03-17",
"2009-03-18", "2009-03-20"), V2 = c("09:55:00", "09:55:00", "09:55:00",
"09:55:00", "09:55:00", "09:55:00", "09:55:00", "09:55:00", "09:55:00",
"09:55:00", "09:55:00", "09:55:00", "09:55:00", "09:55:00", "09:55:00",
"09:55:00", "09:55:00", "09:55:00", "09:55:00", "09:55:00"),
V3 = c(4645, 4767.5, 4485, 4580, 4532, 4190, 4436, 4217,
4469, 4469.9, 4553, 4347.2, 4161.05, 3875.55, 3636, 3420,
3656, 3650, 3721, 3687), V4 = c(4838.931, 4718.254, 4653.316,
4537.693, 4548.088, 4183.503, 4155.236, 4152.626, 4203.437,
4220.845, 4261.98, 4319.656, 4371.474, 3862.085, 3846.423,
3372.665, 3372.1, 3360.421, 3363.735, 3440.651), V5 = c(5358.883,
5336.703, 5274.384, 5141.435, 4891.041, 4548.497, 4377.907,
4390.802, 4376.277, 4503.798, 4529.777, 4564.387, 4548.912,
4101.929, 4036.02, 3734.949, 3605.357, 3663.322, 3682.293,
3784.778), V6 = c("Buy2", "Buy1", "Buy2", "Buy1", "Buy2",
"Buy1", "Sell1", "Sell2", "Sell1", "Sell2", "Sell1", "Sell2",
"Buy2", "Buy1", "Buy2", "Buy1", "Sell1", "Sell2", "Sell1",
"Sell2")), row.names = c(NA, -20L), class = "data.frame")