arrange dataframe based on one column eliminating

2020-07-30 03:25发布

问题:

I have this data

       date                                           signal 
1   2009-01-13 09:55:00  4645.00  4838.931  5358.883  Buy2
2   2009-01-14 09:55:00  4767.50  4718.254  5336.703  Buy1
3   2009-01-15 09:55:00  4485.00  4653.316  5274.384  Buy2
4   2009-01-16 09:55:00  4580.00  4537.693  5141.435  Buy1
5   2009-01-19 09:55:00  4532.00  4548.088  4891.041  Buy2
6   2009-01-27 09:55:00  4190.00  4183.503  4548.497  Buy1
7   2009-01-30 09:55:00  4436.00  4155.236  4377.907 Sell1
8   2009-02-02 09:55:00  4217.00  4152.626  4390.802 Sell2
9   2009-02-09 09:55:00  4469.00  4203.437  4376.277 Sell1
10  2009-02-12 09:55:00  4469.90  4220.845  4503.798 Sell2
11  2009-02-13 09:55:00  4553.00  4261.980  4529.777 Sell1
12  2009-02-16 09:55:00  4347.20  4319.656  4564.387 Sell2
13  2009-02-17 09:55:00  4161.05  4371.474  4548.912  Buy2
14  2009-02-27 09:55:00  3875.55  3862.085  4101.929  Buy1
15  2009-03-02 09:55:00  3636.00  3846.423  4036.020  Buy2
16  2009-03-12 09:55:00  3420.00  3372.665  3734.949  Buy1
17  2009-03-13 09:55:00  3656.00  3372.100  3605.357 Sell1
18  2009-03-17 09:55:00  3650.00  3360.421  3663.322 Sell2
19  2009-03-18 09:55:00  3721.00  3363.735  3682.293 Sell1
20  2009-03-20 09:55:00  3687.00  3440.651  3784.778 Sell2

and have to arrange it in this form

2   2009-01-14 09:55:00  4767.50  4718.254  5336.703  Buy1
7   2009-01-30 09:55:00  4436.00  4155.236  4377.907 Sell1
8   2009-02-02 09:55:00  4217.00  4152.626  4390.802 Sell2
13  2009-02-17 09:55:00  4161.05  4371.474  4548.912  Buy2
14  2009-02-27 09:55:00  3875.55  3862.085  4101.929  Buy1
17  2009-03-13 09:55:00  3656.00  3372.100  3605.357 Sell1
18  2009-03-17 09:55:00  3650.00  3360.421  3663.322 Sell2

So that data is arranged in order of Buy1 Sell1 Sell2 Buy2 and eliminating the middle observations. I have tried several dplyr:filter commands but none is giving the desired output.

回答1:

If I have well understood your problem, the following code should solve it. It is adapted from this discussion.

The idea is to define your sequence as a pattern:

pattern <- c("Buy1", "Sell1", "Sell2", "Buy2")

Then find the position of this pattern in your column:

library(zoo)
 pos <- which(rollapply(data$signal, 4, identical, pattern, fill = FALSE, align = "left")) 

and extract the rows following the position of your patterns:

rows <- unlist(lapply(pos, function(x, n) seq(x, x+n-1), 4))
data_filtered <- data[rows,]

Voilà.

EDIT

Since I had misunderstood your problem, here is a new solution. You want to retrieve the sequence "Buy1", "Sell1", "Sell2", "Buy2" in your column, and eliminate the observations that do not fit in this sequence. I do not see a trivial vectorised solution, so here is a loop to solve that. Depending on the size of your data, you may want to implement a similar algorithm in RCPP or vectorise it in some ways.

sequence <- c("Buy1", "Sell1", "Sell2", "Buy2")
keep <- logical(length(data$signal))

s <- 0
for (i in seq(1, length(data$signal))){
    if (sequence[s +1] == data$signal[i]){
        keep[i] <- T
        s <- (s + 1) %% 4
    } else {
        keep[i] <- F
    }
}

data_filtered <- data[keep,]

Tell me if this work better. If anyone has a vectorised solution, I would be curious to see it.



回答2:

You can coerce the column data$signal into a factor and define the levels.

data$signal <- as.factor(data.$signal, levels = c("Buy1","Sell1","Buy2","Sell2")

Then you can sort it

sorted.data <- data[order(signal),]

Here is a great answer that talks about what you want to do:

Sort data frame column by factor



回答3:

Here is a Rcpp solution:

library(Rcpp)

cppFunction('LogicalVector FindHit(const CharacterVector x, const CharacterVector y) {
    LogicalVector res(x.size());
    int k = 0;
    for(int i = 0; i < x.size(); i++){
        if(x[i] == y[k]){
            res[i] = true;
            k = (k + 1) % y.size();
        }
    }
    return res;
}')

dtt[FindHit(dtt$V6, c('Buy1', 'Sell1', 'Sell2', 'Buy2')),]

#            V1       V2      V3       V4       V5    V6
# 2  2009-01-14 09:55:00 4767.50 4718.254 5336.703  Buy1
# 7  2009-01-30 09:55:00 4436.00 4155.236 4377.907 Sell1
# 8  2009-02-02 09:55:00 4217.00 4152.626 4390.802 Sell2
# 13 2009-02-17 09:55:00 4161.05 4371.474 4548.912  Buy2
# 14 2009-02-27 09:55:00 3875.55 3862.085 4101.929  Buy1
# 17 2009-03-13 09:55:00 3656.00 3372.100 3605.357 Sell1
# 18 2009-03-17 09:55:00 3650.00 3360.421 3663.322 Sell2

Here is the dtt:

> dput(dtt)
structure(list(V1 = c("2009-01-13", "2009-01-14", "2009-01-15", 
"2009-01-16", "2009-01-19", "2009-01-27", "2009-01-30", "2009-02-02", 
"2009-02-09", "2009-02-12", "2009-02-13", "2009-02-16", "2009-02-17", 
"2009-02-27", "2009-03-02", "2009-03-12", "2009-03-13", "2009-03-17", 
"2009-03-18", "2009-03-20"), V2 = c("09:55:00", "09:55:00", "09:55:00", 
"09:55:00", "09:55:00", "09:55:00", "09:55:00", "09:55:00", "09:55:00", 
"09:55:00", "09:55:00", "09:55:00", "09:55:00", "09:55:00", "09:55:00", 
"09:55:00", "09:55:00", "09:55:00", "09:55:00", "09:55:00"), 
    V3 = c(4645, 4767.5, 4485, 4580, 4532, 4190, 4436, 4217, 
    4469, 4469.9, 4553, 4347.2, 4161.05, 3875.55, 3636, 3420, 
    3656, 3650, 3721, 3687), V4 = c(4838.931, 4718.254, 4653.316, 
    4537.693, 4548.088, 4183.503, 4155.236, 4152.626, 4203.437, 
    4220.845, 4261.98, 4319.656, 4371.474, 3862.085, 3846.423, 
    3372.665, 3372.1, 3360.421, 3363.735, 3440.651), V5 = c(5358.883, 
    5336.703, 5274.384, 5141.435, 4891.041, 4548.497, 4377.907, 
    4390.802, 4376.277, 4503.798, 4529.777, 4564.387, 4548.912, 
    4101.929, 4036.02, 3734.949, 3605.357, 3663.322, 3682.293, 
    3784.778), V6 = c("Buy2", "Buy1", "Buy2", "Buy1", "Buy2", 
    "Buy1", "Sell1", "Sell2", "Sell1", "Sell2", "Sell1", "Sell2", 
    "Buy2", "Buy1", "Buy2", "Buy1", "Sell1", "Sell2", "Sell1", 
    "Sell2")), row.names = c(NA, -20L), class = "data.frame")