generate sequence (and starting over in case of a

2019-05-21 14:11发布

问题:

I am looking for a way to generate a sequence for a column with names of cities grouped by an ID. What is crucial is that when a name of a city is repeated (within the group) a new sequence has to start. A new sequence should also start in case of a new ID.

EDIT:

The question how to create the above mentioned sequence has been solved. To help select the row with the highest sequence number later on, I am looking for a way to add a new column to the data frame that shows for each record, per sequence, per ID the highest number of each sequence.

Here is an example of what I want to achieve, based on a simplified version of my data frame:

ID  City    Sequence    Highest_number
1   Nijmegen    1    2
1   Nijmegen    2    2
1   Arnhem      1    2
1   Arnhem      2    2
1   Nijmegen    1    1
1   Arnhem      1    3
1   Arnhem      2    3
1   Arnhem      3    3
1   Nijmegen    1    1
2   Nijmegen    1    1
2   Utrecht     1    1
2   Amsterdam   1    2
2   Amsterdam   2    2
2   Utrecht     1    4
2   Utrecht     2    4
2   Utrecht     3    4
2   Utrecht     4    4 

mydf <- data.frame(ID = c(1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2), 
        City = c("Nijmegen", "Nijmegen", "Arnhem", "Arnhem", "Nijmegen", 
        "Arnhem", "Arnhem","Arnhem", "Nijmegen", "Nijmegen", "Utrecht", 
       "Amsterdam", "Amsterdam", "Utrecht", "Utrecht", "Utrecht", "Utrecht"))

回答1:

Construct a 'run-length encoding' and use that to generate the sequences

rle <- rle(as.character(mydf$City))
mydf$Sequence <- unlist(lapply(rle$length, seq_len))

For the updated question, where two columns form the key, paste the columns together with a unique symbol and compute with that

rle <- rle(paste(mydf$ID, mydf$City, sep = "\r"))
mydf$Sequence <- unlist(lapply(rle$length, seq_len))

This will be 'fast', especially compared to a for loop.



回答2:

A good old for loop does the trick

mydf$Sequence <- NA

for(i in seq_len(nrow(mydf))) {
  if (i == 1 || (mydf$City[i] != mydf$City[i-1]) || (mydf$ID[i] != mydf$ID[i-1]))
    mydf$Sequence[i] <- 1
  else
    mydf$Sequence[i] <- mydf$Sequence[i-1] + 1

}


标签: r sequence sqldf