R - Conditional lagging - How to lag a certain amo

Posted 2020-07-22 18:15

Question:

Been trying to solve this for weeks, but can't seem to get it.

I have the following data frame:

    post_id user_id
1    post-1   user1
2    post-2   user2
3 comment-1   user1
4 comment-2   user3
5 comment-3   user4
6    post-3   user2
7 comment-4   user2
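
For anyone who wants to reproduce this, the data frame above can be rebuilt roughly as follows. This is a minimal sketch that assumes both columns are plain character vectors (not factors); the values are copied from the printout.

```r
# Recreate the example data frame shown above.
# Assumption: both columns are character vectors, not factors.
df <- data.frame(
  post_id = c("post-1", "post-2", "comment-1", "comment-2",
              "comment-3", "post-3", "comment-4"),
  user_id = c("user1", "user2", "user1", "user3",
              "user4", "user2", "user2"),
  stringsAsFactors = FALSE
)

print(df)
```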

I want to create a new variable, parent_id, so that the following steps are performed for each observation:

  1. Check whether post_id is a post or a comment.
  2. If post_id is a post, parent_id should equal the earliest post_id in the whole data frame.
  3. If post_id is the first post, parent_id should be NA.
  4. If post_id is a comment, parent_id should equal the first post_id encountered above it (the nearest preceding post).

The output should look something like:

    post_id user_id parent_id_man
1    post-1   user1            NA
2    post-2   user2        post-1
3 comment-1   user1        post-2
4 comment-2   user3        post-2
5 comment-3   user4        post-2
6    post-3   user2        post-1
7 comment-4   user2        post-3

I have tried the following:

library(dplyr)
library(tidyr)

#Prepare data
df <- df %>% separate(post_id, into=c("type","number"), sep="-", remove=FALSE)
df$number <- as.numeric(df$number)
df <- df %>% mutate(comment_number = ifelse(type == "comment",number,99999))
df <- df %>% mutate(post_number = ifelse(type == "post",number,99999))

#Create parent_id column
df <- df %>% mutate(parent_id = ifelse(type == "post",paste("post-",min(post_number), sep=""),0))
df <- df %>% mutate(parent_id = ifelse(parent_id == post_id,"NA",parent_id))
df <- df %>% select(-comment_number, -post_number)

With that code I am able to perform steps 1, 2 and 3, but step 4 is beyond me. I get the feeling that some kind of conditional lagging should be able to solve it, but I can't work out how to do it.

Any ideas would be very much appreciated!

Answer 1:

Building on your solution,

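# Row indices of the posts and of the comments
x <- which(df$type == 'post')
z <- which(df$type == 'comment')
# For each comment row, findInterval() gives the position in x of the nearest post row above it
df$parent_id[df$parent_id == 0] <- df$post_id[x[sapply(z, function(i) findInterval(i, x))]]
df$parent_id
#[1] "NA"     "post-1" "post-2" "post-2" "post-2" "post-1" "post-3"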
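
For reference, the same idea can be sketched from scratch in base R, without the intermediate type/number columns. This is only an illustration under a few assumptions that are not in the original post: "earliest post" is taken to mean the first post in row order, missing parents are stored as a real NA rather than the string "NA", and the helper names (is_post, post_rows, nearest) are made up for this example. It relies on findInterval() being vectorised, so no sapply() is needed.

```r
# Example data copied from the question
df <- data.frame(
  post_id = c("post-1", "post-2", "comment-1", "comment-2",
              "comment-3", "post-3", "comment-4"),
  user_id = c("user1", "user2", "user1", "user3",
              "user4", "user2", "user2"),
  stringsAsFactors = FALSE
)

# Hypothetical helpers: which rows are posts, and where they sit
is_post   <- startsWith(df$post_id, "post")
post_rows <- which(is_post)

# For every row, the position in post_rows of the latest post at or above it
# (0 when no post precedes the row)
nearest <- findInterval(seq_len(nrow(df)), post_rows)

parent_id <- rep(NA_character_, nrow(df))

# Step 4: comments take the nearest post above them
has_parent <- !is_post & nearest > 0
parent_id[has_parent] <- df$post_id[post_rows[nearest[has_parent]]]

# Steps 2 and 3: posts take the first post (assumed: first in row order),
# and the first post itself gets NA
parent_id[is_post] <- df$post_id[post_rows[1]]
parent_id[post_rows[1]] <- NA_character_

df$parent_id <- parent_id
df$parent_id
#[1] NA       "post-1" "post-2" "post-2" "post-2" "post-1" "post-3"
```

Using a real NA keeps is.na(df$parent_id) meaningful, which the string "NA" would not.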