Efficiently checking the value of another row in data.table

Posted 2020-07-11 08:09

Note: This is a question that I originally posted to the data.table help group. Matt Dowle asked for a more detailed example and I posted this one, but I had trouble with the formatting in e-mail. I already know how to format on SO, so I thought I would post it here instead.

What I am basically trying to do is subset rows from a data.table based on a value in that row as well as a value in a preceding or following row. Right now, I am creating new columns for future and past rows and then keying the data.table on these columns, but this is resource-intensive and onerous.

The below example illustrates the approach I am using now. The example uses words in documents (I use numeric indices for both). I want to subset for a particular word, but only if it is preceded or followed by another word or set of words:

I first create a dummy dataset with ten documents containing one million words. There are three unique words in the set.

library(data.table)
set.seed(1000)
DT <- data.table(wordindex = sample(1:3, 1000000, replace = TRUE),
                 docindex  = sample(1:10, 1000000, replace = TRUE))
setkey(DT, docindex)
DT[, position := seq_len(.N), by = docindex]


          wordindex docindex position
      1:         1        1        1
      2:         1        1        2
      3:         3        1        3
      4:         3        1        4
      5:         1        1        5
    ---                            
 999996:         2       10    99811
 999997:         2       10    99812
 999998:         3       10    99813
 999999:         1       10    99814
1000000:         3       10    99815

Note that simply counting the occurrences of the first unique word across all documents is easy and beautiful.

setkey(DT,wordindex)
count<-DT[J(1),list(count.1=.N),by=docindex]
count

    docindex count.1
 1:        1   33533
 2:        2   33067
 3:        3   33538
 4:        4   33053
 5:        5   33231
 6:        6   33002
 7:        7   33369
 8:        8   33353
 9:        9   33485
10:       10   33225

It gets messier when taking the position ahead into account. This is a query to count the occurrences of the first unique word across all documents unless it is followed by the second unique word. First I create a new column containing the word one position ahead and then key on both words.

setkey(DT,docindex,position)
DT[,lead_wordindex:=DT[list(docindex,position+1)][,wordindex]]

         wordindex docindex position lead_wordindex
      1:         1        1        1              1
      2:         1        1        2              3
      3:         3        1        3              3
      4:         3        1        4              1
      5:         1        1        5              2
     ---                                           
 999996:         2       10    99811              2
 999997:         2       10    99812              3
 999998:         3       10    99813              1
 999999:         1       10    99814              3
1000000:         3       10    99815             NA

setkey(DT,wordindex,lead_wordindex)
countr2<-DT[J(c(1,1),c(1,3)),list(count.1=.N),by=docindex]
countr2

    docindex count.1
 1:        1   22301
 2:        2   21835
 3:        3   22490
 4:        4   21830
 5:        5   22218
 6:        6   21914
 7:        7   22370
 8:        8   22265
 9:        9   22211
10:       10   22190

I have a very large dataset for which the above query fails with a memory-allocation error. As an alternative, we can create this new column for only the relevant subset of data by filtering the original dataset and then joining it back on the desired position:

setkey(DT,wordindex)
filter<-DT[J(1),list(wordindex,docindex,position)]
filter[,lead_position:=position+1]

        wordindex wordindex docindex position lead_position
     1:         1         1        2    99717         99718
     2:         1         1        3    99807         99808
     3:         1         1        4   100243        100244
     4:         1         1        1        1             2
     5:         1         1        1       42            43
    ---                                                    
332852:         1         1       10    99785         99786
332853:         1         1       10    99787         99788
332854:         1         1       10    99798         99799
332855:         1         1       10    99804         99805
332856:         1         1       10    99814         99815

setkey(DT,docindex,position)
filter[,lead_wordindex:=DT[J(filter[,list(docindex,lead_position)])][,wordindex]]

        wordindex wordindex docindex position lead_position lead_wordindex
     1:         1         1        2    99717         99718             NA
     2:         1         1        3    99807         99808             NA
     3:         1         1        4   100243        100244             NA
     4:         1         1        1        1             2              1
     5:         1         1        1       42            43              1
    ---                                                                   
332852:         1         1       10    99785         99786              3
332853:         1         1       10    99787         99788              3
332854:         1         1       10    99798         99799              3
332855:         1         1       10    99804         99805              3
332856:         1         1       10    99814         99815              3

setkey(filter,wordindex,lead_wordindex)
countr2.1<-filter[J(c(1,1),c(1,3)),list(count.1=.N),by=docindex]
countr2.1

    docindex count.1
 1:        1   22301
 2:        2   21835
 3:        3   22490
 4:        4   21830
 5:        5   22218
 6:        6   21914
 7:        7   22370
 8:        8   22265
 9:        9   22211
10:       10   22190

Pretty ugly, I think. In addition, I may want to look more than one word ahead, necessitating the creation of yet another column. The easy but costly way is:

setkey(DT,docindex,position)
DT[,lead_lead_wordindex:=DT[list(docindex,position+2)][,wordindex]]

         wordindex docindex position lead_wordindex lead_lead_wordindex
      1:         1        1        1              1                   3
      2:         1        1        2              3                   3
      3:         3        1        3              3                   1
      4:         3        1        4              1                   2
      5:         1        1        5              2                   3
     ---                                                               
 999996:         2       10    99811              2                   3
 999997:         2       10    99812              3                   1
 999998:         3       10    99813              1                   3
 999999:         1       10    99814              3                  NA
1000000:         3       10    99815             NA                  NA

setkey(DT,wordindex,lead_wordindex,lead_lead_wordindex)
countr23<-DT[J(1,2,3),list(count.1=.N),by=docindex]
countr23

    docindex count.1
 1:        1    3684
 2:        2    3746
 3:        3    3717
 4:        4    3727
 5:        5    3700
 6:        6    3779
 7:        7    3702
 8:        8    3756
 9:        9    3702
10:       10    3744

However, I currently have to use the ugly filter-and-join way because of size.

So the question is, is there an easier and more beautiful way?

UPDATE:

Thanks to Arun and eddi for clean and simple code that solves the problem. On my ~200M row data, this solution works on a simple combination of words in about 10 seconds, which is quite good!

I do have an added issue, however, that makes the vector-scan approach less than optimal. Although in the example I am looking for only one word combination, in practice I may have a vector of words to look for in each position. When I change the "==" comparisons to "%in%" for that purpose (vectors of 100 or more words), the query takes much longer to run. So I would still be interested in a binary-search solution if one exists. However, if Arun doesn't know of one, it may not exist, and I will happily accept his answer.
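As a side note for later readers: data.table versions newer than this thread (1.9.6+) ship a built-in shift(), which keeps the vector-scan style but accepts word sets via %in% and computes per-group leads in one pass. Below is a minimal sketch on toy data; first_set and second_set are hypothetical names for the word vectors:

```r
library(data.table)

# toy stand-in for the word/document data
DT <- data.table(wordindex = c(1L, 3L, 1L, 2L, 1L, 1L),
                 docindex  = c(1L, 1L, 1L, 1L, 2L, 2L))
setkey(DT, docindex)

first_set  <- 1L          # in practice, a vector of 100+ words
second_set <- c(1L, 3L)

# shift(..., type = "lead") gives the next word within each document;
# %in% returns FALSE for the NA at each document's end, so no na.rm is needed
res <- DT[, sum(wordindex %in% first_set &
                shift(wordindex, 1L, type = "lead") %in% second_set),
          by = docindex]
res
#    docindex V1
# 1:        1  1
# 2:        2  1
```

If two positions ahead are needed, shift(wordindex, 1:2, type = "lead") returns both leads as a list in one call, avoiding a second pass.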

Tags: r data.table
3 Answers
何必那么认真 · answered 2020-07-11 08:46

Here's another idea that just sprung to my mind. It requires just creating one more column and uses binary search for subset.

On the DT you've generated from your data, first we'll add the extra column:

# the extra column:
DT[, I := .I]

We need this because we'll setkey on docindex and wordindex. This is the only way we can subset without creating extra columns (at least that I could think of). So we'll need a way to extract the "original" positions later to check your condition (hence the I).

After adding the extra column, let's set the key on the two columns mentioned above:

setkey(DT, docindex, wordindex)

Great! The idea from here is to extract the positions where your desired word matches - here that value is 1L. Then, extract all the other words you may (or may not) want to come after this word at the right position. Then, we simply keep (or remove) those indices that satisfy the condition.

Here's a function that'll take care of this. It is by no means complete, but should give you an idea.

foo <- function(DT, doc_key, word_key, rest_key=NULL, match=FALSE) {
    ## note that I'm using 1.9.3, where this join results in a vector
    ## if you're using 1.9.2, you'll have to change the joins accordingly
    # row ids (original row order = position order) of the word we anchor on
    idx1 = DT[J(doc_key, word_key), I]
    for (i in seq_along(rest_key)) {
        this_key = rest_key[i]
        # row ids of the word that should (or shouldn't) occur i positions ahead
        idx2 = DT[J(doc_key, this_key), I]
        if (match) idx1 = idx1[which((idx1+i) %in% idx2)]
        else idx1 = idx1[which(!(idx1+i) %in% idx2)]
    }
    DT[idx1, .N, by=c(key(DT)[1L])]
}

Here, DT is the data.table to which I column has been added, and then setkey has been called on the two columns as mentioned before.

doc_key basically contains all the unique values in docindex - here 1:10. word_key is basically 1L here. rest_key holds the values you'd like to check do (or do not) occur at the ith position after the match of word_key.

First we extract I for all matches of word_key into idx1 (straightforward). Next, we loop through each value of rest_key, shift the positions to idx1 + i, and check whether those shifted positions occur in idx2. Based on whether you want matching or non-matching entries, we keep or remove them.

And at the end of this loop, idx1 should have only the desired entries. Hope this helps. Shown below is a demonstration of the cases already discussed in the other answer.


Let's consider your first scenario: the count of all entries, for each group in docindex, where the ith position is 1L and the (i+1)th is not 2L. This is basically:

system.time(ans1 <- foo(DT, 1:10, 1L, 2L, FALSE))

#  user  system elapsed 
# 0.066   0.019   0.085 

# old method took 0.12 seconds

#     docindex     N
#  1:        1 22301
#  2:        2 21836
#  3:        3 22491
#  4:        4 21831
#  5:        5 22218
#  6:        6 21914
#  7:        7 22370
#  8:        8 22265
#  9:        9 22211
# 10:       10 22190

What about the second scenario? Here, we'd like the (i+1)th and (i+2)th positions to be 2L and 3L, as opposed to the not-equal scenario in the earlier case. So we set match=TRUE here.

system.time(ans2 <- foo(DT, 1:10, 1L, 2:3,TRUE))
#  user  system elapsed 
# 0.080   0.011   0.090 

# old method took 0.22 seconds

#     docindex    N
#  1:        1 3684
#  2:        2 3746
#  3:        3 3717
#  4:        4 3727
#  5:        5 3700
#  6:        6 3779
#  7:        7 3702
#  8:        8 3756
#  9:        9 3702
# 10:       10 3744

It's easy to expand this function. For example, if you'd like the (i+1)th position to equal 2L but the (i+2)th to not equal 3L, you can change match to be a logical vector of length(rest_key) specifying the corresponding condition for each position.
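A sketch of that extension, under the assumptions just stated (foo2 is a hypothetical name; it also subsets on the I column itself rather than by row number, which gives the same per-document counts but doesn't rely on the physical row order after setkey):

```r
library(data.table)

# Extension sketch: `match` becomes a logical vector, one entry per rest_key,
# saying whether the word i positions ahead must equal (TRUE) or must not
# equal (FALSE) rest_key[i].
foo2 <- function(DT, doc_key, word_key, rest_key = NULL,
                 match = rep(FALSE, length(rest_key))) {
    idx1 = DT[J(doc_key, word_key), I]
    for (i in seq_along(rest_key)) {
        idx2 = DT[J(doc_key, rest_key[i]), I]
        keep = (idx1 + i) %in% idx2
        idx1 = if (match[i]) idx1[keep] else idx1[!keep]
    }
    # subset on the I column itself, so the result doesn't depend on the
    # current physical row order of DT
    DT[I %in% idx1, .N, by = c(key(DT)[1L])]
}

# tiny demo: one document containing words 1, 2, 1, 3 in order
demo <- data.table(wordindex = c(1L, 2L, 1L, 3L), docindex = rep(1L, 4))
demo[, I := .I]
setkey(demo, docindex, wordindex)
foo2(demo, 1L, 1L, rest_key = 2L, match = TRUE)   # the single "1 then 2" hit
```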

I hope this is fast for your actual case - at least faster than the other version.

HTH

Juvenile、少年° · answered 2020-07-11 09:01

Just create a lead function and use it in your j-expression:

lead <- function(x, n)
    if (n == 0) x else c(tail(x, -n), rep.int(NA, n))
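For intuition, lead(x, n) just slides the vector left by n and pads the tail with NA, so the last n rows of each group can never match (values below are arbitrary; the definition is repeated so the snippet runs standalone):

```r
lead <- function(x, n)
    if (n == 0) x else c(tail(x, -n), rep.int(NA, n))

a <- lead(c(10, 20, 30), 1)  # 20 30 NA: each element paired with its successor
b <- lead(c(10, 20, 30), 2)  # 30 NA NA: two positions ahead
```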

If you'd like to get the count where wordindex at ith position is 1L and i+1th is not 2L, then:

DT[, sum(wordindex == 1L & lead(wordindex, 1L) != 2L, na.rm=TRUE), by=docindex]
#     docindex    V1
#  1:        1 22301
#  2:        2 21835
#  3:        3 22490
#  4:        4 21830
#  5:        5 22218
#  6:        6 21914
#  7:        7 22370
#  8:        8 22265
#  9:        9 22211
# 10:       10 22190

If you'd like to get counts where wordindex at i is 1L, i+1 is 2L and i+2 is 3L, then:

DT[, sum(wordindex == 1L & lead(wordindex, 1L) == 2L & 
          lead(wordindex, 2L) == 3L, na.rm=TRUE), by=docindex]
#     docindex   V1
#  1:        1 3684
#  2:        2 3746
#  3:        3 3717
#  4:        4 3727
#  5:        5 3700
#  6:        6 3779
#  7:        7 3702
#  8:        8 3756
#  9:        9 3702
# 10:       10 3744

Note that there's no need to setkey here; ad hoc grouping with by= should work great.


To address the comment:

This solution uses a vector scan in j as opposed to your binary-search approach. But there are different trade-offs here. Compared to the binary-search version, the code is relatively elegant: easy to read, easy to extend to multiple lags and conditions, and easy to maintain (I can't think of a binary-search way without creating extra columns). It also requires much less memory, which is a limiting factor in your case.

You say "large data", but don't say anything more. Vector scans on the entire data (say 20 million or 200 million rows) are costly, yes. But operating on each group, even if it doesn't match the performance of binary search, shouldn't be much slower. Of course, this again depends on the number of groups you have and the number of observations per group. It's better to benchmark these things and find out.

I'll leave you to it. Good luck :).

一夜七次 · answered 2020-07-11 09:04

It sounds like you simply want:

DT[, sum(wordindex == 1 & c(tail(wordindex, -1), 2) != 2), by = docindex]

I don't see the point of complicating it via joins.

Btw, this will give a different answer from yours in some cases, either because I don't understand what you want or because your method fails on some edge cases. E.g. try both methods on:

DT = data.table(wordindex = c(1,1,2,1,1,2), docindex = c(1,1,2,2,3,3))
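Tracing one visible difference on that data (my reading of the edge case, verified below): the j-expression reports every document, including zero counts, whereas the join-based approach produces no rows at all for documents with no matches:

```r
library(data.table)

DT <- data.table(wordindex = c(1, 1, 2, 1, 1, 2),
                 docindex  = c(1, 1, 2, 2, 3, 3))

# the vector-scan method: word 1 not followed by word 2, per document
res <- DT[, sum(wordindex == 1 & c(tail(wordindex, -1), 2) != 2),
          by = docindex]
res
#    docindex V1
# 1:        1  1
# 2:        2  0
# 3:        3  0
```

The join-based count, by contrast, would return only docindex 1 here, since documents 2 and 3 contribute no rows to group over.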