Note: This is a question that I originally posted to the data.table help group. Matt Dowle asked for a more detailed example and I posted this one, but I had trouble with the formatting in e-mail. I already know how to format on SO, so I thought I would post it here instead.
What I am basically trying to do is subset rows from a data.table based on a value in that row as well as a value in a preceding or following row. Right now, I am creating new columns for future and past rows and then keying the data.table on these columns, but this is resource-intensive and onerous.
The below example illustrates the approach I am using now. The example uses words in documents (I use numeric indices for both). I want to subset for a particular word, but only if it is preceded or followed by another word or set of words:
I first create a dummy dataset of one million words spread across ten documents. There are three unique words in the set.
library(data.table)
set.seed(1000)
DT<-data.table(wordindex=sample(1:3,1000000,replace=T),docindex=sample(1:10,1000000,replace=T))
setkey(DT,docindex)
DT[,position:=seq_len(.N),by=docindex]
wordindex docindex position
1: 1 1 1
2: 1 1 2
3: 3 1 3
4: 3 1 4
5: 1 1 5
---
999996: 2 10 99811
999997: 2 10 99812
999998: 3 10 99813
999999: 1 10 99814
1000000: 3 10 99815
Note that simply counting the occurrences of the first unique word across all documents is easy and beautiful.
setkey(DT,wordindex)
count<-DT[J(1),list(count.1=.N),by=docindex]
count
docindex count.1
1: 1 33533
2: 2 33067
3: 3 33538
4: 4 33053
5: 5 33231
6: 6 33002
7: 7 33369
8: 8 33353
9: 9 33485
10: 10 33225
It gets messier when taking the position ahead into account. This is a query to count the occurrences of the first unique word across all documents unless it is followed by the second unique word. First I create a new column containing the word one position ahead and then key on both words.
setkey(DT,docindex,position)
DT[,lead_wordindex:=DT[list(docindex,position+1)][,wordindex]]
wordindex docindex position lead_wordindex
1: 1 1 1 1
2: 1 1 2 3
3: 3 1 3 3
4: 3 1 4 1
5: 1 1 5 2
---
999996: 2 10 99811 2
999997: 2 10 99812 3
999998: 3 10 99813 1
999999: 1 10 99814 3
1000000: 3 10 99815 NA
setkey(DT,wordindex,lead_wordindex)
countr2<-DT[J(c(1,1),c(1,3)),list(count.1=.N),by=docindex]
countr2
docindex count.1
1: 1 22301
2: 2 21835
3: 3 22490
4: 4 21830
5: 5 22218
6: 6 21914
7: 7 22370
8: 8 22265
9: 9 22211
10: 10 22190
I have a very large dataset for which the above query fails with a memory-allocation error. As an alternative, we can create this new column for only the relevant subset of data by filtering the original dataset and then joining it back on the desired position:
setkey(DT,wordindex)
filter<-DT[J(1),list(wordindex,docindex,position)]
filter[,lead_position:=position+1]
wordindex wordindex docindex position lead_position
1: 1 1 2 99717 99718
2: 1 1 3 99807 99808
3: 1 1 4 100243 100244
4: 1 1 1 1 2
5: 1 1 1 42 43
---
332852: 1 1 10 99785 99786
332853: 1 1 10 99787 99788
332854: 1 1 10 99798 99799
332855: 1 1 10 99804 99805
332856: 1 1 10 99814 99815
setkey(DT,docindex,position)
filter[,lead_wordindex:=DT[J(filter[,list(docindex,lead_position)])][,wordindex]]
wordindex wordindex docindex position lead_position lead_wordindex
1: 1 1 2 99717 99718 NA
2: 1 1 3 99807 99808 NA
3: 1 1 4 100243 100244 NA
4: 1 1 1 1 2 1
5: 1 1 1 42 43 1
---
332852: 1 1 10 99785 99786 3
332853: 1 1 10 99787 99788 3
332854: 1 1 10 99798 99799 3
332855: 1 1 10 99804 99805 3
332856: 1 1 10 99814 99815 3
setkey(filter,wordindex,lead_wordindex)
countr2.1<-filter[J(c(1,1),c(1,3)),list(count.1=.N),by=docindex]
countr2.1
docindex count.1
1: 1 22301
2: 2 21835
3: 3 22490
4: 4 21830
5: 5 22218
6: 6 21914
7: 7 22370
8: 8 22265
9: 9 22211
10: 10 22190
Pretty ugly, I think. In addition, I may want to look more than one word ahead, necessitating the creation of yet another column. The easy but costly way is:
setkey(DT,docindex,position)
DT[,lead_lead_wordindex:=DT[list(docindex,position+2)][,wordindex]]
wordindex docindex position lead_wordindex lead_lead_wordindex
1: 1 1 1 1 3
2: 1 1 2 3 3
3: 3 1 3 3 1
4: 3 1 4 1 2
5: 1 1 5 2 3
---
999996: 2 10 99811 2 3
999997: 2 10 99812 3 1
999998: 3 10 99813 1 3
999999: 1 10 99814 3 NA
1000000: 3 10 99815 NA NA
setkey(DT,wordindex,lead_wordindex,lead_lead_wordindex)
countr23<-DT[J(1,2,3),list(count.1=.N),by=docindex]
countr23
docindex count.1
1: 1 3684
2: 2 3746
3: 3 3717
4: 4 3727
5: 5 3700
6: 6 3779
7: 7 3702
8: 8 3756
9: 9 3702
10: 10 3744
However, I currently have to use the ugly filter-and-join way because of size.
So the question is, is there an easier and more beautiful way?
UPDATE:
Thanks to Arun and eddi for clean and simple code that solves the problem. On my ~200M row data, this solution works on a simple combination of words in about 10 seconds, which is quite good!
I do have an added issue, however, that makes the vector-scan approach less than optimal. Although in the example I am looking for only one word combination, in practice I may have a vector of words to look for in each position. When I change the "==" statements to "%in%" for that purpose (vectors of 100 words or more), the query takes much longer to run. So I would still be interested in a binary-search solution if one exists. However, if Arun doesn't know of one, it might not exist, and I would happily accept his answer.
Here's another idea that just sprang to my mind. It requires creating just one more column, and it uses binary search for the subset.
On the DT you've generated from your data, first we'll add the extra column.
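For example, using the running row number (a sketch; it assumes DT is still ordered by docindex and position at this point):

DT[, I := .I]  # I = each row's original position within the docindex/position order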
We need this because we'll setkey on docindex and wordindex. This is the only way we can subset without creating extra columns (at least, the only way I could think of). So we'll need a way to extract the "original" positions later to check for your condition (hence the I).
After adding the extra column, let's set the key on the two columns mentioned above:
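setkey(DT, docindex, wordindex)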
Great! The idea from here is to extract the positions where your desired word matches - here that value is 1L. Then, extract all the other words you may (or may not) want to come after this word at the right position. Then we simply keep (or remove) those indices that satisfy the condition. Here's a function that'll take care of this. It is by no means complete, but it should give you an idea.
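Something along these lines (the function name and argument handling are illustrative):

find_matches <- function(DT, doc_key, word_key, rest_key, match = FALSE) {
    # original positions (I) of the target word, via binary search on (docindex, wordindex)
    idx1 <- DT[J(doc_key, word_key)]$I
    for (i in seq_along(rest_key)) {
        # original positions of the word to check for at the i-th place after word_key
        idx2 <- DT[J(doc_key, rest_key[i])]$I
        hit <- (idx1 + i) %in% idx2
        # keep matching entries if match = TRUE, non-matching ones otherwise
        idx1 <- if (match) idx1[hit] else idx1[!hit]
    }
    # note: positions at the very end of a document can spill into the next one;
    # handling that is left out of this sketch
    idx1
}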
Here, DT is the data.table to which the I column has been added and on which setkey has then been called on the two columns mentioned before. doc_key basically contains all the unique values in docindex - here 1:10. word_key is basically 1L here. rest_key holds the values you'd like to check do (or do not) occur at the ith position after the position of word_key.
First we extract I for all matches of 1L into idx1 (straightforward). Next, we loop through each value of rest_key, extract the positions of that word into idx2, shift idx1 by i (idx1 + i), and check whether each shifted position occurs in idx2. Based on whether you'd like to extract matching or non-matching entries, we keep (or remove) them. At the end of this loop, idx1 should hold only the desired entries. Hope this helps. Shown below is a demonstration of the cases already discussed in the other answer.
Let's consider your first scenario: the count of all entries, for each group in docindex, where the ith position is 1L and the i+1th is not 2L. This is basically:
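# occurrences of 1L not followed by 2L, counted per document
idx <- find_matches(DT, doc_key = 1:10, word_key = 1L, rest_key = 2L, match = FALSE)
DT[J(1:10, 1L)][I %in% idx, list(count.1 = .N), by = docindex]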
What about the second scenario? Here, we'd like the i+1th and i+2th positions to be 2L and 3L, as opposed to the not-equal condition in the earlier case. So we set match=TRUE here:
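# occurrences of 1L followed by 2L and then 3L
idx <- find_matches(DT, doc_key = 1:10, word_key = 1L, rest_key = c(2L, 3L), match = TRUE)
DT[J(1:10, 1L)][I %in% idx, list(count.1 = .N), by = docindex]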
It's easy to expand this function. For example, if you'd like the i+1th position to be equal to 2L but the i+2th not equal to 3L, you can change match to be a vector of length(rest_key) specifying the corresponding logical values. I hope this is fast for your actual case - at least faster than the other version.
HTH
Just create a lead function and use it in your j-expression.
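A minimal version along these lines (padding the tail of each group with NA):

lead <- function(x, n = 1L)
    if (n == 0L) x else c(x[-seq_len(n)], rep(NA, n))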
If you'd like to get the count where wordindex at the ith position is 1L and at the i+1th is not 2L, then:
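# na.rm drops the last position of each document, where lead() is NA
DT[, list(count.1 = sum(wordindex == 1L & lead(wordindex, 1L) != 2L, na.rm = TRUE)), by = docindex]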
If you'd like to get counts where wordindex at i is 1L, at i+1 is 2L, and at i+2 is 3L, then:
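DT[, list(count.1 = sum(wordindex == 1L & lead(wordindex, 1L) == 2L & lead(wordindex, 2L) == 3L, na.rm = TRUE)), by = docindex]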
Note that there's no need to setkey here - ad hoc by should work great.
To address the comment:
This solution uses a vector scan in j as opposed to your binary-search approach. But there are different trade-offs here. Compared to the binary-search version, the code is relatively elegant, easy to read, and easy to extend to multiple lags and conditions and to maintain (I can't think of a binary-search way without creating extra columns). It also requires much less memory, which is another limiting factor in your case.
You say "large data", but don't say anything more. Vector scans on the entire data (say 20 million or 200 million rows) are costly, yes. But operating on each group, even if it won't match the performance of binary search, shouldn't be much slower. Of course, this again depends on the number of groups you have and the number of observations per group. But it's better to benchmark these things and find out.
I'll leave you to it. Good luck :).
It sounds like you simply want something along these lines:
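# a sketch: count occurrences of 1L whose next word within the document is not 2L;
# the padding value 0L is an assumption - it makes the final word of each document
# count as "not followed by 2L"
DT[, list(count.1 = sum(wordindex == 1L & c(wordindex[-1L], 0L) != 2L)), by = docindex]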
I don't see the point of complicating it via joins.
Btw, you will get a different answer from yours in some cases, which is either because I don't understand what you want or because your method fails on some edge cases. E.g. try both methods for a document whose last word is the target word: there is no following word there, and whether that occurrence counts as "not followed" is exactly where the two approaches can disagree.