Fast grep with a vectored pattern or match, to ret

Posted 2019-09-02 01:49

Question:

I guess this is trivial, and I apologize, but I couldn't find how to do it.

I am trying to avoid a loop, so I want to vectorize the process: I need something like grep, but where the pattern is a vector. Another option is something like match, but one that returns every matching location rather than only the first.

Example data (this is not how the real data looks, otherwise I would exploit its structure):

COUNTRIES=c("Austria","Belgium","Denmark","France","Germany",
"Ireland","Italy","Luxembourg","Netherlands",
"Portugal","Sweden","Spain","Finland","United Kingdom")

COUNTRIES_Target=rep(COUNTRIES,times=4066)
COUNTRIES_Origin=rep(COUNTRIES,each=4066)

Currently I have a loop that does this:

var_pointer=list()
for (i in seq_along(COUNTRIES_Origin))
{
  var_pointer[[i]]=which(COUNTRIES_Origin[i]==COUNTRIES_Target)
}

The problem with match is that match(x=COUNTRIES_Origin,table=COUNTRIES_Target) returns a vector of the same length as COUNTRIES_Origin in which each value is the position of the first match only, while I need all of the matching positions.
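To illustrate the difference on a tiny vector (target is a made-up example, not the real data), a minimal sketch:

```r
# Hypothetical three-element vector with one repeated value.
target <- c("Austria", "Belgium", "Austria")

# match() stops at the first position where the value occurs.
first_hit <- match("Austria", target)

# which() on an equality test returns every matching position.
all_hits <- which(target == "Austria")
```

Here first_hit holds a single position, while all_hits holds both positions of "Austria".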

The issue with grep is that grep(pattern=COUNTRIES_Origin,x=COUNTRIES_Target) gives this warning: Warning message: In grep(pattern = COUNTRIES_Origin, x = COUNTRIES_Target) : argument 'pattern' has length > 1 and only the first element will be used

Any suggestions?

Answer 1:

Trying to vectorize M×N matches is fundamentally not very performant: no matter how you do it, it is still M·N comparisons.

Use hashes instead for O(1) lookup.

For recommendations on using the hash package, see "Can I use a list as a hash in R? If so, why is it so slow?"
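As a rough sketch of the hashing idea in base R (this uses a hashed environment rather than the hash package, and the names index and var_pointer are only illustrative), one could build the value-to-positions index once and then look up each origin by name. The data is shrunk to times=3 so the example runs instantly:

```r
COUNTRIES <- c("Austria", "Belgium", "Denmark", "United Kingdom")
COUNTRIES_Target <- rep(COUNTRIES, times = 3)
COUNTRIES_Origin <- rep(COUNTRIES, each = 3)

# One pass over the target: group its positions by country name.
positions_by_country <- split(seq_along(COUNTRIES_Target), COUNTRIES_Target)

# Store the groups in a hashed environment, keyed by country name.
index <- list2env(positions_by_country, hash = TRUE)

# One hashed lookup per origin element; unname() drops the keys so the
# result has the same shape as the list built by the original loop.
var_pointer <- unname(mget(COUNTRIES_Origin, envir = index))
```

This assumes every value in COUNTRIES_Origin also occurs in COUNTRIES_Target; mget() raises an error for a missing key unless ifnotfound is supplied.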



Answer 2:

It seems like you can just lapply over the vector rather than writing an explicit loop.

lapply(COUNTRIES_Origin, function(x) which(COUNTRIES_Target==x))

Here I use which because grep is better suited to partial (pattern) matches, and you're looking for exact matches.
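A small sketch of that difference, using a made-up vector (places is not from the question):

```r
# Hypothetical vector in which one name contains another.
places <- c("Ireland", "Northern Ireland", "Iceland")

# grep() treats the pattern as a regular expression, so it also hits
# any element that merely contains "Ireland".
partial_hits <- grep("Ireland", places)

# An equality test only hits elements that are exactly "Ireland".
exact_hits <- which(places == "Ireland")

# grep() can be made exact by anchoring the pattern at both ends.
anchored_hits <- grep("^Ireland$", places)
```

Anchoring works here because the names contain no regex metacharacters; for arbitrary strings the equality test is the safer choice.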