Given a vector of strings texts and a vector of patterns patterns, I want to find any matching pattern for each text. For small datasets, this can be easily done in R with grepl:
patterns = c("some","pattern","a","horse")
texts = c("this is a text with some pattern", "this is another text with a pattern")
# for each x in patterns
lapply( patterns, function(x){
# match all texts against pattern x
res = grepl( x, texts, fixed=TRUE )
print(res)
# do something with the matches
# ...
})
This solution is correct, but it doesn't scale up. Even with moderately bigger datasets (~500 texts and patterns), this code is embarrassingly slow, solving only about 100 cases per second on a modern machine - which is ridiculous considering that this is crude partial string matching, without regex (set with fixed=TRUE). Even making the lapply parallel does not solve the issue.
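To make the timing reproducible, here is a sketch that generates synthetic data of roughly the size described above; the sizes and random strings are illustrative only, not my real data:

```r
# generate ~500 random patterns and texts (illustrative sizes only)
set.seed(1)
patterns <- replicate(500, paste(sample(letters, 8,   replace = TRUE), collapse = ""))
texts    <- replicate(500, paste(sample(letters, 200, replace = TRUE), collapse = ""))

# time the pattern-by-pattern grepl loop
system.time(
  res <- lapply(patterns, function(x) grepl(x, texts, fixed = TRUE))
)
```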
Is there a way to re-write this code efficiently?
Thanks, Mulone
Use the stringi package - it's even faster than grepl. Check the benchmarks! I used the text from @Martin-Morgan's post.

Have you accurately characterized your problem and the performance you're seeing? Here are the Complete Works of William Shakespeare and a query against them, which seems to be much more performant than you imply.
We're expecting linear scaling with both the length (number of elements) of patterns and texts. It seems I misremember my Shakespeare.
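For reference, a minimal sketch of the stringi approach suggested above, using stri_detect_fixed (the fixed-string counterpart of grepl(..., fixed=TRUE)); it assumes the stringi package is installed, and reuses the small example data from the question:

```r
library(stringi)

patterns <- c("some", "pattern", "a", "horse")
texts    <- c("this is a text with some pattern",
              "this is another text with a pattern")

# logical matrix of hits: one row per text, one column per pattern
hits <- vapply(patterns,
               function(p) stri_detect_fixed(texts, p),
               logical(length(texts)))

# does each text match at least one pattern?
any_match <- rowSums(hits) > 0
```

Looping over patterns is still O(patterns x texts) comparisons, but stringi's fixed-string search has much lower per-call overhead than grepl, which is where the speedup comes from.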