I have a long character vector (e.g. "Hello World", etc.) with 1.7M elements, and I need to substitute words in it using a mapping between two vectors, saving the result in the same vector. Here's a simple example:
library(qdap)
line = c("one", "two one", "four phones")
e = c("one", "two")
r = c("ONE", "TWO")
line = mgsub(e,r,line)
Result:
[1] "ONE" "TWO ONE" "four phONEs"
As you can see, each instance of e[j] in line gets substituted with r[j], and only r[j].
It works fine on a relatively small "line" and a short e->r vocabulary, but when I run it with length(line) = 1700000 and length(e) = 750, I reach the total allocated memory:
Reached total allocation of 7851Mb: see help(memory.size)
Any ideas how to avoid it?
Update to the problem (to Admins: if it doesn't deserve a separate answer, please merge it with the original one). The reason mgsub ran so fast compared to a simple for loop was that in mgsub the parameter fixed = TRUE by default, while in gsub it is FALSE by default! I just discovered it.
I'd like to clarify again that fixed = TRUE is not appropriate for me, as I do not want to replace caps in capsule, but only the whole word caps. I.e. I am forced to paste \\b onto both ends of each pattern. Here are three snippets from my code (I tested fixed = TRUE in gsub just to see the time difference; I'm not going to use it). Here are the times and memory usage for all three cases for different input sizes:
Thus, I conclude that for my application, where fixed must be FALSE, there's no advantage in using mgsub. In fact, a for loop is faster and does not cause memory overflow!
Thanks to all involved. I wish I could give commenters credit, but I don't know how to do it in "Comments".
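For readers following along, the whole-word for loop described above could look roughly like the sketch below. This is not the original code from the post; the fourth input string and the caps/CAPS pair are added here purely for illustration, and the patterns are assumed to contain no regex metacharacters.

```r
# Illustrative data (the "capsule caps" element and the caps -> CAPS pair
# are assumptions added to show the whole-word behaviour)
line <- c("one", "two one", "four phones", "capsule caps")
e <- c("one", "two", "caps")
r <- c("ONE", "TWO", "CAPS")

# Wrap each pattern in \\b so that only whole words match.
# Assumes the entries of e contain no regex metacharacters.
patterns <- paste0("\\b", e, "\\b")

# Apply the replacements one pattern at a time, overwriting line in place.
# fixed is left at its default (FALSE), so the patterns are treated as regexes.
for (j in seq_along(e)) {
  line <- gsub(patterns[j], r[j], line)
}

line
```

Because each pass overwrites line, only one modified copy of the vector is alive at a time, which is presumably why this stays within memory where the vectorised call did not.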
I believe you can use fixed = TRUE. It sounds like you're concerned with spaces, so just add spaces to the ends of all 3 vectors you're working with. Running this whole sequence from ## Start to ## Finish (roughly the size of your data) takes Time difference of 2.906395 secs on 1.7 million strings. The majority of the time is spent at the end, stripping off the extra spaces.
Here qdap's mgsub is not useful. The package was designed for much smaller data. Additionally, fixed = TRUE is a sensible default because it is so much faster. The point of an add-on package is to improve workflow (sometimes field- or task-specific) through a reconfiguration of available tools. The mgsub function also has some error handling and other niceties that are useful in the analysis of transcripts but that make the function hog memory. There's often a trade-off between safety + syntactic sugar vs. speed.
Note that just because 2 functions are named in similar ways should not imply anything, particularly if they are found in add-on packages. Even functions within base R have differently named and behaving defaults (look at the apply family of functions; this problem is less than ideal but is part of the historical evolution of R). It is incumbent upon you as a user to read the documentation rather than make assumptions.
The stringi package provides fast, consistent tools for lots of string manipulation tasks:
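A minimal sketch of what such a stringi call might look like, assuming the stringi package is installed and reusing the example vectors from the question (the variable name out is only for illustration):

```r
library(stringi)

line <- c("one", "two one", "four phones")
e <- c("one", "two")
r <- c("ONE", "TWO")

# vectorize_all = FALSE applies every pattern/replacement pair, in order,
# to every element of line instead of pairing them up element by element.
# The \\b anchors restrict each match to whole words.
out <- stri_replace_all_regex(
  line,
  pattern = paste0("\\b", e, "\\b"),
  replacement = r,
  vectorize_all = FALSE
)

out
```

Here "phones" is left alone because \\b requires a word boundary on both sides of "one".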
Darn near as fast (fractions of a second different) as the other method and more straightforward.
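For comparison, the space-padding approach with fixed = TRUE described at the start of this answer might be sketched as follows. This is a reconstruction under assumptions rather than the exact timed code; the names line2, e2, r2 and result are hypothetical.

```r
line <- c("one", "two one", "four phones")
e <- c("one", "two")
r <- c("ONE", "TWO")

## Start
# Pad all three vectors with a space on each end so that a literal
# (fixed = TRUE) match can only hit space-delimited words.
line2 <- paste0(" ", line, " ")
e2 <- paste0(" ", e, " ")
r2 <- paste0(" ", r, " ")

for (j in seq_along(e2)) {
  line2 <- gsub(e2[j], r2[j], line2, fixed = TRUE)
}

# Strip the single padding space from each end again
result <- gsub("^ | $", "", line2)
## Finish

result
```

One caveat of this sketch: words are only recognised when delimited by spaces, so a word followed by punctuation is missed, and of two adjacent identical words ("one one") only the first is replaced, because the shared space is consumed by the first match.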