I am trying to clean about 2 million entries in a database consisting of job titles. Many contain abbreviations that I want to replace with a single, consistent, more easily searchable form. So far I am simply running through the column with individual mapply(gsub, ...) commands, but I have about 80 changes to make this way, so it takes almost 30 minutes to run.
There has got to be a better way. I'm new to string searching; I found the *$ trick, which helped. Is there a way to do more than one search in a single mapply? I imagine that might be faster.
Any help would be great. Thanks.
Here is some of the code; test is a column of 2 million individual job titles.
test <- mapply(gsub, " Admin ", " Administrator ", test)
test <- mapply(gsub, "Admin ", "Administrator ", test)
test <- mapply(gsub, " Admin*$", " Administrator", test)
test <- mapply(gsub, "Acc ", " Accounting ", test)
test <- mapply(gsub, " Admstr ", " Administrator ", test)
test <- mapply(gsub, " Anlyst ", " Analyst ", test)
test <- mapply(gsub, "Anlyst ", "Analyst ", test)
test <- mapply(gsub, " Asst ", " Assistant ", test)
test <- mapply(gsub, "Asst ", "Assistant ", test)
test <- mapply(gsub, " Assoc ", " Associate ", test)
test <- mapply(gsub, "Assoc ", "Associate ", test)
One option would be to use mgsub from library(qdap).
Here is a base R solution which works. You can define a data frame which will contain all patterns and their replacements. Then you use apply() in row mode and call gsub() on your test vector for each pattern/replacement combination. Here is sample code demonstrating this:
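A minimal sketch of that idea, under a few assumptions: the lookup table name map, its column names, and the four sample titles standing in for test are all hypothetical. It uses a plain for loop over the rows in place of apply(), which does the same work without collecting intermediate results, and it swaps the space-padded patterns for \\b word boundaries so that one pattern covers the start, middle, and end of a title. Because gsub() is already vectorized over its input, each pass processes the whole column at once and no mapply is needed.

```r
# Hypothetical lookup table: one row per pattern/replacement pair.
# \\b marks a word boundary (an assumption, not in the original patterns),
# so "Admin" is not matched inside "Administrator".
map <- data.frame(
  pattern     = c("\\bAdmin\\b", "\\bAsst\\b", "\\bAssoc\\b", "\\bAnlyst\\b"),
  replacement = c("Administrator", "Assistant", "Associate", "Analyst"),
  stringsAsFactors = FALSE
)

# Small stand-in for the real column of 2 million job titles.
test <- c("Admin Asst", "Sr Anlyst", "Assoc Admin", "Asst to the Admin")

# One vectorized gsub() call per row of the lookup table.
for (i in seq_len(nrow(map))) {
  test <- gsub(map$pattern[i], map$replacement[i], test, perl = TRUE)
}

print(test)
```

With about 80 rows in the table this is 80 vectorized calls in total, rather than 80 calls per job title.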