Efficient String Search and Replace

I am trying to clean about 2 million entries in a database of job titles. Many contain abbreviations that I want to change to a single, consistent, more easily searchable form. So far I am simply running through the column with individual mapply(gsub, ...) commands, but I have about 80 changes to make this way, so it takes almost 30 minutes to run. There has got to be a better way. I'm new to string searching; I found the *$ trick, which helped. Is there a way to do more than one search in a single mapply? I imagine that might be faster. Any help would be great. Thanks.

Here is some of the code. test is a column of 2 million individual job titles.

test <- mapply(gsub, " Admin ", " Administrator ", test)
test <- mapply(gsub, "Admin ", "Administrator ", test)
test <- mapply(gsub, " Admin*$", " Administrator", test)
test <- mapply(gsub, "Acc ", " Accounting ", test)
test <- mapply(gsub, " Admstr ", " Administrator ", test)
test <- mapply(gsub, " Anlyst ", " Analyst ", test)
test <- mapply(gsub, "Anlyst ", "Analyst ", test)
test <- mapply(gsub, " Asst ", " Assistant ", test)
test <- mapply(gsub, "Asst ", "Assistant ", test)
test <- mapply(gsub, " Assoc ", " Associate ", test)
test <- mapply(gsub, "Assoc ", "Associate ", test)

Tags: regex r performance data-cleansing

2 Answers

放我归山

Reply #2 · 2019-08-04 13:37

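One option would be to use mgsub() from the qdap package, which takes a vector of patterns and a matching vector of replacements.

data

library(qdap)

patternVec <- c(" Admin ", "Admin ")
replaceVec <- c(" Administrator ", "Administrator ")

# mgsub() matches patterns literally by default (fixed = TRUE);
# pass fixed = FALSE if the patterns are regular expressions.
test <- mgsub(patternVec, replaceVec, test)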


【Aperson】

Reply #3 · 2019-08-04 13:51

Here is a base R solution. Define a data frame that holds every pattern and its replacement, then use apply() in row mode to call gsub() on your test vector once per pattern/replacement pair. Note that gsub() is already vectorized over its input, so each call processes the whole column and no mapply() is needed. Here is sample code demonstrating this:

df <- data.frame(pattern=c(" Admin ", "Admin "),
                 replacement=c(" Administrator ", "Administrator "))

test <- c(" Admin ", "Admin ")

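# Run gsub() once per row of df; <<- assigns to the existing test variable
# outside the function. invisible() suppresses apply()'s printed return value.
invisible(apply(df, 1, function(x) {
                test <<- gsub(x[1], x[2], test)
             }))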

> test
[1] " Administrator " "Administrator "
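
As a variation on the idea above, here is a minimal sketch that avoids <<- and folds the separate " Admin ", "Admin " and " Admin*$" patterns into one pattern per abbreviation by using word boundaries (\\b). The lookup vector, the expand_abbrev() helper and the sample titles are hypothetical names made up for illustration; they are not from the question or either answer, and I have not timed this on 2 million rows.

```r
# Hypothetical lookup table: names are the abbreviations, values the full words.
lookup <- c(Admin  = "Administrator",
            Admstr = "Administrator",
            Anlyst = "Analyst",
            Asst   = "Assistant",
            Assoc  = "Associate")

# Illustrative helper (an assumption, not part of the original answers).
expand_abbrev <- function(x, lookup) {
  for (abbr in names(lookup)) {
    # \\b is a word boundary, so one pattern covers the start, middle and
    # end of a title without matching inside longer words.
    # gsub() is vectorized over x, so each call handles the whole column.
    x <- gsub(paste0("\\b", abbr, "\\b"), lookup[[abbr]], x, perl = TRUE)
  }
  x
}

titles <- c("Admin Asst", "Sr Anlyst", "Assoc Admin", "Administrator")
expand_abbrev(titles, lookup)
# [1] "Administrator Assistant" "Sr Analyst"
# [3] "Associate Administrator" "Administrator"
```

With roughly 80 abbreviations this is 80 vectorized gsub() calls in total, rather than one gsub() call per title per pattern as with mapply(). The word boundary is also what keeps an already expanded "Administrator" from being rewritten again.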
