I am trying to detect matches between an open text field (read: messy!) and a vector of names. I created a silly fruit example that highlights my main challenges.
df1 <- data.frame(id = c(1, 2, 3, 4, 5, 6),
                  entry = c("Apple",
                            "I love apples",
                            "appls",
                            "Bannanas",
                            "banana",
                            "An apple a day keeps..."))
df1$entry <- as.character(df1$entry)
df2 <- data.frame(fruit = c("apple",
                            "banana",
                            "pineapple"),
                  code = c(11, 12, 13))
df2$fruit <- as.character(df2$fruit)
library(dplyr)
library(stringr)

df1 %>%
  mutate(match = str_detect(str_to_lower(entry),
                            str_to_lower(df2$fruit)))
My approach grabs the low-hanging fruit, if you will (exact matches for "Apple" and "banana").
#   id                   entry match
# 1  1                   Apple  TRUE
# 2  2           I love apples FALSE
# 3  3                   appls FALSE
# 4  4                Bannanas FALSE
# 5  5                  banana  TRUE
# 6  6 An apple a day keeps... FALSE
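One thing worth checking when reading this output: str_detect() is vectorised over both the string and the pattern, so passing the three-element df2$fruit pairs each entry with a single recycled fruit (entry 2 with "banana", entry 3 with "pineapple", and so on) rather than testing it against all three. Below is a minimal base-R sketch of that pairing next to a combined "any fruit" pattern. The names entries, fruits, paired and any_fruit are mine, just for illustration.

```r
entries <- c("Apple", "I love apples", "appls",
             "Bannanas", "banana", "An apple a day keeps...")
fruits  <- c("apple", "banana", "pineapple")

# Which fruit each entry ends up paired with when a length-3
# pattern vector is recycled across six strings
paired <- rep_len(fruits, length(entries))

# Collapse the fruits into one alternation, "apple|banana|pineapple",
# so every entry is tested against every fruit
any_fruit <- grepl(paste(fruits, collapse = "|"), tolower(entries))

data.frame(entry = entries, paired_with = paired, any_fruit = any_fruit)
```

If that reading is right, the embedded cases are already caught by plain substring detection once each entry is compared with every fruit, and only the misspellings are left over.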
The unmatched cases have different challenges:
- The target fruit in cases 2 and 6 is embedded in a larger string.
- The target fruit in cases 3 and 4 requires a fuzzy match.
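To make the two challenges concrete, here is a rough base-R sketch that tackles both at once with utils::adist(). With partial = TRUE it scores each fruit against the best-matching substring of each entry, using the same kind of distance agrep() uses. The 2-edit cut-off and the names d, best and matched are assumptions of mine, not something I have tuned.

```r
entries <- c("Apple", "I love apples", "appls",
             "Bannanas", "banana", "An apple a day keeps...")
fruits  <- c("apple", "banana", "pineapple")

# Rows are fruits, columns are entries; partial = TRUE lets a fruit
# match a substring of the entry instead of the whole string
d <- utils::adist(fruits, tolower(entries), partial = TRUE)

# Index of the closest fruit for each entry
best <- apply(d, 2, which.min)

matched <- data.frame(
  entry    = entries,
  fruit    = fruits[best],
  distance = d[cbind(best, seq_along(entries))],
  stringsAsFactors = FALSE
)

# Arbitrary cut-off: anything more than 2 edits away counts as no match
matched$fruit[matched$distance > 2] <- NA
matched
```

The cut-off would probably need to scale with the length of the name on real data, since 2 edits is a lot for a five-letter word.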
The fuzzywuzzyR package is great and does a pretty good job (see page for details on installing Python modules).
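library(fuzzywuzzyR)

choices <- df2$fruit
word <- df1$entry[3] # "appls"

# Pre-processing applied to strings before scoring
init_proc <- FuzzUtils$new()
PROC <- init_proc$Full_process
PROC1 <- tolower # alternative processor, not used below

# Scoring function
init_scor <- FuzzMatcher$new()
SCOR <- init_scor$WRATIO

# Score `word` against every candidate in `choices`
init <- FuzzExtract$new()
init$Extract(string = word,
             sequence_strings = choices,
             processor = PROC,
             scorer = SCOR)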
This setup returns a score of 80 for "apple" (the highest).
Is there another approach to consider aside from fuzzywuzzyR? How would you tackle this problem?
Adding fuzzywuzzyR output:
[[1]]
[[1]][[1]]
[1] "apple"
[[1]][[2]]
[1] 80
[[2]]
[[2]][[1]]
[1] "pineapple"
[[2]][[2]]
[1] 72
[[3]]
[[3]][[1]]
[1] "banana"
[[3]][[2]]
[1] 18
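The nested list is a bit awkward to work with, so here is a small sketch of flattening it into a data frame and picking the top score. To keep it runnable without the Python dependency, res is hard-coded with the values from the output above instead of calling Extract(). The names res, scores and best are just illustrative.

```r
# Same shape as the Extract() output above: one list(name, score) per choice
res <- list(list("apple", 80), list("pineapple", 72), list("banana", 18))

# Flatten into one row per candidate fruit
scores <- data.frame(
  fruit = vapply(res, function(x) x[[1]], character(1)),
  score = vapply(res, function(x) as.numeric(x[[2]]), numeric(1)),
  stringsAsFactors = FALSE
)

# Keep the highest-scoring candidate
best <- scores[which.max(scores$score), ]
best
```

In principle the same flattening could be wrapped in a function and applied to each entry in df1, though I have not tried that on the real data.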