Find matching strings between two vectors in R

Posted 2019-02-18 06:09

Question:

I have two vectors in R. I want to find partial matches between them.

My Data

The first one is from a dataset named muc, which contains 6400 street names. muc$name looks like:

muc$name = c("Berberichweg", "Otto-Klemperer-Weg", "Feldmeierbogen" , "Altostraße",...)

The other vector is d_vector. It contains around 1400 names.

d_vector = c("Abel", "Abendroth", "von Abercron", "Abetz", "Abicht", "Abromeit", ...)

I want to find all the street names that contain a name from d_vector anywhere in the street name.

First, I made some general adjustments after importing the CSV data (as variable d):

d_vector <- unlist(d$name)
d_vector <- as.vector(as.matrix(d_vector))

What I tried so far

  • I tried grep, collapsing d_vector into one long string separated by | for a regex search:

result <- unique(grep(paste(d_vector, collapse="|"), muc$Name, value=TRUE, ignore.case = TRUE))
result

But the result returns all the street names.

  • I also tried to use agrep, which returned an out-of-memory error.

  • When I tried d_vector %in% muc$name, it returned just one TRUE and hundreds of FALSE, which doesn't seem right.

Do you have any suggestions as to where my mistake might lie, or which library I could use? I am looking for something like Python's "fuzzywuzzy" for R.

Answer 1:

Simple solution:

streets = c("Berberichweg", "Otto-Klemperer-Weg", "Feldmeierbogen" , "Altostraße")
streets = tolower(streets) #Lowercase all
names = c("Berber", "Weg")
names = tolower(names)

sapply(names, function (y) sapply(streets, function (x) grepl(y, x)))

#                    berber   weg
# berberichweg         TRUE  TRUE
# otto-klemperer-weg  FALSE  TRUE
# feldmeierbogen      FALSE FALSE
# altostraße          FALSE FALSE
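The logical matrix above can be reduced to the list of matching streets. The following is a minimal sketch, not part of the original answer: surnames is a hypothetical stand-in for d_vector, and fixed = TRUE is an added assumption so that each name is treated as a literal string rather than a regular expression.

```r
streets <- c("Berberichweg", "Otto-Klemperer-Weg", "Feldmeierbogen",
             "Konrad-Adenauer-Platz")
surnames <- c("Berber", "Weg")  # hypothetical stand-in for d_vector

# One column per name, one row per street; both sides are lowercased
# because fixed = TRUE cannot be combined with ignore.case
hits <- sapply(tolower(surnames),
               function(n) grepl(n, tolower(streets), fixed = TRUE))
rownames(hits) <- streets

# Keep every street that matched at least one name
matched_streets <- streets[rowSums(hits) > 0]
matched_streets
```

Keeping the matrix around is useful if you later want to know which name matched which street, not just whether a street matched at all.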


Answer 2:

In principle, your solution works fine with some dummy data:

streets = c("Berberichweg", "Otto-Klemperer-Weg", "Feldmeierbogen", 
            "Konrad-Adenauer-Platz", "anotherThing")
patterns = c("weg", "platz")

unique(grep(paste(patterns, collapse="|"), streets, value=TRUE, ignore.case = TRUE))
# [1] "Berberichweg"          "Otto-Klemperer-Weg"    "Konrad-Adenauer-Platz"

I suspect something is off with d_vector itself. Check class(d_vector), or run dput(d_vector) and paste the output here.
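One possible explanation (an assumption, since the real d_vector is not shown) is that the imported names include an empty or blank entry. After paste(..., collapse = "|") that entry becomes an empty alternative in the pattern, which can match every street name. A rough sketch of how to check for this and clean the vector, using made-up data:

```r
# Hypothetical stand-in for the real d_vector, with typical import problems
d_vector <- c("Abel", "", NA, "Abetz", " ")

str(d_vector)                                      # type and first values
n_missing <- sum(is.na(d_vector))                  # NA entries
n_blank <- sum(!nzchar(trimws(d_vector)), na.rm = TRUE)  # "" or whitespace only

# Convert to character, trim whitespace, drop NA and empty entries
d_clean <- trimws(as.character(d_vector))
d_clean <- unique(d_clean[!is.na(d_clean) & nzchar(d_clean)])

# Illustrative street names, not taken from the real data
streets <- c("Abelstrasse", "Feldmeierbogen")
result <- grep(paste(d_clean, collapse = "|"), streets,
               value = TRUE, ignore.case = TRUE)
result
```

If n_missing or n_blank is greater than zero for the real data, cleaning the vector before building the pattern is worth trying first.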

You can also try using sapply and see if that will work:

matches = sapply(patterns, function(p) grep(p, streets, value=TRUE, ignore.case = TRUE))
# $weg
# [1] "Berberichweg"       "Otto-Klemperer-Weg"
# 
# $platz
# [1] "Konrad-Adenauer-Platz"

unique(unlist(matches))
# [1] "Berberichweg"          "Otto-Klemperer-Weg"    "Konrad-Adenauer-Platz"
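The same per-pattern loop also works for approximate matching with base R's agrepl, which may be closer to the fuzzywuzzy-style matching asked about. This is only a sketch under assumptions: the data is invented, max.distance = 1 (at most one edit) is an arbitrary choice, and it is a guess that the out-of-memory error came from passing one very large collapsed pattern to agrep.

```r
# Invented data; note the misspelling "Adenaur" in the second street
streets <- c("Berberichweg", "Konrad-Adenaur-Platz", "Feldmeierbogen")
surnames <- c("Adenauer", "Klemperer")  # hypothetical stand-in for d_vector

# One small approximate search per name instead of one huge pattern
fuzzy_hits <- vapply(
  surnames,
  function(n) agrepl(n, streets, max.distance = 1, ignore.case = TRUE),
  logical(length(streets))
)
rownames(fuzzy_hits) <- streets

# Streets that approximately contain at least one name
fuzzy_streets <- streets[rowSums(fuzzy_hits) > 0]
fuzzy_streets
```

Short names will produce many false positives under approximate matching, so the allowed distance probably needs tuning against the real data.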