可以将文章内容翻译成中文,广告屏蔽插件可能会导致该功能失效(如失效,请关闭广告屏蔽插件后再试):
问题:
I have two large datasets, one around half a million records and the other one around 70K. These datasets have address. I want to match if any of the address in the smaller data set are present in the large one. As you would imagine address can be written in different ways and in different cases / spellings etc. Apart from this address can be duplicated if written only till the building level. So different flats have the same address. I did some research and figured out the package stringdist that can be used.
I did some work and managed to get the closest match based on distance. However I am not able to return the corresponding columns for which the address match.
Below is a sample dummy data along with code that I have created to explain the situation
library(stringdist)
Address1 <- c("786, GALI NO 5, XYZ","rambo, 45, strret 4, atlast, pqr","23/4, 23RD FLOOR, STREET 2, ABC-E, PQR","45-B, GALI NO5, XYZ","HECTIC, 99 STREET, PQR","786, GALI NO 5, XYZ","rambo, 45, strret 4, atlast, pqr")
Year1 <- c(2001:2007)
Address2 <- c("abc, pqr, xyz","786, GALI NO 4 XYZ","45B, GALI NO 5, XYZ","del, 546, strret2, towards east, pqr","23/4, STREET 2, PQR","abc, pqr, xyz","786, GALI NO 4 XYZ","45B, GALI NO 5, XYZ","del, 546, strret2, towards east, pqr","23/4, STREET 2, PQR")
Year2 <- c(2001:2010)
df1 <- data.table(Address1,Year1)
df2 <- data.table(Address2,Year2)
df2[,unique_id := sprintf("%06d", 1:nrow(df2))]
fn_match = function(str, strVec, n){
strVec[amatch(str, strVec, method = "dl", maxDist=n,useBytes = T)]
}
df1[!is.na(Address1)
, address_match :=
fn_match(Address1, df2$Address2,3)
]
This returns me the closed string match based on distance of 3, however I wanted to also have columns of "Year" and "unique_id" from df2 in df1. This would help me to know with which row of data the string was matched from df2. So finally I want to know for each row in df1 what was the closet match from df2 based on the distance specified and have for the matching rows the specific "Year" and "unique_id" from df2.
I guess there is something to do with merge (left join), but I am not sure how I can merge keeping the duplicates and ensuring that I have same number of rows as in df1 (small data set).
Any kind of solution would help!!
回答1:
You are 90% of the way there...
You say you want to
know with which row of data the string was matched from df2
You just need to understand the code you already have. See ?amatch
:
amatch
returns the position of the closest match of x
in table
. When multiple matches with the same smallest distance metric exist, the first one is returned.
In other words, amatch
gives you the index for the row in df2
(which is your table
) that is the closest match of each address in df1
(which is your x
). You are prematurely wrapping this index by returning the new address instead.
Instead, retrieve either the index itself for lookup or the unique_id (if you are confident that it is truly a unique id) for a left join.
Illustration of both approaches:
library(data.table) # you forgot this in your example
library(stringdist)
df1 <- data.table(Address1 = c("786, GALI NO 5, XYZ","rambo, 45, strret 4, atlast, pqr","23/4, 23RD FLOOR, STREET 2, ABC-E, PQR","45-B, GALI NO5, XYZ","HECTIC, 99 STREET, PQR","786, GALI NO 5, XYZ","rambo, 45, strret 4, atlast, pqr"),
Year1 = 2001:2007) # already a vector, no need to combine
df2 <- data.table(Address2=c("abc, pqr, xyz","786, GALI NO 4 XYZ","45B, GALI NO 5, XYZ","del, 546, strret2, towards east, pqr","23/4, STREET 2, PQR","abc, pqr, xyz","786, GALI NO 4 XYZ","45B, GALI NO 5, XYZ","del, 546, strret2, towards east, pqr","23/4, STREET 2, PQR"),
Year2=2001:2010)
df2[,unique_id := sprintf("%06d", .I)] # use .I, it's neater
# Return position from strVec of closest match to str
match_pos = function(str, strVec, n){
amatch(str, strVec, method = "dl", maxDist=n,useBytes = T) # are you sure you want useBytes = TRUE?
}
# Option 1: use unique_id as a key for left join
df1[!is.na(Address1) | nchar(Address1>0), # I would exclude only on NA_character_ but also empty string, perhaps string of length < 3
unique_id := df2$unique_id[match_pos(Address1, df2$Address2,3)] ]
merge(df1, df2, by='unique_id', all.x=TRUE) # see ?merge for more options
# Option 2: use the row index
df1[!is.na(Address1) | nchar(Address1>0),
df2_pos := match_pos(Address1, df2$Address2,3) ]
df1[!is.na(df2_pos), (c('Address2','Year2','UniqueID')):=df2[df2_pos,.(Address2,Year2,unique_id)] ][]
回答2:
Here is a solution using the fuzzyjoin
package. It uses dplyr
-like syntax and stringdist
as one of the possible types of fuzzy matching.
You can use stringdist
method="dl" (or others that might work better).
To meet your requirement of "ensuring that I have same number of rows as in df1", I used a large max_dist and then used dplyr::group_by
and dplyr::top_n
to get only the best match with minimum distance. This was suggested by dgrtwo, the developer of fuzzyjoin
. (Hopefully it'll be part of the package itself in the future.)
(I also had to make an assumption to take the max year2 in the event of distance ties.)
Code:
library(data.table, quietly = TRUE)
df1 <- data.table(Address1 = c("786, GALI NO 5, XYZ","rambo, 45, strret 4, atlast, pqr","23/4, 23RD FLOOR, STREET 2, ABC-E, PQR","45-B, GALI NO5, XYZ","HECTIC, 99 STREET, PQR","786, GALI NO 5, XYZ","rambo, 45, strret 4, atlast, pqr"),
Year1 = 2001:2007)
df2 <- data.table(Address2=c("abc, pqr, xyz","786, GALI NO 4 XYZ","45B, GALI NO 5, XYZ","del, 546, strret2, towards east, pqr","23/4, STREET 2, PQR","abc, pqr, xyz","786, GALI NO 4 XYZ","45B, GALI NO 5, XYZ","del, 546, strret2, towards east, pqr","23/4, STREET 2, PQR"),
Year2=2001:2010)
df2[,unique_id := sprintf("%06d", .I)]
library(fuzzyjoin, quietly = TRUE); library(dplyr, quietly = TRUE)
stringdist_join(df1, df2,
by = c("Address1" = "Address2"),
mode = "left",
method = "dl",
max_dist = 99,
distance_col = "dist") %>%
group_by(Address1, Year1) %>%
top_n(1, -dist) %>%
top_n(1, Year2)
Result:
# A tibble: 7 x 6
# Groups: Address1, Year1 [7]
Address1 Year1 Address2 Year2 unique_id dist
<chr> <int> <chr> <int> <chr> <dbl>
1 786, GALI NO 5, XYZ 2001 786, GALI NO 4 XYZ 2007 000007 2
2 rambo, 45, strret 4, atlast, pqr 2002 del, 546, strret2, towards east, pqr 2009 000009 17
3 23/4, 23RD FLOOR, STREET 2, ABC-E, PQR 2003 23/4, STREET 2, PQR 2010 000010 19
4 45-B, GALI NO5, XYZ 2004 45B, GALI NO 5, XYZ 2008 000008 2
5 HECTIC, 99 STREET, PQR 2005 23/4, STREET 2, PQR 2010 000010 11
6 786, GALI NO 5, XYZ 2006 786, GALI NO 4 XYZ 2007 000007 2
7 rambo, 45, strret 4, atlast, pqr 2007 del, 546, strret2, towards east, pqr 2009 000009 17