I have two large datasets: one with around half a million records and the other with around 70K. Both contain addresses, and I want to check whether any address in the smaller set is present in the larger one. As you can imagine, addresses can be written in different ways and in different cases, so it is quite annoying to see a non-match where there should have been a match, and a match where there should not have been one. I did some research and found the stringdist package. However, I am stuck, I feel I am not using it to its fullest capabilities, and some suggestions would help.
Below is some dummy sample data, along with the code I have written to illustrate the situation:
library(data.table)
library(stringdist)

Address1 <- c("786, GALI NO 5, XYZ",
              "rambo, 45, strret 4, atlast, pqr",
              "23/4, 23RD FLOOR, STREET 2, ABC-E, PQR",
              "45-B, GALI NO5, XYZ",
              "HECTIC, 99 STREET, PQR")
df1 <- data.table(Address1)

Address2 <- c("abc, pqr, xyz",
              "786, GALI NO 4 XYZ",
              "45B, GALI NO 5, XYZ",
              "del, 546, strret2, towards east, pqr",
              "23/4, STREET 2, PQR")
df2 <- data.table(Address2)

# strip everything that is not a letter or digit to build a matching key
df1[, key_match := gsub("[^[:alnum:]]", "", Address1)]
df2[, key_match := gsub("[^[:alnum:]]", "", Address2)]

# return, for each string, its nearest neighbour in strVec within distance n
fn_match <- function(str, strVec, n) {
  strVec[amatch(str, strVec, method = "dl", maxDist = n, useBytes = TRUE)]
}

df1[!is.na(key_match), address_match := fn_match(key_match, df2$key_match, 3)]
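One small point worth checking before anything else: with `useBytes = TRUE`, the `"dl"` distance is case-sensitive, so `"XYZ"` vs `"xyz"` already costs three edits of the `maxDist = 3` budget. Lower-casing the keys during cleaning (my tweak, not in the code above) spends the budget on real differences instead. A self-contained sketch on a cut-down version of the dummy data:

```r
library(data.table)
library(stringdist)

df1 <- data.table(Address1 = c("45-B, GALI NO5, XYZ", "HECTIC, 99 STREET, PQR"))
df2 <- data.table(Address2 = c("45B, GALI NO 5, xyz", "23/4, STREET 2, PQR"))

# strip punctuation/spaces AND fold case, so edit distance measures
# only genuine differences between addresses
df1[, key_match := tolower(gsub("[^[:alnum:]]", "", Address1))]
df2[, key_match := tolower(gsub("[^[:alnum:]]", "", Address2))]

# nearest key in df2 within 3 edits; NA when nothing is close enough
df1[, address_match := df2$Address2[amatch(key_match, df2$key_match,
                                           method = "dl", maxDist = 3)]]
```

Here the mixed-case `"45B, GALI NO 5, xyz"` still matches at distance 0, and the unrelated `"HECTIC, 99 STREET, PQR"` correctly comes back `NA`.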
If you look at the output, it gives me the matches under address_match in df1. But when I apply the same code to my main data, it has now been running for 30 hours, even though I converted to data.table. I am not sure how to speed this up.
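A likely reason for the long runtime (my observation, not from the post): `amatch()` compares every key in the 500K set against every key in the 70K set, roughly 3.5 × 10^10 distance computations. The call is already vectorised over its first argument, so data.table is not the bottleneck; but `stringdist` functions accept an `nthread` argument, and keeping keys short and `maxDist` small also prunes work. A minimal sketch with hypothetical cleaned keys:

```r
library(stringdist)

# hypothetical pre-cleaned keys (lower-cased, punctuation stripped)
keys_small <- c("786galino5xyz", "45bgalino5xyz")
keys_large <- c("786galino4xyz", "45bgalino5xyz", "abcpqrxyz")

# amatch() is vectorised and multi-threaded; nthread defaults to all
# available cores, shown explicitly here for illustration
idx     <- amatch(keys_small, keys_large, method = "dl",
                  maxDist = 3, nthread = 2)
matched <- keys_large[idx]
```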
Reading further, I came across stringdist matrices. These seem more helpful: I could split each address on whitespace, check for the presence of each word in each address of the other list, and summarise the matches by maximum overlap. However, I am not very good at loops. How do I loop through each address from the smaller file, word by word, check it against each individual address in the larger file, and build a matrix of matches? Any help is appreciated!
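On the matrix idea: `stringdist::stringdistmatrix()` computes all pairwise distances in one vectorised call, so no explicit loop over addresses is needed, rows index one vector and columns the other. A sketch on a subset of the dummy data, picking for each small-set address its nearest large-set address and the distance (the cutoff logic is left out here; this only shows the matrix mechanics):

```r
library(stringdist)

Address1 <- c("786, GALI NO 5, XYZ", "45-B, GALI NO5, XYZ")
Address2 <- c("786, GALI NO 4 XYZ", "45B, GALI NO 5, XYZ", "abc, pqr, xyz")

# rows = Address1, columns = Address2; one "dl" distance per pair
d <- stringdistmatrix(Address1, Address2, method = "dl")

best      <- apply(d, 1, which.min)  # column index of nearest match per row
best_dist <- apply(d, 1, min)        # the corresponding distance

result <- data.frame(src   = Address1,
                     match = Address2[best],
                     dist  = best_dist)
```

The `d` matrix is also where a word-level variant would plug in: split each address with `strsplit()` and compute a matrix per word list instead of per whole string.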
I have a solution that does not require data.table, and if the set is huge it can be run in parallel with package:parallel:
library(jsonlite)   # rbind.pages()
library(dplyr)

rbind.pages(
  parallel::mclapply(Address1, function(i) {
    data.frame(
      src   = i,
      # closest Address2 entry by generalised Levenshtein distance
      match = Address2[which.min(adist(i, Address2))]
    )
  }, mc.cores = parallel::detectCores() - 2)) %>%
  select(`src (Address1)` = 1, `match (Address2)` = 2)
Which then gives the output solution:
src (Address1) match (Address2)
1 786, GALI NO 5, XYZ 786, GALI NO 4 XYZ
2 rambo, 45, strret 4, atlast, pqr del, 546, strret2, towards east, pqr
3 23/4, 23RD FLOOR, STREET 2, ABC-E, PQR 23/4, STREET 2, PQR
4 45-B, GALI NO5, XYZ 45B, GALI NO 5, XYZ
5 HECTIC, 99 STREET, PQR 23/4, STREET 2, PQR
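One caveat visible in row 5: `which.min()` always returns some match, even when nothing is close ("HECTIC, 99 STREET, PQR" lands on an unrelated address). A hedged variant (my addition, not part of the answer above) that keeps the distance and returns `NA` beyond a cutoff:

```r
# self-contained sketch; the cutoff of 5 is an arbitrary illustration
Address1 <- c("45-B, GALI NO5, XYZ", "HECTIC, 99 STREET, PQR")
Address2 <- c("45B, GALI NO 5, XYZ", "23/4, STREET 2, PQR")

best_with_cutoff <- function(x, candidates, cutoff = 5) {
  d <- adist(x, candidates)        # 1 x length(candidates) matrix
  j <- which.min(d)
  if (d[j] <= cutoff) candidates[j] else NA_character_
}

matches <- vapply(Address1, best_with_cutoff, character(1),
                  candidates = Address2)
```

With this, the near-duplicate still matches while the no-real-match row comes back `NA` instead of a misleading pairing.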
Edit:
I realized this may not be very helpful without seeing the distance computations, so that you can tweak them for your needs; so I replicated the data into larger random sets, then amended the function to show the string-distance components and the processing time:
rand_addy_one <- rep(Address1, 1000)[sample(1:1000, 1000)]
rand_addy_two <- rep(Address2, 3000)[sample(1:3000, 3000)]
system.time({
test_one <<- rbind.pages(parallel::mclapply(rand_addy_one, function(i) {
calc <- as.data.frame(drop(attr(adist(i, rand_addy_two, counts = TRUE), "counts")))
calc$totals <- (rowSums(calc))
calc %>% mutate(src = i, target = rand_addy_two) %>%
filter(totals == min(totals))
}, mc.cores = parallel::detectCores() - 2)) %>%
select(`source Address1` = src, `target Address2(matched)` = target,
insertions = ins, deletions = del, substitutions = sub,
total_approx_dist = totals)
})
user system elapsed
24.940 1.480 3.384
> nrow(test_one)
[1] 600000
Now to reverse and apply the larger set to the smaller:
system.time({
test_two <<- rbind.pages(parallel::mclapply(rand_addy_two, function(i) {
calc <- as.data.frame(drop(attr(adist(i, rand_addy_one, counts = TRUE), "counts")))
calc$totals <- (rowSums(calc))
calc %>% mutate(src = i, target = rand_addy_one) %>%
filter(totals == min(totals))
}, mc.cores = parallel::detectCores() - 2)) %>%
select(`source Address2` = src, `target Address1(matched)` = target,
insertions = ins, deletions = del, substitutions = sub,
total_approx_dist = totals)
})
user system elapsed
27.512 1.280 4.077
nrow(test_two)
[1] 720000