I have a list of addresses. They were entered by various users, so there are a lot of differences in the way the same address is written. For example,
"andheri at weh pump house", "andheri pump house", "andheri pump house(mt)", "weh andheri pump house", "weh andheri pump house et", "weh, nr. pump house"
The above vector has 6 addresses, and almost all of them refer to the same place. I am trying to find the matches between these addresses so that I can club them together and recode them.
I have tried using agrep and the stringdist package. With agrep I am not sure if I should use each address as a pattern and match it against the rest. With the stringdist package I did the following:
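library(stringdist)
nsrpatt <- df$Address
# Split the addresses on commas into individual pieces
x <- scan(what=character(), text = nsrpatt, sep=",")
# Drop pieces that are empty after trimming whitespace
x <- x[trimws(x)!= ""]
# Group by soundex code and recode each piece to the first member of its group
y <- ave(x, phonetic(x), FUN = function(.x) .x[1])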
The above gives me the following warning:
In phonetic(x) : soundex encountered 111 non-printable ASCII or non-ASCII
characters.
Not sure if I should remove those elements from the character vector or convert them to some other format.
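For illustration, a minimal sketch of the "convert them" option in base R. The sample vector is hypothetical, and replacing every character outside printable ASCII with a space is only one possible choice (dropping the affected elements is another). The call to phonetic() at the end runs only if stringdist is installed.

```r
# Hypothetical sample; "\u00e9" and "\u00a0" stand in for the offending characters
x <- c("andheri pump house", "caf\u00e9 road", "weh andheri\u00a0pump house")

# Flag elements that contain anything outside printable ASCII (space to tilde)
has_non_ascii <- grepl("[^\\x20-\\x7E]", x, perl = TRUE)

# Replace each offending character with a space, then squeeze and trim spaces
x_clean <- gsub("[^\\x20-\\x7E]", " ", x, perl = TRUE)
x_clean <- trimws(gsub(" +", " ", x_clean))

# Soundex codes on the cleaned vector (skipped when stringdist is not installed)
if (requireNamespace("stringdist", quietly = TRUE)) {
  codes <- stringdist::phonetic(x_clean)
}
```

Inspecting x[has_non_ascii] first shows whether the flagged characters are just stray whitespace or accented letters that matter for matching.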
With agrep I tried:
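# Fuzzy-match every address against the full column, keeping all results
npat <- vector("list", length(nsrpatt))
for (i in seq_along(nsrpatt)) {
  npat[[i]] <- agrep(nsrpatt[i], df$Address, max.distance = 1, value = TRUE)
}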
The character vector has around 25,000 elements, and the loop keeps running and stalls the machine.
How do I efficiently find the closest match for each address?
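To make the goal concrete, here is a rough sketch of the kind of grouping I am after, using only base R on the six example addresses. The normalise() helper, the average-linkage clustering, and the cut height of 8 are all my own assumptions for illustration, not a known-good recipe. Normalising and de-duplicating first should shrink the number of strings to compare, but a full distance matrix is still quadratic in the number of unique keys, so this may not scale to 25,000 addresses as written.

```r
addresses <- c("andheri at weh pump house", "andheri pump house",
               "andheri pump house(mt)", "weh andheri pump house",
               "weh andheri pump house et", "weh, nr. pump house")

# Hypothetical helper: lower-case, turn punctuation into spaces,
# then sort the unique tokens so that word order no longer matters
normalise <- function(s) {
  s <- tolower(gsub("[[:punct:]]+", " ", s))
  tokens <- strsplit(trimws(s), "[[:space:]]+")
  vapply(tokens, function(tk) paste(sort(unique(tk)), collapse = " "),
         character(1))
}

key  <- normalise(addresses)
uniq <- unique(key)            # compare each distinct key only once

# Pairwise generalized Levenshtein distances between the distinct keys
d <- adist(uniq)

# Hierarchical clustering; the cut height is an arbitrary guess to tune
hc  <- hclust(as.dist(d), method = "average")
grp <- cutree(hc, h = 8)

# Map the cluster id of each distinct key back to the original addresses
result <- data.frame(address = addresses, group = grp[match(key, uniq)])
```

Each group in result could then be recoded to a single representative address, for example the most frequent spelling within the group.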