Channel: Active questions tagged r - Stack Overflow

R Finding elements matching with each other within a vector


I have a list of addresses. They were entered by various users, so the same address is often written in many different ways. For example,

"andheri at weh pump house", "andheri pump house","andheri pump house(mt)","weh andheri pump house","weh andheri pump house et","weh, nr. pump house"

The above vector has 6 addresses, and almost all of them refer to the same place. I am trying to find the matches between these addresses so that I can group them together and recode them.

I have tried using agrep and the stringdist package. With agrep, I am not sure whether I should use each address as a pattern and match it against the rest. With the stringdist package, I did the following:

library(stringdist)
nsrpatt <- df$Address
x <- scan(what=character(), text = nsrpatt, sep=",")
x <- x[trimws(x)!= ""]
y <- ave(x, phonetic(x), FUN = function(.x) .x[1])

The above gives me the following warning:

In phonetic(x) : soundex encountered 111 non-printable ASCII or non-ASCII
  characters. 

I am not sure whether I should remove those elements from the character vector or convert them to some other encoding.
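To make the two options concrete, here is a minimal base-R sketch of both: dropping the affected elements, or stripping the offending characters. The sample vector is hypothetical (a non-breaking space and an en dash stand in for whatever the real data contains), and the names `has_non_ascii`, `x_dropped` and `x_clean` are illustrative only.

```r
# Hypothetical sample: two clean addresses and two with hidden non-ASCII characters
x <- c("andheri pump house",
       "weh\u00a0andheri pump house",   # non-breaking space
       "andheri pump house\u2013mt",    # en dash
       "weh, nr. pump house")

# Flag elements containing anything outside printable ASCII (space through tilde)
has_non_ascii <- grepl("[^ -~]", x, perl = TRUE)

# Option 1: drop the affected elements entirely
x_dropped <- x[!has_non_ascii]

# Option 2: keep them, replacing each offending character with a space,
# then collapsing repeated spaces
x_clean <- gsub("[^ -~]", " ", x, perl = TRUE)
x_clean <- trimws(gsub(" +", " ", x_clean))
```

Option 2 keeps every address in the vector, which matters if the positions have to line up with the rows of `df` afterwards.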

With agrep I tried:

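matches <- vector("list", length(nsrpatt))
for (i in seq_along(nsrpatt)) {
  # store each address's approximate matches instead of overwriting npat
  matches[[i]] <- agrep(nsrpatt[i], df$Address, max.distance = 1, value = TRUE)
}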

The character vector has around 25,000 elements, and the loop keeps running until it stalls the machine.

How do I efficiently find the closest match for each address?
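For illustration, here is a rough sketch of the kind of result being asked for, using only base R (`utils::adist` for Levenshtein distance and `stats::hclust` for grouping) on the six sample addresses. This is one possible approach under stated assumptions, not a tested solution: the cut height `h = 8` is an arbitrary value picked for the example, and a full pairwise distance matrix may be too large for 25,000 distinct addresses, so de-duplicating first (as done here with `unique`) is assumed to help.

```r
addr <- c("andheri at weh pump house", "andheri pump house",
          "andheri pump house(mt)", "weh andheri pump house",
          "weh andheri pump house et", "weh, nr. pump house")

# Normalise and de-duplicate before comparing, to shrink the problem
u <- unique(tolower(trimws(addr)))

# Pairwise Levenshtein (edit) distances between all distinct addresses
d <- utils::adist(u)

# Closest other address for each one: mask the diagonal so an
# address cannot match itself
d_other <- d
diag(d_other) <- NA
nearest <- u[apply(d_other, 1, which.min)]
closest <- data.frame(address = u, closest = nearest,
                      stringsAsFactors = FALSE)

# Group similar addresses; h = 8 is an assumed, arbitrary threshold
groups <- cutree(hclust(as.dist(d), method = "average"), h = 8)
```

`closest` pairs every address with its nearest neighbour, and `groups` gives a group id per address that could be used for recoding once a sensible threshold has been chosen for the real data.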

