I am looking for an r function to create clusters in my dataset based on two variables (hope "cluster" is the right name for what I want to do) . Every two observations with the same value for variable_1 OR variable_2, should be in the same cluster. In the brief exemple that follows, I aggregating dataframe df based on variable_1 and variable_2.
df <- data.frame(variable_1=c("a","a","b","b","c","c","d","d","e","e"),variable_2=c("g1","g2","g1","g3","g2","g4","g4","g6","g7","g8"),value=rnorm(10))
df$clusters <- some_function_to_create_clusters(df[,c("variable_1","variable_2")])
The result should look like:
df$clusters <- c("clu1","clu1","clu1","clu1","clu1","clu1","clu1","clu1","clu2","clu2")
df
Notice that the first cluster contains everyone with variable_1 equals "a", "b", "c", or "d": "a" and "b" are merged together because they share "g1" (lines 1 and 3); "a" and "c" are merged because they share "g2" (lines 2 and 5); and "c" and "d" are merged because they share "g4" (lines 6 and 7). Finally, in the last cluster there are only observations with variable_1=="e", because they don't share variable_2 with anyone.
Just to clarify what I intend to do, I will explain my problem set a little better. I’m pairing counties with nearby Touristic Attractions. Different counties are surround by different Touristic Attraction (TA), and there are many TA around the same county. But this "touristic-clusters" of counties and TAs are sparsely spread in the country. Notice that some far away counties might be inside the same cluster due to a “chain” effect of county-Touristic Attraction connections. So I want to find those “clusters” based on the id of the county and of the Touristic Attraction.
This seems simple, but I can't figure it out how to implement.
Thanks a lot