I'm new to R, and I haven't seen this discussed anywhere, so I'm only 95% confident of my results.
Reading the agnes() documentation, I see that in the input "matrix or data frame, each row corresponds to an observation, and each column corresponds to a variable. All variables must be numeric."
I have been working with a large dataset containing survey responses from 144 countries. I created a data frame with a Country name column and several columns holding the averages of normalized numeric variables (range 0-1) for each country (one row per country). I used that data frame as the input to agnes(), and I noticed that the resulting dendrogram showed the countries in alphabetical order.
Code:
Calculate the average value for each numeric variable for each country.
wm <- aggregate(wd2[!names(wd2) %in% c("Country")], list(Country=wd2$Country), mean)
Create dendrogram.
w_dendc <- agnes(wm, method = "complete", diss = FALSE, stand = FALSE)
pltree(w_dendc, labels = wm$Country)
This made no sense, so I scoured the documentation and found the quote above. It certainly looked like agnes() converted the country names into numeric values 1-144. Because that column had by far the largest range (1-144, versus 0-1 for every other variable), it dominated the distance calculation and produced the alphabetized result.
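Here is a minimal sketch of the mechanism I suspect, using made-up toy data (the `toy` data frame and its values are hypothetical, not my survey data). It relies only on base R's data.matrix(), which replaces a factor column with its integer codes. Whether agnes() takes exactly this path internally is my assumption, but it would explain the 1-144 values in alphabetical order.

```r
# Toy data (hypothetical): a name column plus two variables scaled to 0-1.
toy <- data.frame(
  Country = c("Chile", "Austria", "Brazil", "Denmark"),
  x1 = c(0.10, 0.90, 0.15, 0.85),
  x2 = c(0.20, 0.80, 0.25, 0.95),
  stringsAsFactors = TRUE
)

# data.matrix() replaces a factor by its integer codes. Factor levels are
# sorted alphabetically by default, so the codes follow alphabetical order.
m <- data.matrix(toy)
print(m[, "Country"])

# Euclidean distances with and without the coded name column.
d_with    <- as.matrix(dist(m))
d_without <- as.matrix(dist(m[, c("x1", "x2")]))

# Chile (row 1) and Brazil (row 3) have nearly identical x1/x2 values,
# yet the coded name column alone adds a difference of 1 between them.
print(d_with[1, 3])
print(d_without[1, 3])
```

With 144 countries the coded column spans 1-144, so its contribution to each distance would be far larger than anything the 0-1 variables can add.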
So I tried again, leaving out the Country column and simply using Country as the label source in pltree(). When I did that, the dendrogram had some very interesting features, and the countries seemed to be grouped by some combination of geographic proximity and educational/economic attainment.
Create dendrogram, leaving out the Country column (first column).
w_dendc <- agnes(wm[, -1], method = "complete", diss = FALSE, stand = FALSE)
pltree(w_dendc, labels = wm$Country)
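A possible variation on the same idea, sketched with hypothetical toy data: move the names into the row names so that only numeric columns reach agnes(). As far as I can tell from the documentation, the fitted object then carries the labels in its order.lab component, so pltree() should not need a separate labels argument. Treat this as a sketch, not a verified recipe.

```r
library(cluster)  # provides agnes() and pltree()

# Toy data (hypothetical), standing in for the aggregated data frame.
toy <- data.frame(
  Country = c("Chile", "Austria", "Brazil", "Denmark"),
  x1 = c(0.10, 0.90, 0.15, 0.85),
  x2 = c(0.20, 0.80, 0.25, 0.95)
)

# Keep only the numeric columns and carry the names as row names.
num <- toy[, -1]
rownames(num) <- toy$Country

ag <- agnes(num, method = "complete", diss = FALSE, stand = FALSE)

# The clustering now depends on x1/x2 only; labels come from the row names.
print(ag$order.lab)
pltree(ag)
```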
I'm simply asking for validation that my interpretation is correct: that agnes() converts a factor (or character) column like Country into numeric codes and includes it in the distance calculation.
Thanks!