I am running the k-means algorithm in R on the Heart Disease UCI dataset. I expect to get 2 clusters of sizes 138 and 165, matching the two classes of the target column in the dataset.
Steps:
- Store dataset in a data frame:
df <- read.csv(".../heart.csv",fileEncoding = "UTF-8-BOM")
- Extract the features:
features = subset(df, select = -target)
- Normalize it:
normalize <- function(x) {
  return((x - min(x)) / (max(x) - min(x)))
}
features <- data.frame(sapply(features, normalize))
- Run the algorithm:
set.seed(0)
cluster = kmeans(features, 2)
cluster$size
Output:
[1] 99 204
Why do the cluster sizes (99 and 204) differ from the class sizes in the dataset (138 and 165)?
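To see how the clusters relate to the labels, rather than only comparing sizes, the cluster assignments can be cross-tabulated against the target column. Below is a minimal sketch of that check on toy data. The data frame `toy`, its columns `a` and `b`, and the values in it are hypothetical stand-ins, not taken from heart.csv; `nstart = 25` is an assumption added to make the result less dependent on the random initial centers.

```r
# Toy stand-in for the real data (hypothetical values, NOT from heart.csv)
set.seed(0)
toy <- data.frame(
  a = c(rnorm(20, mean = 0), rnorm(30, mean = 3)),
  b = c(rnorm(20, mean = 0), rnorm(30, mean = 3)),
  target = rep(c(0, 1), times = c(20, 30))
)

# Same min-max scaling as in the steps above
normalize <- function(x) (x - min(x)) / (max(x) - min(x))
toy_features <- data.frame(sapply(subset(toy, select = -target), normalize))

# k-means never sees the labels; nstart tries several random initializations
toy_cluster <- kmeans(toy_features, centers = 2, nstart = 25)

# Rows: cluster id assigned by k-means, columns: label from the data
confusion <- table(cluster = toy_cluster$cluster, target = toy$target)
print(confusion)
```

On the real data the equivalent call would be `table(cluster$cluster, df$target)`. The cluster ids 1 and 2 are arbitrary, so cluster 1 need not correspond to target 0.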