While working with predict.knn3, I bumped into an interesting data-wrangling use case. I didn't know I could call predict with the argument type="class" to get the predicted levels, which is exactly what I needed. So I worked out a somewhat involved solution that selects, from each row of predict()'s result, the level with the maximum probability. The difficulty was that the names function does not work in "vectorized" form on a matrix, only on vectors.
To illustrate the use case before and after finding out about the type="class" argument:
rm(list = ls())
library(caret)
library(tidyverse)
library(dslabs)
data("tissue_gene_expression")
x <- tissue_gene_expression$x
y <- tissue_gene_expression$y
set.seed(1)
test_index <- createDataPartition(y, times = 1, p = 0.5, list = FALSE)
test_x <- x[test_index,]
test_y <- y[test_index]
train_x <- x[-test_index,]
train_y <- y[-test_index]
# fit the model, predict without type="class" and use sapply to build the y_hat levels
fit <- knn3(train_x, train_y, k = 1)
pred <- predict(fit, test_x)
y_hat <- sapply(1:nrow(pred), function(i) as.factor(names(pred[i,which.max(pred[i,])])))
# compare it to the solution using predict with type="class"
identical(y_hat, as.factor(predict(fit, test_x, type="class")))
[1] TRUE
To illustrate the issue: applied to a vector of named numeric elements, the names function produces the desired result, whereas applied to a selection spanning several rows of the matrix it returns NULL:
names(pred[1, which.max(pred[1,])])
[1] "cerebellum"
names(pred[1:2, which.max(pred[1:2,])])
NULL
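As far as I can tell, the NULL comes from how a matrix stores its labels. Here is a minimal sketch on a made-up 2x2 matrix (m and its values are hypothetical, standing in for pred), assuming pred has column names but no row names:

```r
# Toy probability matrix standing in for predict(fit, test_x); values are made up
m <- matrix(c(0.1, 0.9,
              0.7, 0.3),
            nrow = 2, byrow = TRUE,
            dimnames = list(NULL, c("a", "b")))

# A single row drops to a named numeric vector, so names() works
names(m[1, ])

# A matrix keeps its labels in dimnames, so names() on it is NULL
names(m)
colnames(m)

# which.max() on several rows returns one linear (column-major) index,
# not one index per row
which.max(m[1:2, ])

# Selecting one column across several rows yields a vector labelled by
# row names, and there are none here, so names() is NULL again
names(m[1:2, 1])
```

So with more than one row, which.max collapses everything to a single position, and the resulting column slice has no names to return.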
Suppose I were unaware of the convenient type="class" argument of predict.knn3: is there a simpler way, using tidyverse and dplyr, to replace this sapply call? Or any other simpler way to implement this use case?
y_hat <- sapply(1:nrow(pred), function(i) as.factor(names(pred[i, which.max(pred[i,])])))
I'm after something like the following, but it doesn't work:
as_tibble(predict(fit, test_x)) %>% mutate(y_hat=names(which.max(.[row_number(),])))
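For reference, one base-R direction that seems to avoid the row loop is max.col, which returns the column index of the maximum for every row at once. This is only a sketch on a made-up matrix (m, idx and y_hat_toy are hypothetical names, not the real pred), and I am assuming ties.method = "first" is the closest match to which.max picking the first maximum:

```r
# Toy probability matrix standing in for predict(fit, test_x); values are made up
m <- matrix(c(0.1, 0.9, 0.0,
              0.7, 0.3, 0.0,
              0.2, 0.2, 0.6),
            nrow = 3, byrow = TRUE,
            dimnames = list(NULL, c("a", "b", "c")))

# Column index of the row-wise maximum, one entry per row
idx <- max.col(m, ties.method = "first")

# Map the indices back to column labels and build a factor
y_hat_toy <- factor(colnames(m)[idx], levels = colnames(m))

# Cross-check against the row-by-row which.max approach
row_wise <- unname(apply(m, 1, which.max))
all(idx == row_wise)
```

Note that max.col treats nearly equal values as ties within a small tolerance, so it may not agree with which.max in every edge case.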