I see a lot of R code where a full dataset is first split into a training set and a test set:
library(caret)
# load the iris dataset
data(iris)
# define an 80%/20% train/test split of the dataset
trainIndex <- createDataPartition(iris$Species, p=0.8, list=FALSE)
data_train <- iris[trainIndex,]
data_test <- iris[-trainIndex,]
Next, a resampling method is defined, such as repeated k-fold cross-validation:
train_control <- trainControl(method="repeatedcv", number=10, repeats=3)
Then a model is trained using the training set:
my_model <- train(Species~., data=data_train, trControl=train_control, method="nb")
Finally, predictions are made on the test set:
pred_results <- predict(my_model, newdata=data_test)
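To show the two numbers I am comparing, here is a minimal, self-contained sketch of the whole pipeline above. It assumes the caret package is installed, along with klaR (which caret needs for method="nb"); the seed and the names cv_accuracy and test_accuracy are my own additions for illustration.

```r
library(caret)
data(iris)
set.seed(123)  # arbitrary seed, only for reproducibility

# 80%/20% stratified split
trainIndex <- createDataPartition(iris$Species, p = 0.8, list = FALSE)
data_train <- iris[trainIndex, ]
data_test <- iris[-trainIndex, ]

# repeated 10-fold cross-validation, run on the training set only
train_control <- trainControl(method = "repeatedcv", number = 10, repeats = 3)
my_model <- train(Species ~ ., data = data_train,
                  trControl = train_control, method = "nb")

# accuracy estimated by cross-validation (best tuning setting)
cv_accuracy <- max(my_model$results$Accuracy)

# accuracy on the 20% of rows that train() never saw
pred_results <- predict(my_model, newdata = data_test)
test_accuracy <- mean(pred_results == data_test$Species)

print(c(cv = cv_accuracy, test = test_accuracy))
```

Both values estimate how well the model generalizes, which is why the second one looks redundant to me.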
When specifically using (repeated) k-fold cross-validation, it seems to me that the training set (k-1 folds) and the test set (the 1 remaining fold) are already inherently defined.
In that case, why add an extra layer of partitioning by first splitting the full dataset into 80% training and 20% test sets? Is it necessary?
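To make concrete what I mean by the folds already defining a split, here is a small sketch using caret's createFolds. I am assuming it is a fair stand-in for the partitioning that trainControl performs internally; the seed and the name holdout_1 are just for illustration.

```r
library(caret)
data(iris)
set.seed(42)  # arbitrary seed, only for reproducibility

# With returnTrain = TRUE, each list element holds the row indices
# used for fitting in that iteration, i.e. the k-1 "training" folds.
folds <- createFolds(iris$Species, k = 10, returnTrain = TRUE)

# The rows left out of the first element form the held-out fold.
holdout_1 <- setdiff(seq_len(nrow(iris)), folds[[1]])

length(folds[[1]])  # number of rows used to fit in iteration 1
length(holdout_1)   # number of rows held out in iteration 1
```

Every row is held out exactly once across the 10 iterations, so each iteration already has its own training and test portion.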