Channel: Active questions tagged r - Stack Overflow

Random Forest Appears to be Overfitting


I am building what is known as an 'Expected Goals' model for football. The metric measures shot quality by assigning each shot a probability of being converted. To build the model I am using a random forest classifier. For evaluation I am only interested in the accuracy of the probabilities rather than in strictly classifying shots, so I use the predicted probabilities to calculate the mean squared error (MSE), where goal = 1 and no goal = 0. The MSE on the test set, along with two benchmarks, is as follows:
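For reference, an MSE computed this way on 0/1 outcomes is also known as the Brier score. Below is a minimal sketch of the calculation on made-up vectors. The names `brier_mse`, `goal`, and `p_hat` are hypothetical, and the naive baseline is assumed here to predict the overall conversion rate for every shot, which the post itself does not specify.

```r
# Mean squared error between 0/1 outcomes and predicted probabilities
# (the Brier score). Hypothetical helper, not taken from the post.
brier_mse <- function(outcome, prob) {
  stopifnot(length(outcome) == length(prob))
  mean((outcome - prob)^2)
}

# Toy data for illustration only: 1 = goal, 0 = no goal.
goal  <- c(0, 0, 1, 0, 1, 0, 0, 0)
p_hat <- c(0.05, 0.10, 0.60, 0.20, 0.30, 0.05, 0.15, 0.10)

model_mse <- brier_mse(goal, p_hat)

# Assumed naive baseline: predict the overall conversion rate for every shot.
base_rate <- mean(goal)
naive_mse <- brier_mse(goal, rep(base_rate, length(goal)))

print(paste("model_mse:", model_mse))
print(paste("naive_baseline_mse:", naive_mse))
```

With a constant prediction equal to the base rate p, this MSE reduces to p * (1 - p), which is why a baseline of that kind sits at a fixed level that any useful model should beat.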

[1] "test.rf_mse: 0.0856633533734135"
[1] "comparison_model_mse: 0.0820007160001345"
[1] "naive_baseline_mse: 0.0912291249164997"

Note that the comparison model is expected to be better than mine and the naive baseline worse, so on the test set the model looks to be doing okay. However, when I apply the same steps to the training set I get the following:

[1] "test.rf_mse: 0.0112001023587005"
[1] "comparison_model_mse: 0.0722459417565357"
[1] "naive_baseline_mse: 0.0858344459279039"

Here the MSE falls to an unrealistically low level. Doesn't this mean that my model is overfitting? I understand that the idea that random forests can't overfit isn't strictly correct, since all models can overfit to some extent, but for the model to be overfitting by this much I must be misunderstanding something.

#FINAL MODEL FOR SHOTS DATASET
library(caret)         # createDataPartition()
library(randomForest)  # randomForest()
set.seed(5555)
trainIndex <- createDataPartition(shots$goal.miss, p = .75, list = FALSE)
train_set <- shots[ trainIndex,]
test_set  <- shots[-trainIndex,]

set.seed(1000)
rf.shots <- randomForest(as.factor(goal.miss) ~ ., data=train_set, ntree=500, mtry=5)

###TEST RESULTS
pred <- predict(rf.shots, test_set, type="prob")
test_set$predictions.test <- pred[,2]
mean(((test_set$goal.miss - test_set$predictions.test)^2))

###TRAIN RESULTS
# Score the same model on the rows it was trained on
pred <- predict(rf.shots, train_set, type="prob")
train_set$predictions.train <- pred[,2]
mean(((train_set$goal.miss - train_set$predictions.train)^2))
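One way to sanity-check a training-set figure like the one above is to compare it with the forest's out-of-bag (OOB) predictions. In the randomForest package, calling `predict()` without `newdata` returns OOB predictions, whereas passing the training rows as `newdata` lets every tree vote, including trees that saw that row during fitting. The sketch below uses simulated data, not the shots dataset, and the names `sim`, `rf`, `in_sample`, and `oob` are made up for illustration.

```r
library(randomForest)

set.seed(1)

# Simulated data for illustration only: a weak signal plus a noise feature.
n <- 500
sim <- data.frame(x1 = rnorm(n), x2 = rnorm(n))
sim$y <- rbinom(n, 1, plogis(-2 + 0.8 * sim$x1))

rf <- randomForest(as.factor(y) ~ x1 + x2, data = sim, ntree = 200)

# Training rows passed as newdata: all trees vote on every row,
# including the trees that were grown on that row.
in_sample <- predict(rf, sim, type = "prob")[, "1"]

# No newdata: each row is scored only by trees that did not see it (OOB).
oob <- predict(rf, type = "prob")[, "1"]

mse_in_sample <- mean((sim$y - in_sample)^2)
mse_oob       <- mean((sim$y - oob)^2)

print(paste("mse_in_sample:", mse_in_sample))
print(paste("mse_oob:", mse_oob))
```

If the OOB figure lands near the test-set MSE while the in-sample figure is far lower, that would suggest the training-set number mostly reflects how it was scored rather than how the model behaves on new shots.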
