I'm a beginner with machine learning (and also R). I've figured out how to run some basic linear regression, elastic net, and random forest models in R and have gotten some decent results for a regression project (with a continuous dependent variable) that I'm working on.
I've been trying to learn how to use the gradient boosting algorithm and, in particular, the xgboost() function. My results are much worse here, though, and I'm not sure why.
I was hoping someone could take a look at my code and see if there are any glaring errors.
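For context, data is a data frame whose continuous outcome column is named y, and split is the row index where I cut it into training and test sets. In case it helps to have something runnable, here is a hypothetical stand-in setup (using mtcars with mpg playing the role of y, purely for illustration; my real dataset is much larger):
# Hypothetical stand-in so the code below is reproducible; my real data differs
data <- mtcars
names(data)[names(data) == "mpg"] <- "y"  # treat mpg as the outcome y
split <- floor(0.8 * nrow(data))          # row index where train ends and test begins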
# Load the packages used below
library(dplyr)    # for select(), mutate(), summarize(), and %>%
library(ggplot2)  # for the plot at the end
library(xgboost)
# Create training data with and without the dependent variable
train <- data[1:split, ]
train.treat <- select(train, -c(y))
# Create test data with and without the dependent variable
test <- data[(split+1):nrow(data), ]
test.treat <- select(test, -c(y))
# Run xgb.cv
cv <- xgb.cv(data = as.matrix(train.treat),
             label = train$y,
             nrounds = 100,
             nfold = 10,
             objective = "reg:squarederror",  # current name for "reg:linear", which is deprecated
             eta = 0.1,
             max_depth = 6,
             early_stopping_rounds = 10,
             verbose = 0  # silent
)
# Get the evaluation log
elog <- cv$evaluation_log
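# Quick peek at the log; it has columns like iter, train_rmse_mean, test_rmse_mean
head(elog)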
# Determine and print how many trees minimize training and test error
elog %>%
  summarize(ntrees.train = which.min(train_rmse_mean),  # index of the minimum train_rmse_mean
            ntrees.test = which.min(test_rmse_mean))    # index of the minimum test_rmse_mean
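# Alternative I spotted in the docs (assuming I'm reading them right): with
# early_stopping_rounds set, xgb.cv records the best round directly
print(cv$best_iteration)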
# The number of trees to use, hard-coded from the xgb.cv results above
ntrees <- 25
# Run xgboost
model_xgb <- xgboost(data = as.matrix(train.treat),  # training data as matrix
                     label = train$y,                # column of outcomes
                     nrounds = ntrees,               # number of trees to build
                     objective = "reg:squarederror", # objective (was "reg:linear", now deprecated)
                     eta = 0.001,
                     max_depth = 10,  # the parameter is max_depth, not depth
                     verbose = 0      # silent
)
# Make predictions
test$pred <- predict(model_xgb, as.matrix(test.treat))
# Plot predictions vs actual values of y
ggplot(test, aes(x = pred, y = y)) +
  geom_point() +
  geom_abline()
# Calculate RMSE
test %>%
  mutate(residuals = y - pred) %>%
  summarize(rmse = sqrt(mean(residuals^2)))
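# Sanity check (my own addition): RMSE of a trivial model that always predicts
# the training mean, as a baseline to judge the model's RMSE against
test %>%
  mutate(residuals = y - mean(train$y)) %>%
  summarize(rmse.baseline = sqrt(mean(residuals^2)))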
How does this look?
Also, one thing I don't get about xgboost() is why I have to remove the dependent variable from the data frame I pass to the data argument and then supply it separately through the label argument. Why is it done this way?
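While writing this up I came across xgb.DMatrix in the documentation, which appears to bundle the feature matrix and the label into a single object. If I understand it correctly, the training call above could equivalently be written as:
# Assuming I've understood the docs, xgb.DMatrix carries features and label together
dtrain <- xgb.DMatrix(data = as.matrix(train.treat), label = train$y)
model_xgb2 <- xgboost(data = dtrain,  # label now travels inside dtrain
                      nrounds = ntrees,
                      objective = "reg:squarederror",
                      verbose = 0)
Is that right, or is there a reason to prefer the separate data/label form?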