Am I using xgboost() correctly (in R)?

I'm a beginner with machine learning (and also R). I've figured out how to run some basic linear regression, elastic net, and random forest models in R and have gotten some decent results for a regression project (with a continuous dependent variable) that I'm working on.

I've been trying to learn how to use the gradient boosting algorithm and, in particular, the xgboost() command. My results are way worse here, though, and I'm not sure why.

I was hoping someone could take a look at my code and see if there are any glaring errors.

# Load the helper packages used below
library(dplyr)    # select(), %>%, summarize(), mutate()
library(ggplot2)  # ggplot()

# Create training data with and without the dependent variable
train <- data[1:split, ]
train.treat <- select(train, -c(y))

# Create test data with and without the dependent variable
test <- data[(split+1):nrow(data), ]
test.treat <- select(test, -c(y))

# Load the package xgboost
library(xgboost)

# Run xgb.cv
cv <- xgb.cv(data = as.matrix(train.treat), 
             label = train$y,
             nrounds = 100,
             nfold = 10,
             objective = "reg:linear",
             eta = 0.1,
             max_depth = 6,
             early_stopping_rounds = 10,
             verbose = 0   # silent
)

# Get the evaluation log
elog <- cv$evaluation_log

# Determine and print how many trees minimize training and test error
elog %>% 
  summarize(ntrees.train = which.min(train_rmse_mean),   # find the index of min(train_rmse_mean)
            ntrees.test  = which.min(test_rmse_mean))    # find the index of min(test_rmse_mean)


# The number of trees to use, as determined by xgb.cv
ntrees <- 25
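
# (Instead of hardcoding 25, I believe the same number could be read off the
# evaluation log programmatically, e.g.:
#   ntrees <- which.min(elog$test_rmse_mean)
# I've kept the hardcoded value so the results below match what I ran.)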

# Run xgboost
model_xgb <- xgboost(data = as.matrix(train.treat),  # training data as matrix
                     label = train$y,                # column of outcomes
                     nrounds = ntrees,               # number of trees to build
                     objective = "reg:linear",       # objective
                     eta = 0.001,
                     max_depth = 10,
                     verbose = 0                     # silent
)

# Make predictions
test$pred <- predict(model_xgb, as.matrix(test.treat))

# Plot predictions vs actual bike rental count
ggplot(test, aes(x = pred, y = y)) + 
  geom_point() + 
  geom_abline()

# Calculate RMSE
test %>%
  mutate(residuals = y - pred) %>%
  summarize(rmse = sqrt(mean(residuals^2)))

How does this look?

Also, one thing I don't get about xgboost() is why I have to remove the dependent variable from the dataset for the data argument and then pass it back in separately through the label argument. Why do we do this?
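
For what it's worth, my rough understanding (and please correct me if it's wrong) is that xgboost() has no formula interface like lm(y ~ ., data): it expects a numeric matrix of predictors plus a separate vector of outcomes, and the package's native data structure, xgb.DMatrix, just bundles the two together. A minimal sketch of what I mean, reusing the objects from above:

# Bundle the feature matrix and the outcome vector into one object
dtrain <- xgb.DMatrix(data = as.matrix(train.treat), label = train$y)

# xgb.train() then takes the bundled data plus a parameter list
model <- xgb.train(params = list(objective = "reg:linear",
                                 eta = 0.1,
                                 max_depth = 6),
                   data = dtrain,
                   nrounds = ntrees)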

