Channel: Active questions tagged r - Stack Overflow

linear regression: comparing R2 in the full model to R2 under k-fold cross-validation

Below is R code that does the following:

1. Generates data for linear regression (4 predictors, multivariate normal data, based on a correlation matrix)
2. Runs 10-fold cross-validation using caret, providing summary R2 results
3. Correlates the predicted values across all the folds with the actual values, then squares the correlation to get the cross-validated R2 (this is the variable 'ar2' in the code below)

So my question is about #3 above: why doesn't caret just compute this? Instead, it reports an R2 within each fold, reports the variability in R2 across folds, etc. But if I want to know overall out-of-sample prediction based on cross-validation, #3 above seems more direct.


# cross-validated linear regression
library(MASS)
library(caret)

# first generate random normal data
set.seed(123)   # arbitrary seed, added only so the results are reproducible
sigma <- matrix(c( 1,  .35, .20, .10, .25, 
                  .35, 1  , .15, .30, .30,
                  .20, .15,  1 , .40, .20,
                  .10, .30, .40, 1  , .35,
                  .25, .30, .20, .35,   1), ncol=5)

d <- mvrnorm(n = 100, rep(0, 5), sigma)

# label variables here
colnames(d) <- c(paste0("x", 1:4),"y")
# look at top of data set
head(d)

# compute column means and correlations
colMeans(d)
cor(d)
d <- as.data.frame(d)

# what if we used the whole sample, no cross-validation?
full <- lm(y ~ ., data = d)
summary(full)

# now let's look at cross-validated prediction

data_ctrl <- trainControl(method = "cv", number = 10, savePredictions="all")     # folds for cross-validation
model_caret <- train(y ~ .,   # model to fit - the dot means include all x's
                     data = d,                        
                     trControl = data_ctrl,              # include the folds above
                     method = "lm")                      # specify linear regression                
model_caret           # results from cross-validation
# look at performance metrics (RMSE, Rsquared, MAE) for each fold
model_caret$resample
# summarized results
model_caret$results
# summary of the final model, refit on all the data
summary(model_caret) 

# what is the r2 between observed and predicted values?
# get the predicted values across folds
a <- model_caret$pred
# correlate actual and predicted values (columns referenced by name, not position)
ar2 <- cor(a$obs, a$pred)^2
ar2

# ...we can compare this r2 (ar2) from cross-validation to the r2 from the full model
# and get a direct sense of how r2 goes down under cross validation...right?
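
To make the comparison concrete, here is a minimal, self-contained sketch that puts the different R2 values side by side. It is an illustration under stated assumptions, not a statement of how caret is meant to be used: the seed and the equicorrelated matrix (all off-diagonals 0.3) are stand-ins for the data above, and the object names (fit, p, fold_r2, and so on) are made up for this example. It relies on caret's default regression summary computing Rsquared in each fold as the squared correlation between observed and predicted values, and on the reported Rsquared being the mean of those fold values.

```r
library(MASS)
library(caret)

set.seed(1)  # arbitrary seed, only so the sketch is reproducible

# hypothetical stand-in data: 4 predictors plus y, all pairwise correlations 0.3
sigma <- diag(5)
sigma[sigma == 0] <- 0.3
d <- as.data.frame(mvrnorm(n = 100, mu = rep(0, 5), Sigma = sigma))
names(d) <- c(paste0("x", 1:4), "y")

# in-sample R2 from the model fit to the whole data set
full_r2 <- summary(lm(y ~ ., data = d))$r.squared

# 10-fold cross-validation, keeping the held-out predictions
ctrl <- trainControl(method = "cv", number = 10, savePredictions = "all")
fit  <- train(y ~ ., data = d, method = "lm", trControl = ctrl)
p    <- fit$pred   # one held-out prediction per row of d

# (a) pooled: squared correlation over all held-out predictions at once
pooled_r2 <- cor(p$obs, p$pred)^2

# (b) per fold: squared correlation within each fold, then averaged
fold_r2      <- sapply(split(p, p$Resample), function(f) cor(f$obs, f$pred)^2)
mean_fold_r2 <- mean(fold_r2)

# (c) what caret reports as Rsquared in its results table
reported_r2 <- fit$results$Rsquared

# (d) pooled "traditional" R2, 1 - SSE/SST, which also penalises bias in the predictions
pooled_trad_r2 <- 1 - sum((p$obs - p$pred)^2) / sum((p$obs - mean(p$obs))^2)

round(c(full = full_r2, pooled = pooled_r2, mean_fold = mean_fold_r2,
        reported = reported_r2, pooled_traditional = pooled_trad_r2), 3)
```

With only about 10 observations per fold, the fold-level values in (b) can vary a lot, which may be part of why the pooled value (a) and the reported mean (c) differ. caret also exports R2(pred, obs, formula = "traditional") if you prefer not to compute (d) by hand.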
