I am trying to recreate a random forest model from a paper, and the code doesn't seem to work. I am only just learning R and this is very much over my head, but I will try to explain as best I can.
The source code from the paper can be found here: https://journals.plos.org/plosone/article/file?id=10.1371/journal.pone.0181347.s002&type=supplementary
The paper supplies two datasets, training and test, and the code then creates two subsets of each (see the bottom of this post for the `head()` of the data). The data can be found in the supplementary paper here
(you should be able to copy it directly to a .csv). The code is below:

    sink("test.txt", split=TRUE)
    print("#data process")
    data_bin_train<-read.csv("training.csv", head=TRUE)
    names(data_bin_train)
    data_bin_test<-read.csv("test.csv", head=TRUE)
    names(data_bin_test)
    dspt_bin_train<-subset(data_bin_train,select=c(-Deamidation))
    dspt_bin_test<-subset(data_bin_test,select=c(-Deamidation))
    class_bin_train<-subset(data_bin_train, select=c(Deamidation))
    class_bin_test<-subset(data_bin_test, select=c(Deamidation))
    library("caret")
    library("ROCR")
    library("pROC")
    fitControl <- trainControl(method = "CV",number = 10,returnResamp = "all", verboseIter = FALSE, classProbs = TRUE)
    set.seed(2)

This bit works fine. The next bit of code is where I get the error:

    library("randomForest")
    print("#Random Forest binary class via caret (randomForest)")
    caret_rf_bin_randomf_cv10 <- train(Deamidation~., data=data_bin_train, method = "rf", preProcess = c("center", "scale"), tuneLength = 10, trControl = fitControl)
    caret_rf_bin_randomf_cv10
    varImp(caret_rf_bin_randomf_cv10)
    rf_bin_Preds <- extractPrediction(list(caret_rf_bin_randomf_cv10),testX=dspt_bin_test[,1:13], testY=class_bin_test[,1])

    Error in `[.data.frame`(newdata, , object$method$center, drop = FALSE) : undefined columns selected

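One check that might help narrow this down: the error says the pre-processing step is looking for columns that the test data does not have. A possible cause (an assumption, not confirmed for this model) is that the formula interface expands factor columns into dummy variables, so the names stored for centering differ from the raw column names. The sketch below uses only base R and small made-up data frames (`train_df`, `test_x` and their values are hypothetical stand-ins, not the real data) to show how such a mismatch can be listed with `setdiff()`.

```r
# Toy data standing in for the real training/test sets (hypothetical values).
train_df <- data.frame(
  Deamidation = factor(c("yes", "no", "yes", "no")),
  attack_distance = c(3.84, 4.81, 4.11, 3.33),
  AA_following = factor(c("GLY", "SER", "CYS", "GLY"))
)
test_x <- data.frame(
  attack_distance = c(3.69, 4.64),
  AA_following = factor(c("GLY", "PHE"))
)

# A formula such as Deamidation ~ . expands factors into dummy columns,
# so the predictor names differ from the raw column names.
# [-1] drops the "(Intercept)" column.
expanded_names <- colnames(model.matrix(Deamidation ~ ., data = train_df))[-1]

# Names that exist after expansion but are absent from the raw test frame.
missing_in_test <- setdiff(expanded_names, names(test_x))
print(missing_in_test)
```

On the real objects, the analogous comparison would be between the names stored in the fitted model's pre-processing step and `names(dspt_bin_test[,1:13])`; any names listed are the "undefined columns" being selected.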
Any help would be amazing! The paper used R v3.1.1 and caret_6.0-35, whereas I am running newer versions of both, which is where I believe the error is coming from, but I'm not sure how to fix it or, to be honest, what the error even means.
Thank you
TinoMass
Below are the `sessionInfo()` output and the `head()` of the two data sets:

    R version 3.5.3 (2019-03-11)
    Platform: x86_64-w64-mingw32/x64 (64-bit)
    Running under: Windows 10 x64 (build 18362)

    Matrix products: default

    attached base packages:
    [1] stats graphics grDevices utils datasets methods base

    other attached packages:
    [1] randomForest_4.6-14 pROC_1.15.3 ROCR_1.0-7 gplots_3.0.1.1 caret_6.0-84
    [6] ggplot2_3.2.1 lattice_0.20-38

    loaded via a namespace (and not attached):
    [1] Rcpp_1.0.3 pillar_1.4.3 compiler_3.5.3 gower_0.2.1 plyr_1.8.5 bitops_1.0-6
    [7] iterators_1.0.12 class_7.3-15 tools_3.5.3 rpart_4.1-13 ipred_0.9-9 lubridate_1.7.4
    [13] lifecycle_0.1.0 tibble_2.1.3 nlme_3.1-137 gtable_0.3.0 pkgconfig_2.0.3 rlang_0.4.2
    [19] Matrix_1.2-15 foreach_1.4.7 rstudioapi_0.10 prodlim_2019.11.13 e1071_1.7-3 withr_2.1.2
    [25] stringr_1.4.0 dplyr_0.8.3 caTools_1.17.1.3 gtools_3.8.1 generics_0.0.2 recipes_0.1.8
    [31] stats4_3.5.3 grid_3.5.3 nnet_7.3-12 tidyselect_0.2.5 data.table_1.12.8 glue_1.3.1
    [37] R6_2.4.1 survival_2.43-3 gdata_2.18.0 lava_1.6.6 reshape2_1.4.3 purrr_0.3.3
    [43] magrittr_1.5 ModelMetrics_1.2.2 scales_1.1.0 codetools_0.2-16 MASS_7.3-51.1 splines_3.5.3
    [49] assertthat_0.2.1 timeDate_3043.102 colorspace_1.4-1 KernSmooth_2.23-15 stringi_1.4.3 lazyeval_0.2.2
    [55] munsell_0.5.0 crayon_1.3.4

training.txt

      PDB   `Residue #` `AA following A… attack_distance Half_life norm_B_factor_C norm_B_factor_CA norm_B_factor_CB norm_B_factor_CG
      <chr>       <dbl> <chr>                      <dbl>     <dbl>           <dbl>            <dbl>            <dbl>            <dbl>
    1 11BG           67 GLY                         3.84      1.02            1.46             1.46             1.36             1.38
    2 11BG           17 SER                         4.81      11.8           0.692            0.706             1.18             1.62
    3 11BG           71 CYS                         4.11      55.5           0.174            0.481            0.574            0.782
    4 11BG           44 THR                         3.33      49.9           -1.24            -1.30            -1.35            -1.52
    5 11BG           94 CYS                         4.97        60            1.41             1.64             1.92             2.15
    6 11BG           27 LEU                         4.52       119          -0.898           -0.905           -0.820           -0.604

test.txt

      PDB   `Residue #` `AA following A… attack_distance Half_life norm_B_factor_C norm_B_factor_CA norm_B_factor_CB norm_B_factor_CG
      <chr>       <dbl> <chr>                      <dbl>     <dbl>           <dbl>            <dbl>            <dbl>            <dbl>
    1 1ACC          713 GLY                         3.69      1.45            3.17             3.35             3.63             4.06
    2 1ACC          719 GLY                         4.64      1.04           0.688            0.865             1.42             1.83
    3 1ACC           28 PHE                         4.81      72.4            1.03             1.06             1.58             1.95
    4 1ACC           52 ILE                         4.73       279           0.944             1.13             1.29             1.46
    5 1ACC           85 HIS                         3.60       9.7           0.780            0.800             1.16             1.57
    6 1ACC          104 LYS                         4.51      55.5            2.22             2.47             2.69             2.91