I am trying to predict future site traffic based on survey data (iv1, iv2, iv3, etc.). The model is supposed to train on the previous data and predict future site traffic using survey data that has already been collected.
The regression is run on the full data frame, and the logical vector q excludes the rows where the future site traffic is still missing. The test data covers only January and February; it has all the survey data but no site traffic, because that hasn't happened yet.
My data looks a little like this:
date <- c("3-28-2019", "4-28-2019", "5-28-2019", "6-28-2019", "7-28-2019", "8-28-2019",
"9-28-2019", "10-28-2019", "11-28-2019", "12-28-2019", "1-28-2020", "2-28-2020")
sitetraffic <- c(80, 99, 70, 65, 88, 90, 76, 65, 67, 68, NA, NA)
iv1 <- c(82, 93, 72, 61, 89, 93, 71, 63, 64, 65, 82, 62)
iv2 <- c(80, 99, 82, 62, 70, 65, 88, 90, 76, 93, 71, 99)
iv3 <- c(71, 63, 64, 71, 99, 76, 65, 67, 93, 72, 68, 89)
#etc
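To make the setup reproducible, here is a minimal sketch of how the vectors above could be combined into a data frame and split. The name `dataframe` matches the model call further down; `train`, and the way `test` is derived from the rows with missing traffic, are assumptions for illustration, and only the three IVs shown above are included.

```r
# Assemble the example vectors into one data frame (only iv1..iv3 shown).
dataframe <- data.frame(
  date = as.Date(
    c("3-28-2019", "4-28-2019", "5-28-2019", "6-28-2019", "7-28-2019",
      "8-28-2019", "9-28-2019", "10-28-2019", "11-28-2019", "12-28-2019",
      "1-28-2020", "2-28-2020"),
    format = "%m-%d-%Y"
  ),
  sitetraffic = c(80, 99, 70, 65, 88, 90, 76, 65, 67, 68, NA, NA),
  iv1 = c(82, 93, 72, 61, 89, 93, 71, 63, 64, 65, 82, 62),
  iv2 = c(80, 99, 82, 62, 70, 65, 88, 90, 76, 93, 71, 99),
  iv3 = c(71, 63, 64, 71, 99, 76, 65, 67, 93, 72, 68, 89)
)

# Assumed split: rows without observed traffic (Jan/Feb 2020) are the test set.
train <- dataframe[!is.na(dataframe$sitetraffic), ]
test  <- dataframe[is.na(dataframe$sitetraffic), ]
```

With this split there are 10 rows available for fitting and 2 rows to predict.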
Here is the code:
q <- !is.na(dataframe$sitetraffic) # q is TRUE for rows with observed site traffic, so the future NA rows are left out of the fit
# Store the model under its own name so it does not mask the lm() function
fit <- lm(sitetraffic ~ iv1 + iv2 + iv3 + iv4 + iv5 + iv6 + iv7 + iv8 + iv9
+ iv10 + iv11 + iv12 + iv13 + iv14, data = dataframe, subset = q)
fcast <- predict(fit, newdata = test)
This code predicts as expected when I include only around 10 independent variables, but I just get NAs if I use any more. I have also tried predict.lm() and prediction(), but neither worked with all the IVs. Are there any more powerful versions of predict() that can handle more IVs?
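In case it helps with diagnosing this, below is a rough sketch of a check that compares the number of coefficients the model asks for with what lm() was actually able to estimate. It uses synthetic random data with the same shape as my problem (10 training rows, 14 IVs), not my real survey data, and the names `X`, `n_train` and `n_iv` are made up for the example. I am not certain this is the cause of the NAs, only that it is cheap to check.

```r
set.seed(1)  # synthetic data, only to mimic the shape of the problem

n_train <- 10  # rows with observed site traffic
n_iv    <- 14  # number of independent variables

X <- as.data.frame(matrix(rnorm(n_train * n_iv), nrow = n_train))
names(X) <- paste0("iv", seq_len(n_iv))
X$sitetraffic <- rnorm(n_train)

fit <- lm(sitetraffic ~ ., data = X)

n_coef   <- length(coef(fit))    # intercept + 14 slopes = 15
n_na     <- sum(is.na(coef(fit)))  # coefficients lm() could not estimate
resid_df <- df.residual(fit)     # rows left over after estimating coefficients

print(c(coefficients = n_coef, not_estimated = n_na, residual_df = resid_df))
```

If `not_estimated` is above zero, the model has more coefficients than the training rows can support, so some of them come back as NA in `coef(fit)`.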