I have a dataset with 18 variables and 30k+ observations, collected from sensors every 15 min. Some of the variables record temperature and will be used as independent variables to predict the state of a structure. Over this period some data is missing because the sensors were offline, and some other observations were removed because they were outliers.
Is there a way to interpolate the missing values with reasonable accuracy? I've been trying to build a dataset containing every 15-minute timestamp and then merge it with the observed data:
library(lubridate)
# Characters 15-16 of Time hold the minutes (the column is called Hour, but it stores minutes).
# Convert to integer so the values are handled as numbers rather than compared as strings.
df$Hour <- as.integer(substr(df$Time, 15, 16))
# Round the minutes down to the 15-minute grid: 0, 15, 30 or 45, zero-padded to two digits
df$Hour <- sprintf("%02d", (df$Hour %/% 15) * 15)
# Rebuild the timestamp from the date/hour prefix and the rounded minutes
df$Time <- ymd_hm(paste0(substr(df$Time, 1, 14), df$Hour))
allDates <- seq(ISOdate(2018,4,27,14,15), ISOdate(2018,9,10,19,45), by = "15 min")
allValues <- merge(data.frame(Time=allDates),df,all.x=TRUE)
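To make the steps above easier to reproduce, here is a minimal sketch on toy data. The three rows and the `Temp` column are made up for illustration, and it assumes `Time` is a character column in "YYYY-MM-DD HH:MM" format. It uses lubridate's `floor_date()` instead of the substring approach, which should give the same rounding.

```r
library(lubridate)

# Hypothetical toy data: irregular timestamps and one temperature column
df <- data.frame(
  Time = c("2018-04-27 14:15", "2018-04-27 14:32", "2018-04-27 15:01"),
  Temp = c(20.1, 20.4, 21.0),
  stringsAsFactors = FALSE
)

# Parse the strings, then round each timestamp down to the 15-minute grid
df$Time <- floor_date(ymd_hm(df$Time), unit = "15 minutes")

# Full 15-minute grid over the observed range
allDates <- seq(min(df$Time), max(df$Time), by = "15 min")

# Left join onto the grid: slots without a reading get NA
allValues <- merge(data.frame(Time = allDates), df, all.x = TRUE)
print(allValues)
```

In this toy case the 14:45 slot has no reading, so it appears in `allValues` with `Temp` set to NA.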
After this I try to use the Amelia package, but it returns an error: "There are observations in the data that are completely missing"
amelia(df, m = 5, p2s = 1, frontend = FALSE, idvars = "id",
ts = "Time", cs = NULL, polytime = NULL, splinetime = NULL, intercs = FALSE,
lags = 2:17, leads = 2:17, startvals = 0, tolerance = 0.0001,
logs = NULL, sqrts = NULL, lgstc = NULL, noms = NULL, ords = NULL,
incheck = TRUE, collect = FALSE, arglist = NULL, empri = 0.1*nrow(DadosTS),
priors = NULL, autopri = 0.05, emburn = c(0,0), bounds = NULL,
max.resample = 100, overimp = NULL, boot.type = "ordinary",
ncpus = getOption("amelia.ncpus", 1L), cl = NULL)
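My guess is that the error refers to rows where every sensor column is NA, which is what the merge produces for time slots with no reading at all, but this is an assumption I have not confirmed. A small sketch of how such rows can be counted, using a made-up `allValues` with hypothetical columns `Temp1` and `Temp2`:

```r
# Hypothetical merged data: a time index plus two sensor columns,
# where the third grid row has no readings at all
allValues <- data.frame(
  Time  = as.POSIXct("2018-04-27 14:15", tz = "UTC") + 900 * 0:3,
  Temp1 = c(20.1, 20.4, NA, 21.0),
  Temp2 = c(18.0, NA, NA, 18.6)
)

# Columns that hold measurements (everything except the time index)
sensorCols <- setdiff(names(allValues), "Time")

# TRUE for rows in which every sensor column is NA
allMissing <- rowSums(!is.na(allValues[sensorCols])) == 0

sum(allMissing)             # number of fully missing rows
allValues$Time[allMissing]  # timestamps of those rows
```

On my real data the same check would need `sensorCols` adjusted to exclude the `id` column as well.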
Any method to deal with this problem, or to use the available data to build a time-series model, would be very much appreciated.