I have a very large (10 million row x 12 column) comma-delimited text file. The first column contains UNIX times (in seconds, to 2 d.p.).
I would like to extract all rows corresponding to a particular date (e.g. 2014-06-26) and save the rows for each date to a separate, smaller file.
In the code below I scan through the file, reading the first number in each row (the time), and print the row number whenever the date associated with the current row differs from that of the previous row:
## create fake data; the real file has many duplicate times and rows that are not always in order
con <- "BigFile.txt"; rile.remove(con)
Times <- seq ( 1581259391, 1581259391 + (7*24*3600), by=100)
write.table(data.frame(Time=Times, x=runif(n = length(Times))), file=con, sep=",", row.names=F, col.names=F, append=F)
## read in fake data line-by-line, noting the row at which each new date starts
con <- file( "BigFile.txt", open="r")
Row <- 0
Now <- 0
Last <- 0
while (length(myLine <- scan(con, what=numeric(), nlines=1, sep=',', quiet=TRUE)) > 0) {
  Row <- Row + 1
  Now <- as.Date(as.POSIXct(myLine[1], origin="1970-01-01", tz="GMT"))
  if (Now != Last) print(data.frame(Row, Now))  ## report the row where a new date starts
  Last <- Now
}
close(con)
The idea would then be to save these indices and use them to cut the file into smaller daily chunks. However, I am sure there must be much more efficient approaches (I have tried reading these files with the data.table package, but still run into memory issues).
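For what it's worth, the rough shape I have in mind is something like the sketch below: read the file in modest chunks and append each chunk's rows to a per-day file, so the full 10 million rows never need to be in memory at once. The chunk size of 100,000 lines and output names like day_2014-06-26.txt are just placeholders, not anything from my real workflow.

## sketch: split the big file into per-day files, one chunk at a time
con <- file("BigFile.txt", open="r")
repeat {
  lines <- readLines(con, n=100000)                  # next chunk of raw lines
  if (length(lines) == 0) break                      # end of file reached
  chunk <- read.csv(text=lines, header=FALSE)
  day   <- format(as.POSIXct(chunk[[1]], origin="1970-01-01", tz="GMT"), "%Y-%m-%d")
  for (d in unique(day)) {                           # append each day's rows to its own file
    write.table(chunk[day == d, ], file=paste0("day_", d, ".txt"),
                sep=",", row.names=FALSE, col.names=FALSE, append=TRUE)
  }
}
close(con)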
Any pointers will be greatly appreciated.