I'm dealing with 75 GB XML files, so it's impossible to load them into memory and build a DOM tree. Instead I process the file in chunks of lines (using readr::read_lines_chunked) in blocks of e.g. 10k lines. Below is a small demonstration with N = 3 lines where I extract the data needed to build the tibble, but it isn't very fast:
library(tidyverse)
xml <- c("<row Id=\"4\" Attrib1=\"1\" Attrib2=\"7\" Attrib3=\"2008-07-31T21:42:52.667\" Attrib4=\"645\" Attrib5=\"45103\" Attrib6=\"fjbnjahkcbvahjsvdghvadjhavsdjbaJKHFCBJHABCJKBASJHcvbjavbfcjkhabcjkhabsckajbnckjasbnckjbwjhfbvjahsdcvbzjhvcbwiebfewqkn\" Attrib7=\"8\" Attrib8=\"11652943\" Attrib9=\"Rich B\" Attrib10=\"2019-09-03T17:25:25.207\" Attrib11=\"2019-10-21T14:03:54.607\" Attrib12=\"1\" Attrib13=\"a|b|c|d|e|f|g\" Attrib14=\"13\" Attrib15=\"3\" Attrib16=\"49\" Attrib17=\"2012-10-31T16:42:47.213\"/>",
"<row Id=\"5\" Attrib1=\"2\" Attrib2=\"8\" Attrib3=\"2008-07-31T21:42:52.999\" Attrib4=\"649\" Attrib5=\"7634\" Attrib6=\"fjbnjahkcbvahjsvdghvadjhavsdjbaJKHFCBJHABCJKBASJHcvbjavbfcjkhabcjkhabsckajbnckjasbnckjbwjhfbvjahsdcvbzjhvcbwiebfewqkn\" Attrib7=\"8\" Attrib8=\"11652943\" Attrib9=\"Rich B\" Attrib10=\"2019-09-03T17:25:25.207\" Attrib11=\"2019-10-21T14:03:54.607\" Attrib12=\"2\" Attrib13=\"a|b|c|d|e|f|g\" Attrib14=\"342\" Attrib15=\"43\" Attrib16=\"767\" Attrib17=\"2012-10-31T16:42:47.213\"/>",
"<row Id=\"6\" Attrib1=\"3\" Attrib2=\"9\" Attrib3=\"2008-07-31T21:42:52.999\" Attrib4=\"348\" Attrib5=\"2732\" Attrib6=\"djhfbsdjhfbijhsdbfjkdbnfkjndaskjfnskjdlnfkjlsdnf\" Attrib7=\"9\" Attrib8=\"34873\" Attrib9=\"FHDHsf\" Attrib10=\"2019-09-03T17:25:25.207\" Attrib11=\"2019-10-21T14:03:54.607\" Attrib12=\"3\" Attrib13=\"a|b|c|d|e|f|g\" Attrib14=\"342\" Attrib15=\"43\" Attrib16=\"767\" Attrib17=\"2012-10-31T16:42:47.4333\"/>")
# regex capturing only the attributes I need; everything in between is skipped with .*
pattern <- paste(".*(Id=\"\\d+\") ",
                 "(Attrib1=\"\\d+\") ",
                 "(Attrib2=\"\\d+\") ",
                 "(Attrib3=\"[0-9]+-[0-9]+-[0-9]+T[0-9]+:[0-9]+:[0-9]+\\.[0-9]+\") ",
                 "(Attrib4=\"\\d+\") ",
                 "(Attrib5=\"\\d+\")",
                 ".*(Attrib8=\"\\d+\") ",
                 ".*(Attrib10=\"[0-9]+-[0-9]+-[0-9]+T[0-9]+:[0-9]+:[0-9]+\\.[0-9]+\") ",
                 "(Attrib11=\"[0-9]+-[0-9]+-[0-9]+T[0-9]+:[0-9]+:[0-9]+\\.[0-9]+\")",
                 ".*(Attrib13=\"([a-z]|[0-9]|\\||\\s)+\") ",
                 "(Attrib14=\"\\d+\") ",
                 "(Attrib15=\"\\d+\") ",
                 "(Attrib16=\"\\d+\")",
                 sep = "")
# match the groups in pattern and extract them; drop the full match (col 1)
# and the nested capture group inside Attrib13 (col 12)
tmp <- str_match(xml, pattern)[, -c(1, 12)]
# drop rows where the regex did not match (all NA)
r <- which(is.na(tmp[, 1]))
if (length(r) > 0) {
  tmp <- tmp[-r, , drop = FALSE]
}
# strip the Attrib= prefixes and keep only the values inside the double quotes
tmp <- apply(tmp, 1, function(s) {
  str_remove_all(str_match(s, "(\".*\")")[, -1], "\"")
})
# apply() returns one column per input row, so transpose back
tmp <- t(tmp)
tmp
# convert to a tibble
colnames(tmp) <- c("Id", "Attrib1", "Attrib2", "Attrib3", "Attrib4", "Attrib5", "Attrib8", "Attrib10", "Attrib11", "Attrib13", "Attrib14", "Attrib15", "Attrib16")
as_tibble(tmp)
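For completeness, this is roughly how I drive the same extraction over the real file with readr::read_lines_chunked. extract_rows() is just the pipeline above wrapped in a function, and the file name "Huge.xml" and the 10k chunk size are illustrative; the tidyverse and the pattern object from above are assumed to be loaded:
# extract_rows(): the regex pipeline above, applied to one block of lines
extract_rows <- function(lines) {
  m <- str_match(lines, pattern)[, -c(1, 12), drop = FALSE]
  m <- m[!is.na(m[, 1]), , drop = FALSE]   # drop non-matching lines
  if (nrow(m) == 0) return(NULL)
  m <- t(apply(m, 1, function(s) str_remove_all(str_match(s, "(\".*\")")[, -1], "\"")))
  colnames(m) <- c("Id", "Attrib1", "Attrib2", "Attrib3", "Attrib4", "Attrib5", "Attrib8",
                   "Attrib10", "Attrib11", "Attrib13", "Attrib14", "Attrib15", "Attrib16")
  as_tibble(m)
}
# DataFrameCallback row-binds the tibble returned for each chunk into one result
result <- read_lines_chunked("Huge.xml",
                             callback = DataFrameCallback$new(function(x, pos) extract_rows(x)),
                             chunk_size = 10000)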
Is there a better approach performance-wise?
UPDATE: I benchmarked the code above on 10k lines (instead of 3) and it took 900 seconds. Reducing the number of captured attribute groups from 13 to 7 (only the critically important ones) brought the same benchmark down to 128 seconds.
Extrapolating to the full 9,731,474 lines (about 973 chunks of 10k), that takes the estimate from ~10 days down to ~35 hours. I then split the big file into 6 pieces, to match the number of cores I have, using the Linux command split -l1621913 -d Huge.xml Huge_split_ --verbose (which produces Huge_split_00 ... Huge_split_05), and now run the chunk code on each piece in parallel, so I'm looking at 35/6 ≈ 5.8 hours, which is not too bad. I do (process_split() below stands in for the chunked extraction shown above):
library(doMC)
registerDoMC(6)  # one worker per split file / core
resultList <- foreach(i = 0:5) %dopar% {
  file <- sprintf('Huge_split_0%d', i)  # Huge_split_00 ... Huge_split_05
  partial <- process_split(file)        # placeholder for the chunk algorithm run on this split file
  partial                               # the last expression is what foreach collects
}
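Afterwards the six partial tibbles can be stacked into one result (foreach also accepts a .combine function if you prefer to do this inline):
final <- dplyr::bind_rows(resultList)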