
Optimize calls to mutate and summarise?


I have this R script:

rm(list = ls())

library(tidyr)
suppressWarnings(library(dplyr))
outFile = "zFinal.lua"

# Clear the console
cat("\014\n")

# Truncate / create the output file
cat(file = outFile, sep = "")

filea <- read.csv("csva.csv", strip.white = TRUE)
fileb <- read.csv("csvb.csv", strip.white = TRUE, sep = ";", header = FALSE)

# Join column 3 of csva (ColB) against column 1 of csvb, then build one
# collapsed, quoted string per ColA group
df <-
    merge(filea, fileb, by.x = c(3), by.y = c(1)) %>%
    subset(select = c(1, 3, 6, 2)) %>%
    arrange(ColA, ColB, V2) %>%
    group_by(ColA) %>%
    mutate(V2 = paste0('"', V2, "#", ColB, '"')) %>%
    summarise(ID = paste(V2, collapse = ", ", sep = ";")) %>%
    mutate(ID = paste0('["', ColA, '"] = {', ID, '},')) %>%
    mutate(ID = paste0('\t\t', ID))

df <- df[c("ID")]

# Write the Lua table wrapper around the collapsed rows
cat("\n\tmyTable = {\n", file = outFile, append = TRUE, sep = "\n")
write.table(df, append = TRUE, file = outFile, sep = ",", quote = FALSE, row.names = FALSE, col.names = FALSE)
cat("\n\t}", file = outFile, append = TRUE, sep = "\n")

# Done
cat("\nDONE.", sep = "\n")

As you can see, this script opens csva.csv and csvb.csv.

This is csva.csv:

ID,ColA,ColB,ColC,ColD
2,3,100,1,1
3,7,300,1,1
5,7,200,1,1
11,22,900,1,1
14,27,500,1,1
16,30,400,1,1
20,36,900,1,1
23,39,800,1,1
24,42,700,1,1
29,49,800,1,1
45,3,200,1,1

And this is csvb.csv:

100;file1
200;file2
300;file3
400;file4

This is the output file that the script produces from those csv files:

myTable = {

    ["3"] = {"file1#100", "file2#200"},
    ["7"] = {"file2#200", "file3#300"},
    ["30"] = {"file4#400"},

}

This output file is exactly what I want. It's perfect.

Here is what the script does. I'm not sure I can explain it very well, so if this isn't clear, feel free to skip this section.

For each line in csva.csv, if ColB (csva) contains a number that also appears in column 1 (csvb), then the output file should contain a line like this:

["3"] = {"file1#100", "file2#200"},

So, in the above example, the first line of csva contains number 3 in ColA, and ColB for that line is 100. In csvb, column 1 contains 100 and column 2 contains file1, which the script turns into "file1#100".

Because csva contains another 3 in ColA (the last line), that row is also processed and written to the same output line.
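To make the matching concrete, this is roughly what the merge step produces for the two sample files, before any of the string formatting (I've written the join by column name instead of by position here; the commented lines show the printed result):

merge(filea, fileb, by.x = "ColB", by.y = "V1") %>%
    arrange(ColA, ColB, V2) %>%
    select(ColA, ColB, V2)
#   ColA ColB    V2
# 1    3  100 file1
# 2    3  200 file2
# 3    7  200 file2
# 4    7  300 file3
# 5   30  400 file4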

So the script runs well and produces exactly the output I want. The problem is how long it takes. The csva and csvb files in this question are only a few lines long, so the output is instant.

However, the real data I have to work with is much larger: csva is over 300,000 lines and csvb is over 900,000 lines. At that size the script still works, but it takes far too long to run to be feasible.

By commenting out lines one at a time, I've narrowed the slowdown down to mutate and summarise. Without those lines the script runs in about 30 seconds; with them it takes hours.

I'm not very advanced with R. How can I make the script run faster, either by improving my syntax or by using faster alternatives to mutate and summarise?
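For what it's worth, I've read that data.table is often recommended for large group-wise operations like this. Below is a rough sketch of how I think my pipeline would translate (same file names and columns as above); I haven't been able to confirm it produces identical output on the full data, so please correct it if it's wrong:

library(data.table)

outFile <- "zFinal.lua"
cat(file = outFile, sep = "")                         # truncate / create the output file

dta <- fread("csva.csv", strip.white = TRUE)
dtb <- fread("csvb.csv", sep = ";", header = FALSE)   # columns become V1, V2

# join ColB of csva against column 1 of csvb, as in merge() above
dt <- merge(dta, dtb, by.x = "ColB", by.y = "V1")
setorder(dt, ColA, ColB, V2)

# build and collapse the quoted strings once per ColA group
out <- dt[, .(ID = paste0('"', V2, "#", ColB, '"', collapse = ", ")), by = ColA]
out[, ID := paste0('\t\t["', ColA, '"] = {', ID, '},')]

cat("\n\tmyTable = {\n", file = outFile, append = TRUE, sep = "\n")
write.table(out[, .(ID)], file = outFile, append = TRUE, sep = ",",
            quote = FALSE, row.names = FALSE, col.names = FALSE)
cat("\n\t}", file = outFile, append = TRUE, sep = "\n")

The idea, as far as I understand it, is that the quoting and collapsing happen in a single grouped data.table call, rather than a grouped mutate followed by summarise. Is something like this the right direction?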
