I have a folder of 10,000+ CSV files stored on my hard drive. Each CSV is for one species and gives its presence in raster cells (so over 5 million cells if the species were present in every cell on earth).
I need to read each file, use dplyr to join it to other data frames and summarise, then return a summary data frame. I don't have a server to run this on and it's stalling my desktop. It works with a subset of 17 species CSVs, but even then it's slow.
This is similar to a few other questions about dealing with big data, but I can't figure out the right combination of packages like data.table, bigmemory, and future. I think the really slow part is the dplyr commands, as opposed to reading the files, but I'm not sure.
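To check which step is actually slow, the read and the dplyr pipeline could be timed separately on a single file. This is only a minimal sketch: the temporary CSV and the small `cellid_df` below are made-up stand-ins (100,000 cells rather than 5 million+), and the column names `cell_id`, `presence` and `region` are taken from my code further down.

```r
library(data.table)
library(dplyr)

# Hypothetical stand-in data; swap in one real species file and the real cellid_df
set.seed(1)
n_cells <- 1e5
csv_file <- tempfile(fileext = ".csv")
fwrite(data.table(cell_id  = seq_len(n_cells),
                  presence = rbinom(n_cells, 1, 0.1)),
       csv_file)
cellid_df <- data.frame(cell_id = seq_len(n_cells),
                        region  = sample(letters, n_cells, replace = TRUE))

# Time the read on its own
t_read <- system.time(spp <- fread(csv_file, sep = ","))

# Time the filter / join / summarise on its own
t_dplyr <- system.time(
  res <- spp %>%
    filter(presence == 1) %>%
    left_join(cellid_df, by = "cell_id") %>%
    group_by(region) %>%
    summarise(num_cells = n()) %>%
    ungroup()
)

cat(sprintf("read:  %.3f s\ndplyr: %.3f s\n",
            t_read[["elapsed"]], t_dplyr[["elapsed"]]))
```

Timings on toy data this small won't match the real files, so the comparison is only meaningful once a real CSV and the full `cellid_df` are substituted in.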
I'm not sure whether this can be answered without the files, but they're huge, so I don't know how to make the example reproducible.
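One way to get around the file size might be to generate small synthetic files with the same shape. The sketch below is an assumption about that shape, not the real data: it assumes each CSV has a `cell_id` column and a 0/1 `presence` column with one row per cell, the region names are invented, and everything is scaled down to 100,000 cells and 3 species.

```r
library(data.table)

set.seed(1)
n_cells <- 1e5   # scaled down from the 5 million+ real cells
out_dir <- file.path(tempdir(), "spp_csvs")
dir.create(out_dir, showWarnings = FALSE)

# Stand-in for cellid_df: one row per cell id with a made-up region name
cellid_df <- data.table(
  cell_id = seq_len(n_cells),
  region  = sample(paste0("region_", 1:20), n_cells, replace = TRUE)
)

# Stand-in species ids, with one CSV per species named like the real files
spp_ids <- 1:3
for (id in spp_ids) {
  spp <- data.table(
    cell_id  = seq_len(n_cells),
    presence = rbinom(n_cells, 1, 0.1)   # roughly 10% of cells occupied
  )
  fwrite(spp, file.path(out_dir, sprintf("chrstoremove_%s.csv", id)))
}

list.files(out_dir)
```

Pointing `file.path()` in the loop below at `out_dir` instead of the real folder would then let anyone run the code end to end.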
spp_ids <- <vector of the species ids, in this case 17 of them>
spp_list <- <dataframe with ids of the 17 spp in the folder>
spp_info <- <dataframe with the species id and then some other columns>
cellid_df <- <big df with 5 million+ cell ids and corresponding region names>
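library(dplyr)         # provides %>%
library(future.apply)  # provides future_lapply()
# Loop over the species ids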
spp_regions <- future_lapply(spp_ids, FUN = function(x) {
  csv_file <- file.path("//filepathtoharddrivefolder",
                        sprintf("chrstoremove_%s.csv", x)) # I pull just the id number from the file names

  # Summarise the number of cells per region for this species
  spp_region_summary <- data.table::fread(csv_file, sep = ",") %>%
    dplyr::mutate(spp_id = x) %>%
    dplyr::filter(presence == 1) %>% # keep cell ids where the species is present
    dplyr::left_join(cellid_df, by = "cell_id") %>%
    dplyr::group_by(region, spp_id) %>%
    dplyr::summarise(num_cells = dplyr::n()) %>% # count rows per group
    dplyr::ungroup()

  # Add some additional information
  spp_region_summary <- spp_region_summary %>%
    dplyr::left_join(spp_info, by = "spp_id") %>%
    dplyr::left_join(spp_list, by = "spp_id") %>%
    dplyr::select(region, spp_id, num_cells)

  return(spp_region_summary)
})
# Combine the per-species summaries and write them out
spp_regions_df <- dplyr::bind_rows(spp_regions)
data.table::fwrite(spp_regions_df, "filepath.csv")
I haven't worked with this much data before, so I've never had to leave the tidyverse!