This is a follow-up to my earlier question about how to locate and compare a particular attribute between two text files.
Thanks to the accepted answer, I put together the code below, which works (on my system) on smaller data sets:
library(tidyverse)   # for select(), full_join(), filter()
library(data.table)  # for setDT()

con1 <- file("file1.csv", open = "r")
con2 <- file("file2.csv", open = "r")

file1 <- select(read.csv(con1, sep = "|", fill = FALSE, colClasses = "character"),
                PROFILE.ID, USERID)
setDT(file1)

file2 <- select(read.csv(con2, sep = "|", fill = FALSE, colClasses = "character"),
                PROFILE.ID, USERID)
setDT(file2)

full_join(file1, file2, by = "USERID") %>%
  filter(is.na(PROFILE.ID.x) | is.na(PROFILE.ID.y) |
         PROFILE.ID.x != PROFILE.ID.y)

close(con1)
close(con2)
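(For context, one variant I have considered is avoiding read.csv() entirely and letting data.table::fread() read only the two needed columns from disk via its select argument, so the other ~81 columns never enter memory. This is a sketch of that idea, assuming the same "|"-delimited layout; merge() with all = TRUE is data.table's equivalent of a full join:)

```r
library(data.table)

# fread()'s `select` argument parses only the named columns,
# so the unneeded columns are never materialised in memory.
file1 <- fread("file1.csv", sep = "|",
               select = c("PROFILE.ID", "USERID"),
               colClasses = "character")
file2 <- fread("file2.csv", sep = "|",
               select = c("PROFILE.ID", "USERID"),
               colClasses = "character")

# Full outer join, then keep rows where the profile IDs
# disagree or are missing on either side.
res <- merge(file1, file2, by = "USERID", all = TRUE,
             suffixes = c(".x", ".y"))
res <- res[is.na(PROFILE.ID.x) | is.na(PROFILE.ID.y) |
           PROFILE.ID.x != PROFILE.ID.y]
```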
The problem: When R starts processing the full_join() call, it eventually stops with the error "cannot allocate vector of size 557.6 Mb".
The environment: This is 64-bit R 3.6.2 on Windows 10, and memory.limit() returns 16222 (i.e., roughly 16 GB). I don't have any other objects loaded in R except what the code above creates.
Probable cause: The problem probably stems from the fact that the two CSV files each have about 120K rows and 83 columns.
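(One thing I am not sure about: a join can also blow up when the key is not unique, since a USERID appearing m times in file1 and n times in file2 produces m × n output rows. A quick diagnostic I could run, assuming file1 and file2 are loaded as above:)

```r
# Non-zero counts here mean the join multiplies those rows,
# which could explain the large intermediate allocation.
sum(duplicated(file1$USERID))
sum(duplicated(file2$USERID))

# Inspect the ten most frequent key values in file1.
head(sort(table(file1$USERID), decreasing = TRUE), 10)
```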
What I have tried so far, without resolving the issue:
- Used select() to drop the unneeded columns.
- Used data.table and setDT() to convert the data frames to data tables.
- Closed all apps with a visible UI (Outlook, Google Chrome, Excel, etc.).
Regardless of what I try, the error always refers to "557.6 Mb". I cannot add more RAM to this machine at the moment, as it is a company laptop.
Question: Is there a way to load the files in chunks, or some other way of rewriting the code, to get around the error?
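(To illustrate the kind of chunked loading I have in mind: read.csv() can pull a fixed number of rows at a time from an open connection, so only one chunk plus the two accumulated key columns are ever in memory at once. This is only a rough sketch; the chunk size and header handling are assumptions I would need to adapt:)

```r
# Read a "|"-delimited file in chunks, keeping only the columns in `cols`.
read_in_chunks <- function(path, cols, chunk_rows = 10000L) {
  con <- file(path, open = "r")
  on.exit(close(con))

  # The first chunk supplies the header; reuse its names afterwards.
  first <- read.csv(con, sep = "|", nrows = chunk_rows,
                    colClasses = "character")
  pieces <- list(first[, cols])

  repeat {
    # Subsequent reads continue from where the connection left off.
    chunk <- tryCatch(
      read.csv(con, sep = "|", nrows = chunk_rows,
               header = FALSE, col.names = names(first),
               colClasses = "character"),
      error = function(e) NULL  # end of file raises an error
    )
    if (is.null(chunk) || nrow(chunk) == 0) break
    pieces[[length(pieces) + 1]] <- chunk[, cols]
  }
  do.call(rbind, pieces)
}

file1 <- read_in_chunks("file1.csv", c("PROFILE.ID", "USERID"))
file2 <- read_in_chunks("file2.csv", c("PROFILE.ID", "USERID"))
```

Would something along these lines work, or is there a better-established pattern for this?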