I have multiple .xls files (~100 MB each) from which I would like to load one sheet (in fact, just one column of numbers) into R as a data frame. I have tried various functions, such as xlsx::read.xlsx2 and XLConnect::readWorksheetFromFile, both of which run for a very long time (>15 min) and never finish, so I have to force-quit RStudio to keep working.
I also tried gdata::read.xls, which does finish and does exactly what I want, but it takes more than 3 minutes per file. That is far too slow for my pipeline, where I need to work with many files at once (and I am not even sure the first two functions would ever finish if I let them run longer). Is there a way to make any of these go faster?
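For context, this is roughly how I am calling gdata::read.xls over the batch; the folder name, sheet index, and column index below are illustrative, not my real values:

```r
library(gdata)  # read.xls converts each .xls via Perl, then reads the result

# Hypothetical folder of input files; at ~3 min per file this adds up fast
files <- list.files("xls_input", pattern = "\\.xls$", full.names = TRUE)

# Read the first sheet of each file and keep only the one numeric column
cols <- lapply(files, function(f) {
  df <- read.xls(f, sheet = 1, header = TRUE)
  df[[1]]  # the column of numbers I need (index is illustrative)
})
```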
In several places I have seen readxl::read_xls recommended for exactly this task, and it is supposed to be faster (which is what I am hoping for anyway). This one, however, gives me an error:
> # Minimal reproducible example:
> setwd("/Users/USER/Desktop")
> library(readxl)
> data <- read_xls(path="test_file.xls")
Error:
filepath: /Users/USER/Desktop/test_file.xls
libxls error: Unable to open file
I also did some elementary testing to make sure the file exists and is in the correct format:
> # Testing existence & format of the file
> file.exists("test_file.xls")
[1] TRUE
> readxl:::format_from_ext("test_file.xls")
[1] "xls"
> readxl:::format_from_signature("test_file.xls")
[1] "xls"
The test_file.xls used above is available here.
Any advice on making the first functions run faster, or on getting read_xls to run at all, would be appreciated. Thank you!