Channel: Active questions tagged r - Stack Overflow

Very slow or failing xls data extraction in R [functions read.xls, read_xls, xlsx2, etc.]


I have several .xls files (~100 MB each) from which I would like to load one sheet (in fact, just one column of numbers) into R as a data frame. I have tried several functions, such as xlsx::read.xlsx2 and XLConnect::readWorksheetFromFile. Both run for a very long time (>15 min) without finishing, and I have to force-quit RStudio to keep working.

I also tried gdata::read.xls, which does finish and does exactly what I want, but it takes more than 3 minutes per file. That is far too slow for my pipeline, where I need to process many files at once (and I am not even sure the first two functions would ever finish if I let them run longer). Is there a way to make these go faster?
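To put comparable numbers on the readers mentioned above, a small timing wrapper along these lines could be used. This is only a sketch: `time_reader` is a hypothetical helper name, not part of any package, and it simply wraps base R's `system.time` and `tryCatch` so that a failing reader reports an error instead of stopping the comparison.

```r
# Hypothetical helper: time a single reader function on a single file.
# `reader` is any function whose first argument is a file path.
time_reader <- function(reader, path, ...) {
  timing <- system.time(
    # Capture errors as values so one failing reader does not abort the run.
    result <- tryCatch(reader(path, ...), error = function(e) e)
  )
  list(
    elapsed_sec = timing[["elapsed"]],       # wall-clock seconds
    ok          = !inherits(result, "error"), # FALSE if the reader failed
    result      = result                      # data frame, or the error object
  )
}
```

Assuming the gdata package is installed, a call such as `time_reader(gdata::read.xls, "test_file.xls")` would then return the elapsed time together with the data frame or the error.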

In several places, I have seen readxl::read_xls recommended for this task, and I hope it will be faster. This one, however, gives me an error:
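For the goal of loading a single column from a single sheet, a readxl call can be restricted with its `sheet` and `range` arguments. The following is a minimal sketch, assuming the wanted column is column A of the first sheet. Both defaults are placeholders, and `read_one_column` is a hypothetical wrapper name. Whether restricting the range shortens the parse time for a ~100 MB .xls file is not verified here, since it limits what is returned and not necessarily what is parsed.

```r
# Hypothetical wrapper: return one column of one sheet as a tibble.
# Assumptions: sheet 1 and column "A" are placeholders for the real target.
read_one_column <- function(path, sheet = 1, column = "A") {
  if (!requireNamespace("readxl", quietly = TRUE)) {
    stop("Package 'readxl' is required for this sketch.")
  }
  readxl::read_excel(
    path,
    sheet = sheet,
    range = readxl::cell_cols(column), # keep only the requested column
    col_names = TRUE
  )
}
```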

> # Minimal reproducible example:
> setwd("/Users/USER/Desktop")
> library(readxl)
> data <- read_xls(path="test_file.xls")
Error: 
  filepath: /Users/USER/Desktop/test_file.xls
  libxls error: Unable to open file
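Because the message only says that the file could not be opened, it may help to collect a few basic facts about the file before passing it to a reader. The sketch below uses base R only. `describe_file` is a hypothetical helper name, and the checks (existence, size, read permission) are generic candidates to rule out, not a confirmed cause of this error.

```r
# Hypothetical helper: basic facts about a file, using base R only.
describe_file <- function(path) {
  full <- normalizePath(path, mustWork = FALSE) # absolute path, no error if missing
  info <- file.info(full)
  list(
    path     = full,
    exists   = file.exists(full),
    size_mb  = info$size / 1024^2,                      # NA if the file is missing
    readable = unname(file.access(full, mode = 4) == 0) # mode 4 = read permission
  )
}
```

Calling `describe_file("test_file.xls")` from the working directory set above would show, for example, whether the file is empty or not readable by the current user.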

I also did some elementary testing to make sure the file exists and is in the correct format:

> # Testing existence & format of the file
> file.exists("test_file.xls")
[1] TRUE
> format_from_ext("test_file.xls")
[1] "xls"
> format_from_signature("test_file.xls")
[1] "xls"
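As an independent cross-check of the signature test, the first bytes of the file can be compared by hand. Legacy .xls workbooks are stored in the OLE2 compound file container, whose files start with the eight bytes D0 CF 11 E0 A1 B1 1A E1. The following base-R sketch shows that idea. `has_ole2_signature` is a hypothetical helper name, and this is not claimed to be how readxl implements its own check.

```r
# Magic number that starts every OLE2 compound file (the container used by .xls).
ole2_magic <- as.raw(c(0xD0, 0xCF, 0x11, 0xE0, 0xA1, 0xB1, 0x1A, 0xE1))

# Hypothetical helper: TRUE if the file begins with the OLE2 magic number.
has_ole2_signature <- function(path) {
  header <- readBin(path, what = "raw", n = 8L) # read at most the first 8 bytes
  identical(header, ole2_magic)
}
```

A TRUE result only confirms the container type. It does not show that the workbook stream inside is something a given reader can parse.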

The test_file.xls used above is available here. Any advice on making the first functions run faster, or on getting read_xls to run at all, would be appreciated. Thank you!

