I am trying to get the links to PDFs from a site in R, but the rvest read_html() function just sits there, seemingly making no progress.
Here is my code:
# Load required libraries
library(tidyverse)
library(rvest)

# Define the URL
url <- "https://providers.anthem.com/new-york-provider/claims/reimbursement-policies/"

# Read and process the HTML
links <- try({
  read_html(url) %>%
    html_node(xpath = "/html/body/main/div/div/div/section[3]/div/section/div[1]/section/div[1]/div/div[2]/div/p/a") %>%
    html_attr("href") %>%
    as_tibble() %>%
    rename(url = value)
})

# Display the results with error handling
if (!inherits(links, "try-error")) {
  print(links)
} else {
  message("Unable to scrape the URL. This might be due to:")
  message("- Website requires authentication")
  message("- Website blocks automated scraping")
  message("- The XPath structure has changed")
  message("- Network connectivity issues")
}
Maybe I should do this via httr2?
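For what it's worth, this is roughly what I had in mind with httr2 (untested, since I can't get past the connection error; the browser-style User-Agent string and the CSS selector for PDF links are just my guesses, not anything from the site):

library(httr2)
library(rvest)

url <- "https://providers.anthem.com/new-york-provider/claims/reimbursement-policies/"

# Send a request that looks more like a browser, with a timeout so it
# fails fast instead of hanging (User-Agent value is just an example)
resp <- request(url) %>%
  req_user_agent("Mozilla/5.0 (Windows NT 10.0; Win64; x64)") %>%
  req_timeout(30) %>%
  req_perform()

# Parse the response body as HTML and pull every link ending in .pdf
pdf_links <- resp %>%
  resp_body_html() %>%
  html_elements("a[href$='.pdf']") %>%
  html_attr("href")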
Here is the error message I get from xml2::read_html():
> read_html(url)
Error in `open.connection()`:
! cannot open the connection
Traceback:
▆
 1. ├─xml2::read_html(url)
 2. └─xml2:::read_html.default(url)
 3.   ├─base::suppressWarnings(...)
 4.   │ └─base::withCallingHandlers(...)
 5.   ├─xml2::read_xml(x, encoding = encoding, ..., as_html = TRUE, options = options)
 6.   └─xml2:::read_xml.character(...)
 7.     └─xml2:::read_xml.connection(...)
 8.       ├─base::open(x, "rb")
 9.       └─base::open.connection(x, "rb")