Channel: Active questions tagged r - Stack Overflow

Scraping pages with inconsistent lengths in dataframe


I want to scrape all the names from this page and collect them in a single tibble with three columns. My code only works when every entry on a page has all of its data; when something is missing, I get this error:

 Error: Tibble columns must have consistent lengths, only values of length one are recycled:
* Length 20: Columns `huisarts`, `url`
* Length 21: Column `praktijk`
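
The error can be reproduced without any scraping, which may help to see what tibble is complaining about. A minimal sketch, assuming only the tibble package; the column name `bron` is made up for illustration:

```r
library(tibble)

# Columns of length 20 and 21 cannot be combined into one tibble,
# so tibble() throws an error; tryCatch() captures it here.
result <- tryCatch(
  tibble(huisarts = character(20), praktijk = character(21)),
  error = function(e) "error"
)

# Only length-one values are recycled to match the other columns.
ok <- tibble(huisarts = character(20), bron = "zorgkaart")
```

So any column that comes back one element longer or shorter than the others is enough to trigger the error.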

How can I let my code run and fill the tibble with NAs where the data is missing?

My code for a pausing robot, which is used later in the scraper function:

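# Sleep for a random number of seconds between periods[1] and periods[2]
pauzing_robot <- function(periods = c(0, 1)) {
  tictoc <- runif(1, periods[1], periods[2])
  # cat() already separates its arguments with a space
  cat(format(Sys.time()), "- Sleeping for", round(tictoc, 2), "seconds\n")
  Sys.sleep(tictoc)
}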

Scraper:

library(tidyverse)
library(rvest)

scrape_page <- function(pagina_nummer) {

  page <- read_html(paste0("https://www.zorgkaartnederland.nl/huisarts/pagina", pagina_nummer)) 

  pauzing_robot(periods = c(0, 1.5))

  tibble(

    huisarts = page %>% 
      html_nodes(".media-heading.title.orange") %>% 
      html_text() %>% 
      str_trim(), 

    praktijk = page %>% 
      html_nodes(".location") %>% 
      html_text() %>%
      str_trim(),

    url = page %>% 
      html_nodes(".media-heading.title.orange") %>% 
      html_nodes("a") %>%
      html_attr("href") %>% 
      str_trim() %>% 
      paste0("https://www.zorgkaartnederland.nl", .)
  )
}

There are 445 pages in total, but for the sake of the example I only scrape three:

huisartsen <- map_df(sample(1:3), scrape_page)

Page 2 seems to be the one causing the inconsistent lengths, because this code, which skips it, works:

huisartsen <- map_df(3:4, scrape_page)

If possible, I would like a solution in tidyverse code. Thanks in advance.
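
One direction that might work is sketched below. It is not tested against the real site. It first selects one node per result and then uses `html_node()` (singular) inside each result, which returns exactly one value per result and NA where the element is missing. The inline HTML, the `.media` and `.title` selectors, and the hrefs are assumptions made up for the example; the real page structure would need to be checked.

```r
library(rvest)
library(tibble)

# Hypothetical markup: two result cards, the second has no .location element.
html <- '
<ul>
  <li class="media">
    <h4 class="title"><a href="/zorgverlener/a">Huisarts A</a></h4>
    <div class="location">Praktijk A</div>
  </li>
  <li class="media">
    <h4 class="title"><a href="/zorgverlener/b">Huisarts B</a></h4>
  </li>
</ul>'

page <- read_html(html)

# One node per result card (".media" is an assumed selector).
items <- html_nodes(page, ".media")

# html_node() returns one entry per card; a missing element becomes NA
# after html_text() or html_attr(), so all columns keep the same length.
result <- tibble(
  huisarts = items %>% html_node(".title") %>% html_text() %>% trimws(),
  praktijk = items %>% html_node(".location") %>% html_text() %>% trimws(),
  url      = items %>% html_node(".title a") %>% html_attr("href")
)
```

Note that prefixing the URL with `paste0()` would turn a missing href into a string ending in "NA", so that step would need its own check for NA.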

