Channel: Active questions tagged r - Stack Overflow

Equivalent of a while-loop with purrr::map


I built a web scraper function and want to run it over multiple pages. I only want data from a specific period, and scraping takes some effort because the site can become quite slow at times. I did several test runs beforehand, and it does not look like the scraping process itself is the reason for the site's behaviour.

The scraping function:

library(tidyverse)
library(rvest)

# Regex filter to reduce the number of scraped reports from each page
target_regex <- "(xtm)|((k|K)(i|I|1|11)(d|D)(n|N).)|(Ar<e)\\s(you)\\s(in)|(LOAN)|(AR(\\s|\\S)[0-9])|((B|b)(i|1|l)tc.)|(Coupon)|(Plastic.King)|(organs)|(SILI)|(Electric.Cigarette.Machine)"
# CSS selectors for title, paid amount, date, location and department
all.nodes <- c(".heading-3 a", ".paid-amount span", ".date", ".location", ".transaction a")
# TODO: add a mutate and an if condition to stop scraping once 2018 is reached

scraper_info <- function(pages){
  bribe <- read_html(paste("http://ipaidabribe.com/reports/paid?page", pages, sep = "="))
  map(all.nodes, ~ html_nodes(bribe, .x) %>%
        html_text()) %>%
    as_tibble(.name_repair = "unique") %>%
    filter(str_detect(...1, target_regex, negate = TRUE)) %>%
    mutate(Report =(
            report <- html_nodes(bribe, ".read-more") %>% 
             html_attr("href") %>% 
             as_tibble(.name_repair = "unique") %>% 
             filter(str_detect(value, target_regex, negate = TRUE)) %>% 
             mutate(text = map_chr(value, ~read_html(.x) %>%  
                              html_node(".body-copy-lg") %>% 
                              html_text)))$text)
}


pages <- seq(10, 20, by = 10)
bribe <- map_df(pages, ~scraper_info(.x)) %>%
  rename("Title" = ...1, "Bribe" = ...2, "Date" = ...3,
         "Location" = ...4, "Department" = ...5)

I want to stop the map as soon as the Date (column ...3 in the tibble) is no longer in 2019. Is there anything like a map_while that does this?
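To make the desired behaviour concrete, here is a minimal sketch of what such a map_while could look like, written as a plain loop. The helper `map_while_df`, the stand-in `fake_scraper`, and the predicate `all_2019` are hypothetical names made up for illustration; they are not part of purrr. The sketch also assumes that the date text contains the four-digit year, which may not match the site's actual format.

```r
# Hypothetical helper (not part of purrr): apply .f to each element of .x in
# order and stop as soon as .keep_going(result) returns FALSE.
map_while_df <- function(.x, .f, .keep_going) {
  results <- list()
  for (x in .x) {
    res <- .f(x)
    results[[length(results) + 1]] <- res
    if (!.keep_going(res)) break
  }
  do.call(rbind, results)
}

# Stand-in for scraper_info() so the sketch runs offline (assumption):
# pages 10 and 20 hold 2019 reports, from page 30 on the reports are from 2018.
fake_scraper <- function(page) {
  year <- if (page < 30) "2019" else "2018"
  data.frame(page = page, Date = paste("October 1,", year),
             stringsAsFactors = FALSE)
}

# Keep going while every date on the current page still falls in 2019.
all_2019 <- function(df) all(grepl("2019", df$Date))

pages <- seq(10, 50, by = 10)
bribe <- map_while_df(pages, fake_scraper, all_2019)

# The page that triggered the stop is still in the result,
# so drop any rows that fall outside 2019.
bribe <- bribe[grepl("2019", bribe$Date), ]
```

With the real scraper, `fake_scraper` would be replaced by `scraper_info` and the predicate would check column `...3`. Pages 40 and 50 are never requested here, which is the point: the loop stops hitting the slow site once the target period is over.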

