Channel: Active questions tagged r - Stack Overflow

rvest: Scraping the text of a website with limited access


I'm currently scraping a news site with rvest. The scraper works, but I only have limited access to the exclusive articles listed on the site. Hence I need a loop that doesn't stop when certain selectors are unavailable.

On top of that, I can't find the right selector to scrape the full article text. Hopefully you can help me with my problem.

library(rvest)
sz_webp <- read_html("https://www.sueddeutsche.de/news?search=Corona&sort=date&all%5B%5D=dep&all%5B%5D=typ&all%5B%5D=sys&all%5B%5D=time")

# TITLE

title <- sz_webp %>% 
  html_nodes("a em") %>%   
  html_text()

df <- data.frame(title)

# TIME

time <- sz_webp %>% 
  html_nodes("div time") %>%   
  html_text() 

df$time <- time

url <- sz_webp %>% 
  html_nodes("a") %>% html_attr('href')

url <- url[which(regexpr('https://www.sueddeutsche.de/', url) >= 1)]
N <- 58
n_url <- tail(url, -N)

n_url <- head(n_url,-17)

View(n_url)

df$url <- n_url

# LOOP THAT DOESN'T WORK (not the right selector, and it aborts when it hits the problem)

results_df <- lapply(n_url, function(u) {
  message(u)

  aktuellerlink <- read_html(u) # reads the respective URL

  aktuellerlink %>% # extracts the paragraph text
    html_nodes("div p") %>%
    html_text() %>%
    paste(collapse = "\n") # one string per article
})

df$text <- unlist(results_df)

View(df)

Thanks a lot in advance.
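
For reference, here is a minimal sketch of the behaviour I'm after: a loop that records NA and moves on when a page can't be read or the selector matches nothing. The helper name scrape_text and the two demo inputs (a literal HTML string and a file that doesn't exist) are made up for illustration, and "div p" is just the selector from my attempt above, not necessarily the right one for the site.

```r
library(rvest)

# Hypothetical helper: returns the text of one page as a single string,
# or NA if the page cannot be read or the selector matches nothing.
scrape_text <- function(u, selector = "div p") {
  tryCatch({
    page <- read_html(u)
    paragraphs <- html_text(html_nodes(page, selector), trim = TRUE)
    if (length(paragraphs) == 0) {
      NA_character_
    } else {
      paste(paragraphs, collapse = "\n")
    }
  }, error = function(e) {
    # Report the failure and carry on with the next input
    message("Skipping: ", conditionMessage(e))
    NA_character_
  })
}

# Demo inputs (assumptions): literal HTML stands in for a readable article,
# a missing file stands in for an article that can't be accessed.
pages <- c(
  "<html><body><div><p>First</p><p>Second</p></div></body></html>",
  "no-such-file.html"
)

# vapply guarantees exactly one string per input, so the result fits a data frame column
texts <- vapply(pages, scrape_text, character(1), USE.NAMES = FALSE)
```

With the real data this would presumably become df$text <- vapply(n_url, scrape_text, character(1), USE.NAMES = FALSE), so restricted articles end up as NA instead of stopping the whole run.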

