I'm currently web scraping a news site using rvest. The scraper works, but the site only grants limited access to the exclusive (paywalled) articles listed there. I therefore need a loop that does not abort when a given selector is unavailable on a page.
On top of that, I can't find the right selector to scrape the full article text. Hopefully you can help me with my problem.
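For context, the usual way to keep a loop running past a failing page is to wrap the fragile step in tryCatch() and return NA on error. A minimal sketch (safe_scrape and the fake URLs are illustrative names only, not part of the original code):

```r
# Wrap a fragile operation so a failure yields NA instead of stopping the loop
safe_scrape <- function(u) {
  tryCatch(
    {
      if (!startsWith(u, "https://")) stop("not a valid URL")  # stand-in for read_html(u)
      paste("scraped:", u)
    },
    error = function(e) {
      message("Skipping ", u, ": ", conditionMessage(e))
      NA_character_  # placeholder value; the loop continues
    }
  )
}

safe_scrape("https://example.com/article")  # returns "scraped: https://example.com/article"
safe_scrape("broken-link")                  # returns NA, loop keeps going
```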
library(rvest)
# read the search results page
sz_webp <- read_html("https://www.sueddeutsche.de/news?search=Corona&sort=date&all%5B%5D=dep&all%5B%5D=typ&all%5B%5D=sys&all%5B%5D=time")
# TITLE
title <- sz_webp %>%
  html_nodes("a em") %>%
  html_text()
df <- data.frame(title)
# TIME
time <- sz_webp %>%
  html_nodes("div time") %>%
  html_text()
df$time <- time
url <- sz_webp %>%
  html_nodes("a") %>%
  html_attr("href")
# keep only links into the site itself; fixed = TRUE treats the dots literally
url <- url[grepl("https://www.sueddeutsche.de/", url, fixed = TRUE)]
# Drop the first 58 links (navigation) and the last 17 (footer).
# These positional offsets are fragile and will break whenever the
# page layout changes.
N <- 58
n_url <- tail(url, -N)
n_url <- head(n_url, -17)
View(n_url)
df$url <- n_url
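As an aside, positional trimming like the above is brittle. If the article links share a recognizable URL pattern, filtering by that pattern is more robust. A sketch with made-up URLs (the section-path structure is an assumption about the links, not verified against the site):

```r
# Keep only links that look like article pages, instead of dropping
# fixed numbers of links from the head and tail of the vector
urls <- c(
  "https://www.sueddeutsche.de/",                      # homepage link
  "https://www.sueddeutsche.de/politik/some-article",  # article
  "https://www.sueddeutsche.de/impressum"              # footer link
)

# Articles on many news sites live under a section path: domain/section/slug
article_urls <- urls[grepl("^https://www\\.sueddeutsche\\.de/[a-z]+/.+", urls)]
article_urls  # only the "/politik/some-article" link survives
```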
# LOOP, fixed: tryCatch() skips pages that cannot be read (e.g. paywalled
# articles) instead of aborting, and paste(collapse) joins all paragraphs
# into a single string per article.
# Note: "div p" matches every paragraph on the page, including navigation
# and footer text; a selector scoped to the article-body container would
# be cleaner, but the exact class depends on the site's markup.
results <- vapply(n_url, function(u) {
  message(u)
  tryCatch({
    aktuellerlink <- read_html(u)        # read the current URL
    aktuellerlink %>%                    # extract the article text
      html_nodes("div p") %>%
      html_text() %>%
      paste(collapse = " ")
  }, error = function(e) NA_character_)  # on failure, record NA and move on
}, character(1))
df$text <- unname(results)
View(df)
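To narrow down the right selector for the article text, it can help to test candidate selectors against a small HTML snippet before pointing them at the live site. A sketch using rvest's minimal_html() (the "article-body" class here is made up for illustration, not SZ's real markup):

```r
library(rvest)

# A tiny fake page: one article container plus a footer
snippet <- minimal_html('
  <div class="article-body"><p>First paragraph.</p><p>Second one.</p></div>
  <div class="footer"><p>Imprint</p></div>
')

# "div p" would also grab the footer text; scoping the selector to the
# article container avoids that
txt <- snippet %>%
  html_nodes("div.article-body p") %>%
  html_text()
txt  # c("First paragraph.", "Second one.")
```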
Thanks a lot in advance.