I'm trying to gather statistics on more than 3,600 Wikipedia pages for work, and I would like to automate this with web scraping in R.
I can't download the HTML directly from R. This is the call I'm using:
download_html("xtools.wmflabs.org/articleinfo/fr.wikipedia.org/1re_Convention_nationale_acadienne")
And this is what the console tells me:
download_html("xtools.wmflabs.org/articleinfo/fr.wikipedia.org/1re_Convention_nationale_acadienne")
Error in curl::curl_download(url, file, quiet = quiet, mode = mode, handle = handle) : HTTP error 403.
What could be causing the 403 error?
When I save the page as an HTML file and read that file into R, everything works and I can build a data frame from the results:
library(rvest)
# Identify the web page first
setwd("C:\\Users\\judit\\Scraping dans R")
webpage <- read_html("HTML_1e.html")
# read_html("https://xtools.wmflabs.org/articleinfo/fr.wikipedia.org/1re_Convention_nationale_acadienne?uselang=fr")
# Statistics: extraction ----
# Stats: title
titre <- html_nodes(webpage, ".back-to-search+ a")
titre <- html_text(titre, trim=TRUE)
# Stats: page size
taille <- html_nodes(webpage, ".col-lg-5 tr:nth-child(3) td+ td")
taille <- html_text(taille, trim=TRUE)
# Stats: total number of edits
mod <- html_nodes(webpage, ".col-lg-5 tr:nth-child(4) td+ td")
mod <- html_text(mod, trim=TRUE)
# Stats: number of editors
red <- html_nodes(webpage, ".col-lg-5 tr:nth-child(5) td+ td")
red <- html_text(red, trim=TRUE)
# Stats: assessment
evaluation <- html_nodes(webpage, ".col-lg-5 tr:nth-child(6) td+ td")
evaluation <- html_text(evaluation, trim=TRUE)
# Stats: links to this page
liens_vers <- html_nodes(webpage, ".stat-list--group tr:nth-child(2) a")
liens_vers <- html_text(liens_vers, trim=TRUE)
# Stats: links from this page
liens_depuis <- html_nodes(webpage, ".col-lg-offset-1 .stat-list--group tr:nth-child(4) td+ td")
liens_depuis <- html_text(liens_depuis, trim=TRUE)
# Stats: words
mots <- html_nodes(webpage, ".col-lg-3 tr:nth-child(3) td+ td")
mots <- html_text(mots, trim=TRUE)
# Combine everything into one data frame (mots was extracted above but left out originally)
wikipedia <- data.frame(titre, taille, red, mod, evaluation, liens_vers, liens_depuis, mots)
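For context, here is roughly how I plan to repeat the extraction for every page once the download works. It is only a sketch under a few assumptions: `extract_one` and `extract_stats` are helper names I made up, the inline HTML is a stand-in for real pages, and I use `html_node` (singular) so that a field missing from a page should come back as NA instead of breaking the data frame.

```r
library(rvest)

# Hypothetical helper: trimmed text of the first node matching `css`,
# or NA when nothing matches (html_node returns a missing node in that case).
extract_one <- function(page, css) {
  html_text(html_node(page, css), trim = TRUE)
}

# Hypothetical helper: all statistics of one parsed page as a one-row data frame.
# The selectors are the same ones used in the code above.
extract_stats <- function(page) {
  data.frame(
    titre        = extract_one(page, ".back-to-search+ a"),
    taille       = extract_one(page, ".col-lg-5 tr:nth-child(3) td+ td"),
    mod          = extract_one(page, ".col-lg-5 tr:nth-child(4) td+ td"),
    red          = extract_one(page, ".col-lg-5 tr:nth-child(5) td+ td"),
    evaluation   = extract_one(page, ".col-lg-5 tr:nth-child(6) td+ td"),
    liens_vers   = extract_one(page, ".stat-list--group tr:nth-child(2) a"),
    liens_depuis = extract_one(page, ".col-lg-offset-1 .stat-list--group tr:nth-child(4) td+ td"),
    mots         = extract_one(page, ".col-lg-3 tr:nth-child(3) td+ td"),
    stringsAsFactors = FALSE
  )
}

# Stand-in for real pages: tiny inline documents that only contain a title.
pages <- list(
  read_html('<div><a class="back-to-search">Retour</a><a>Page A</a></div>'),
  read_html('<div><a class="back-to-search">Retour</a><a>Page B</a></div>')
)

# One row per page, stacked into a single data frame.
wikipedia_demo <- do.call(rbind, lapply(pages, extract_stats))
```

With real data, `pages` would be replaced by the saved HTML files (or the downloaded pages, if the 403 can be solved), each read with `read_html`.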
Any advice is greatly appreciated! PS: Pardon my French in the code. It's my first language.