Channel: Active questions tagged r - Stack Overflow

R encoding problem while web scraping - how to fix broken text?

While web scraping, some of the retrieved text came out broken, much like foreign-language text read with the wrong encoding. The problem is that the encoding appears to be correct: "UTF-8". Is there any way to fix the text, even though it is supposedly already in the correct encoding? The chunk of code below reproduces the problem. RStudio is configured with "UTF-8" encoding, and functions that change the encoding always return even more gibberish. Thank you all in advance.

library(rvest)

url <- "https://www1.folha.uol.com.br/poder/2020/01/folhas-da-manha-da-tarde-e-da-noite-se-uniram-sob-um-so-titulo-folha-de-spaulo-ha-60-anos.shtml"

# Read the page, walk down to the headline node, and extract its text
title.news <- read_html(url) %>%
    html_nodes('body') %>%
    html_nodes('main') %>%
    html_nodes('article') %>%
    html_nodes('.block') %>%
    html_nodes('h1') %>%
    html_text()

# Collapse runs of whitespace into a single space, then trim both ends
title.news <- trimws(gsub(pattern = '\\s+', replacement = ' ', x = title.news))

Encoding(title.news)
[1] "UTF-8"

title.news
[1] "Folhas da Manhã, da Tarde e da Noite se uniram sob um só título, Folha de S.Paulo, há 60 anos"

#Desired Output: Folhas da Manhã, da Tarde e da Noite se uniram sob um só título, Folha de S.Paulo, há 60 anos
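
For what it's worth, one common cause of text that is labelled "UTF-8" yet still looks broken is double encoding: the UTF-8 bytes were at some point read as Latin-1 and then re-encoded as UTF-8. Whether that is what happened on this page is an assumption, not something the output above confirms. The sketch below uses a made-up sample string (`good`, `broken`, and `fixed` are hypothetical names) and base R's `iconv` to simulate that damage and then reverse it.

```r
# Hypothetical sample text; \u escapes keep the script itself ASCII-only.
good <- "Manh\u00e3, s\u00f3 t\u00edtulo, h\u00e1 60 anos"

# Simulate the damage (assumed cause): reinterpret the UTF-8 bytes as Latin-1
# and convert them to UTF-8 again. The result is still marked "UTF-8".
broken <- iconv(good, from = "latin1", to = "UTF-8")
Encoding(broken)

# Reverse it: convert back to Latin-1 to recover the original bytes,
# then relabel those bytes as the UTF-8 they always were.
fixed <- iconv(broken, from = "UTF-8", to = "latin1")
Encoding(fixed) <- "UTF-8"

# TRUE when the round trip restored the original string
identical(fixed, good)
```

If the damaged text contains characters that do not exist in Latin-1 (curly quotes, for instance, often point to Windows-1252 rather than Latin-1), `iconv` returns `NA` for that element, so this round trip is something to try as a diagnostic, not a guaranteed fix.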
