I would like to perform topic modeling on a collection of online sermons that I scraped using rvest.
I am cleaning and organizing using pattern matching, especially grepl.
The problem is that grepl fails to match apparently identical strings. The scraped text is a mixture of "unknown" and "UTF-8" encoding. Functions like "Encoding", "enc2native", "enc2utf8", and "iconv" don't seem to help, nor does adjusting grepl arguments like perl = TRUE or useBytes = TRUE. (Not that I fully understand what all of these do.)
There seem to be several posts on this: (1) Troubles with encoding, pattern matching and noisy texts in R (2) https://community.rstudio.com/t/enconding-solution-for-linux-and-windows-10/2055 (3) R on Windows: character encoding hell and others.
With respect to #1, I am working in English and not Swedish so I do not see that changing my locale will help. Nor do I understand what portion of the code credited to Wiktor is fixing the problem in the answer provided by the original poster.
With respect to #2, as you'll see below I have attempted using Encoding() to change but with no success.
I am including #3 as a demonstration that many posts discuss foreign languages, while I'm staying in English. They also discuss difficulty with Windows 10 and encoding in RStudio, if that's relevant.
Here is my attempt at reproducible code. Unfortunately, the error seems to come from my original files and can't be reproduced by copying and pasting the following, as the differing charToRaw results under Edit #1 demonstrate. Per a comment, I added a file on GitHub that reproduces the error when loaded in my session. Per another comment, I am also adding library calls and removing some of the whitespace in the middle of "scrapedtitle", because the Stack Overflow formatting otherwise introduces a newline character in the middle of the "author" variable. At the end of Edit #2 I also tried to recreate the troubled encoding in a form that can be copied and pasted using rawToChar, but I can't coerce the values to "raw". In Edit #3 I discuss the RStudio encoding options and explain that I saved different scraped portions using different encoding settings but, unfortunately, didn't keep track of which settings I used when. I expected that the information would be recoverable and the changes reversible, but that doesn't appear to be the case.
#Library calls
library(topicmodels)
library(LDAvis)
library(tm)
library(dplyr)
library(magrittr)
library(stringr)
#The scraped title of a sermon
scrapedtitle <- "Answers to Prayer\n\t\t\t\t\t\n\t\t\t\t\t\tBrook P. Hales"
#Extract the author from the title
author <- scrapedtitle %>% substr(x=.,start=regexpr("\t[[:alpha:]]", .)+1, stop = nchar(.))
#Elsewhere, identify the author from another scraped list of sermons and authors:
scrapedvector <- c("Answers to Prayer", "Brook P. Hales", "Church Auditing Department Report, 2018", "Russell M. Nelson", "By Elder Brook P. Hales")
#attempted grepl:
which(grepl(author, scrapedvector)) # only returns 2 when it should return 2 and 5
#Exploring:
typed <-"By Elder Brook P. Hales" #This is typed in from my keyboard
typed == scrapedvector[5] # FALSE unexpectedly
grepl(author, typed) #TRUE as you'd expect
grepl(author, scrapedvector[5]) # FALSE unexpectedly
#Checking encoding
Encoding(scrapedvector) # [1] "unknown" "unknown" "unknown" "unknown" "UTF-8"
Encoding(typed) #[1] "unknown"
Encoding(author) #[1] "unknown"
#Attempting to change the encoding:
Encoding(scrapedvector) <- "UTF-8"
Encoding(scrapedvector) # [1] "unknown" "unknown" "unknown" "unknown" "UTF-8" # No change
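A possible explanation for the "No change" result, as far as I can tell from R's documentation for Encoding(): strings made up only of ASCII characters are never marked with a declared encoding, so they always report "unknown" no matter what is assigned. The sketch below illustrates this with two made-up strings (they are not from the scraped data), and `has_non_ascii` is just an illustrative helper name.

```r
# Hypothetical example strings, not taken from the scraped data
x <- c("plain ascii", "caf\u00e9")   # the second element contains a non-ASCII character

Encoding(x)                          # inspect the marks before any assignment

# Declaring UTF-8 only sticks for elements that contain non-ASCII bytes
Encoding(x) <- "UTF-8"
enc_after <- Encoding(x)
enc_after

# Flag elements that contain at least one byte above 0x7f (i.e. non-ASCII)
has_non_ascii <- vapply(
  x,
  function(s) any(as.integer(charToRaw(s)) > 127L),
  logical(1),
  USE.NAMES = FALSE
)
has_non_ascii
```

If this reading is right, an element marked "UTF-8" in the scraped vector is a hint that it contains a non-ASCII character somewhere, even if it looks like plain English on screen.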
Edit #1:
# Adding charToRaw information:
charToRaw(typed)
# [1] 42 79 20 45 6c 64 65 72 20 42 72 6f 6f 6b 20 50 2e 20 48 61 6c 65 73
charToRaw(scrapedvector[5])
# [1] 42 79 20 45 6c 64 65 72 20 42 72 6f 6f 6b c2 a0 50 2e 20 48 61 6c 65 73
# The scraped version has "c2 a0" at bytes 15-16 where the typed version has "20".
# Results from pasting the vector back into R from this stackoverflow post:
repastedvector <- c("Answers to Prayer", "Brook P. Hales", "Church Auditing Department Report, 2018", "Russell M. Nelson", "By Elder Brook P. Hales")
charToRaw(repastedvector[5])
# [1] 42 79 20 45 6c 64 65 72 20 42 72 6f 6f 6b 20 50 2e 20 48 61 6c 65 73
# The repasted string is identical to what I typed, but not to what I saved after scraping.
# Posting this because it is mentioned in other posts
Sys.getlocale()
# [1] "LC_COLLATE=English_United States.1252;LC_CTYPE=English_United States.1252;LC_MONETARY=English_United States.1252;LC_NUMERIC=C;LC_TIME=English_United States.1252"
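The bytes c2 a0 are the UTF-8 encoding of U+00A0, the non-breaking space, which displays like an ordinary space (byte 20) but does not compare equal to one. A minimal sketch, assuming that this is the only difference between the strings: rebuild the scraped string from the bytes reported by charToRaw above, swap the non-breaking space for a regular space, and retry the match. The names `scraped_bytes`, `scraped`, `cleaned`, `before`, and `after` are made up for illustration.

```r
# Rebuild the scraped string from the bytes reported by charToRaw() above
scraped_bytes <- as.raw(c(0x42, 0x79, 0x20, 0x45, 0x6c, 0x64, 0x65, 0x72, 0x20,
                          0x42, 0x72, 0x6f, 0x6f, 0x6b, 0xc2, 0xa0,
                          0x50, 0x2e, 0x20, 0x48, 0x61, 0x6c, 0x65, 0x73))
scraped <- rawToChar(scraped_bytes)
Encoding(scraped) <- "UTF-8"                     # declare the bytes as UTF-8

author <- "Brook P. Hales"                       # typed with ordinary spaces
before <- grepl(author, scraped, fixed = TRUE)   # match attempt on the raw scrape

# Replace U+00A0 (non-breaking space) with an ordinary space
cleaned <- gsub("\u00a0", " ", scraped, fixed = TRUE)
after <- grepl(author, cleaned, fixed = TRUE)    # match attempt after cleaning

charToRaw(cleaned)   # compare with charToRaw("By Elder Brook P. Hales")
```

The same gsub call could be applied to the whole scraped vector before any pattern matching, on the assumption that non-breaking spaces are the only invisible difference in the data.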
Edit #2:
An example of the file is available on Github:
https://github.com/baprisbrey/stackoverflow/releases/tag/vA0
The file is scrapedTalk2.rds.
This is what I see when I load this file into my RStudio session:
scrapedTalk <- readRDS("scrapedTalk2.rds")
grepl(author, scrapedTalk) %>% which() # Result is 8. It should be 8 and 73
scrapedvector2 <- scrapedTalk[c(7,8,18,72,73)] # This is the same as the scrapedvector from above
Encoding(scrapedTalk)
# [1] "unknown" "unknown" "unknown" "unknown" "unknown" "unknown" "unknown" "unknown" "unknown" "unknown" "unknown"
# [12] "unknown" "unknown" "unknown" "unknown" "unknown" "unknown" "unknown" "unknown" "unknown" "unknown" "unknown"
# [23] "unknown" "unknown" "unknown" "unknown" "unknown" "unknown" "unknown" "unknown" "unknown" "unknown" "unknown"
# [34] "unknown" "unknown" "unknown" "unknown" "unknown" "unknown" "unknown" "unknown" "unknown" "unknown" "unknown"
# [45] "unknown" "unknown" "unknown" "unknown" "unknown" "UTF-8" "unknown" "UTF-8" "unknown" "unknown" "unknown"
# [56] "UTF-8" "unknown" "unknown" "unknown" "unknown" "unknown" "unknown" "unknown" "unknown" "unknown" "unknown"
# [67] "unknown" "unknown" "unknown" "unknown" "unknown" "unknown" "UTF-8" "unknown" "unknown" "UTF-8" "UTF-8"
# [78] "UTF-8" "UTF-8" "unknown" "UTF-8" "unknown" "UTF-8" "UTF-8" "unknown" "unknown" "UTF-8" "UTF-8"
# [89] "UTF-8" "UTF-8" "UTF-8" "UTF-8" "UTF-8" "UTF-8" "UTF-8" "unknown" "UTF-8" "unknown" "UTF-8"
# [100] "UTF-8" "UTF-8" "UTF-8" "UTF-8" "UTF-8" "UTF-8" "UTF-8" "UTF-8" "unknown" "UTF-8" "unknown"
# [111] "unknown" "UTF-8" "UTF-8" "unknown" "unknown" "unknown" "unknown" "UTF-8" "UTF-8" "UTF-8" "UTF-8"
# [122] "UTF-8" "UTF-8" "UTF-8" "unknown" "UTF-8" "UTF-8" "UTF-8" "unknown" "unknown" "unknown" "unknown"
# [133] "UTF-8" "UTF-8" "UTF-8" "UTF-8" "UTF-8" "UTF-8" "UTF-8" "unknown" "UTF-8"
scrapedTalk[73] == "By Elder Brook P. Hales" # FALSE, which is unexpected.
charToRaw(scrapedTalk[73]) # for reference
# [1] 42 79 20 45 6c 64 65 72 20 42 72 6f 6f 6b c2 a0 50 2e 20 48 61 6c 65 73
# Can I create the troubled encoding by pasting the charToRaw result above?
# Note: There may be an unintentional newline "\n" character introduced in there due to the length of the string and the Stack Overflow formatting. It should be removed.
troubleString <- "42 79 20 45 6c 64 65 72 20 42 72 6f 6f 6b c2 a0 50 2e 20 48 61 6c 65 73" %>%
strsplit(. ,split="") %>% # so far so good
unlist %>% # no troubles
as.raw %>% # NA's and 0's introduced
rawToChar # failure!
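My best guess at why the attempt above fails: split = "" breaks the string into single characters (including the spaces), and as.raw() does not parse hexadecimal text. A sketch of one approach that should work, assuming the pasted string has no stray newline in it: split on spaces, parse each token as base-16 with strtoi(), and only then convert to raw. The names `hex_string`, `hex_tokens`, `byte_values`, and `rebuilt` are illustrative.

```r
hex_string <- "42 79 20 45 6c 64 65 72 20 42 72 6f 6f 6b c2 a0 50 2e 20 48 61 6c 65 73"

hex_tokens  <- strsplit(hex_string, " ", fixed = TRUE)[[1]]  # one token per byte
byte_values <- strtoi(hex_tokens, base = 16L)                # hex text -> integers
rebuilt     <- rawToChar(as.raw(byte_values))                # integers -> raw -> string
Encoding(rebuilt) <- "UTF-8"                                  # mark the bytes as UTF-8

# Compare against the typed version, as done for scrapedTalk[73] above
rebuilt == "By Elder Brook P. Hales"
charToRaw(rebuilt)
```

This should give a string that can be pasted into another session and that carries the same bytes as scrapedTalk[73].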
Edit #3:
Because the problem appears to be encoding, I am including a discussion of RStudio's encoding options. The menu under File >> Save with Encoding offers multiple encodings, and I do not know what the differences between them are. My first question is why Encoding() doesn't report all of these options; presumably the "unknown" bucket covers most of them. Second, because of my encoding difficulties, I toggled between these options, and it is highly likely that some of the scraped material was saved with one of the other encodings. Unfortunately, I don't recall which ones I used with which portions of the scraped material, and I recognize the ambiguity this introduces into the problem. I would like to know why I can't recover the proper encoding or convert to another encoding, but mostly why I can't get grepl to work.