Pattern matching is failing due to apparent encoding problems with scraped text

I would like to perform topic modeling on a collection of online sermons that I scraped using rvest.

I am cleaning and organizing using pattern matching, especially grepl.

The problem is that grepl fails to match apparently identical strings. The scraped text is a mixture of "unknown" and "UTF-8" encoding. Functions like "Encoding", "enc2native", "enc2utf8", and "iconv" don't seem to help, nor does adjusting grepl arguments like perl = TRUE or useBytes = TRUE. (Not that I fully understand what all of these do.)

There seem to be several posts on this: (1) Troubles with encoding, pattern matching and noisy texts in R (2) https://community.rstudio.com/t/enconding-solution-for-linux-and-windows-10/2055 (3) R on Windows: character encoding hell and others.

With respect to #1, I am working in English and not Swedish so I do not see that changing my locale will help. Nor do I understand what portion of the code credited to Wiktor is fixing the problem in the answer provided by the original poster.

With respect to #2, as you'll see below I have attempted using Encoding() to change but with no success.

I am including #3 as a demonstration that many posts discuss foreign languages, while I'm staying in English. They also discuss difficulty with Windows 10 and encoding in RStudio, if that's relevant.

Here is my attempt at reproducible code. Unfortunately, the error seems to come from my original files and isn't reproduced by copying and pasting the following, as the differing charToRaw results under Edit #1 show. Per a comment, I added a file on GitHub that reproduces the error when loaded in my session (Edit #2). Per another comment, I have also added library calls and removed some of the whitespace in the middle of "scrapedtitle", because the Stack Overflow formatting otherwise introduces a newline character in the middle of the "author" variable. At the end of Edit #2 I also tried to recreate the troubled string from the charToRaw output using rawToChar, but I can't coerce it to "raw". In Edit #3 I discuss the RStudio encoding options: I saved different scraped portions using different encoding settings but unfortunately didn't keep track of which I used when. I expected that information to be recoverable and the changes reversible, but that doesn't appear to be the case.

#Library calls
library(topicmodels)
library(LDAvis)
library(tm)
library(dplyr)
library(magrittr)
library(stringr)

#The scraped title of a sermon
scrapedtitle <- "Answers to Prayer\n\t\t\t\t\t\n\t\t\t\t\t\tBrook P. Hales"

#Extract the author from the title
author <- scrapedtitle %>% substr(x=.,start=regexpr("\t[[:alpha:]]", .)+1, stop = nchar(.))

#Elsewhere, identify the author from another scraped list of sermons and authors:
scrapedvector <- c("Answers to Prayer", "Brook P. Hales", "Church Auditing Department Report, 2018", "Russell M. Nelson", "By Elder Brook P. Hales")

#attempted grepl: 
which(grepl(author, scrapedvector)) # only returns 2 when it should return 2 and 5

#Exploring:
typed <-"By Elder Brook P. Hales" #This is typed in from my keyboard

typed == scrapedvector[5] # FALSE unexpectedly

grepl(author, typed) #TRUE as you'd expect
grepl(author, scrapedvector[5]) # FALSE unexpectedly

#Checking encoding
Encoding(scrapedvector) #[1] "unknown" "unknown" "unknown" "unknown" "UTF-8"
Encoding(typed) #[1] "unknown"
Encoding(author) #[1] "unknown"

#Attempting to change the encoding:
Encoding(scrapedvector) <- "UTF-8"
Encoding(scrapedvector) # [1] "unknown" "unknown" "unknown" "unknown" "UTF-8" # No change
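
One possible reading of the "No change" result: R's documentation for Encoding() states that ASCII-only strings are never marked with a declared encoding, so assigning "UTF-8" to them leaves the mark at "unknown". The sketch below illustrates that behavior with made-up strings; `ascii_only`, `with_nbsp`, and the helper `has_non_ascii` are hypothetical names, not part of the scraped data.

```r
ascii_only <- "Brook P. Hales"          # every byte is below 0x80
with_nbsp  <- "Brook\u00a0P. Hales"     # \u00a0 is a non-breaking space

Encoding(ascii_only) <- "UTF-8"         # request a UTF-8 mark
Encoding(ascii_only)                    # "unknown": ASCII strings are not marked
Encoding(with_nbsp)                     # "UTF-8"

# Hypothetical helper: TRUE where a string contains any non-ASCII byte.
has_non_ascii <- function(x) {
  vapply(x, function(s) any(as.integer(charToRaw(s)) > 127L),
         logical(1), USE.NAMES = FALSE)
}
has_non_ascii(c(ascii_only, with_nbsp)) # FALSE TRUE
```

If this reading is right, the "UTF-8" elements in the scraped vector are simply the ones that contain at least one non-ASCII byte, which narrows down where to look.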

Edit #1:

# Adding charToRaw information: 
charToRaw(typed)
# [1] 42 79 20 45 6c 64 65 72 20 42 72 6f 6f 6b 20 50 2e 20 48 61 6c 65 73
charToRaw(scrapedvector[5]) 
# [1] 42 79 20 45 6c 64 65 72 20 42 72 6f 6f 6b c2 a0 50 2e 20 48 61 6c 65 73
# The scraped version has "c2 a0" at the 15th position, where the typed version has "20".

# Results from pasting the vector back into R from this stackoverflow post:
repastedvector <- c("Answers to Prayer", "Brook P. Hales", "Church Auditing Department Report, 2018", "Russell M. Nelson", "By Elder Brook P. Hales")

charToRaw(repastedvector[5])
# [1] 42 79 20 45 6c 64 65 72 20 42 72 6f 6f 6b 20 50 2e 20 48 61 6c 65 73
# The repasted string is identical to what I typed, but not to what I saved after scraping.

# Posting this because it is mentioned in other posts
Sys.getlocale()

[1] "LC_COLLATE=English_United States.1252;LC_CTYPE=English_United States.1252;LC_MONETARY=English_United States.1252;LC_NUMERIC=C;LC_TIME=English_United States.1252"
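
The bytes c2 a0 in the charToRaw output above are the UTF-8 encoding of U+00A0, a non-breaking space, which looks like an ordinary space (byte 20) when printed. A minimal sketch of normalizing it before matching, assuming that this one character is the only difference; the variable names `scraped`, `typed_ascii`, and `normalized` are illustrative only:

```r
# Rebuild the scraped string with an explicit non-breaking space (U+00A0).
scraped     <- "By Elder Brook\u00a0P. Hales"
typed_ascii <- "By Elder Brook P. Hales"

scraped == typed_ascii                 # FALSE

# Replace the non-breaking space with an ordinary space before matching.
normalized <- gsub("\u00a0", " ", scraped, fixed = TRUE)
normalized == typed_ascii              # TRUE
grepl("Brook P. Hales", normalized, fixed = TRUE)  # TRUE
```

The same gsub call could be applied to the whole scraped vector before any grepl step. Other invisible characters may be present in the real data, so this is a starting point, not a complete cleanup.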

Edit #2

An example of the file is available on Github: https://github.com/baprisbrey/stackoverflow/releases/tag/vA0
The file is scrapedTalk2.rds.

This is what I see when I load this file into my RStudio session:

scrapedTalk <- readRDS("scrapedTalk2.rds")
grepl(author, scrapedTalk) %>% which() # Result is 8.  It should be 8 and 73

scrapedvector2 <- scrapedTalk[c(7,8,18,72,73)] # This is the same as the scrapedvector from above 

Encoding(scrapedTalk)
  [1] "unknown" "unknown" "unknown" "unknown" "unknown" "unknown" "unknown" "unknown" "unknown" "unknown" "unknown"
 [12] "unknown" "unknown" "unknown" "unknown" "unknown" "unknown" "unknown" "unknown" "unknown" "unknown" "unknown"
 [23] "unknown" "unknown" "unknown" "unknown" "unknown" "unknown" "unknown" "unknown" "unknown" "unknown" "unknown"
 [34] "unknown" "unknown" "unknown" "unknown" "unknown" "unknown" "unknown" "unknown" "unknown" "unknown" "unknown"
 [45] "unknown" "unknown" "unknown" "unknown" "unknown" "UTF-8" "unknown" "UTF-8" "unknown" "unknown" "unknown"
 [56] "UTF-8" "unknown" "unknown" "unknown" "unknown" "unknown" "unknown" "unknown" "unknown" "unknown" "unknown"
 [67] "unknown" "unknown" "unknown" "unknown" "unknown" "unknown" "UTF-8" "unknown" "unknown" "UTF-8" "UTF-8"
 [78] "UTF-8" "UTF-8" "unknown" "UTF-8" "unknown" "UTF-8" "UTF-8" "unknown" "unknown" "UTF-8" "UTF-8"
 [89] "UTF-8" "UTF-8" "UTF-8" "UTF-8" "UTF-8" "UTF-8" "UTF-8" "unknown" "UTF-8" "unknown" "UTF-8"
[100] "UTF-8" "UTF-8" "UTF-8" "UTF-8" "UTF-8" "UTF-8" "UTF-8" "UTF-8" "unknown" "UTF-8" "unknown"
[111] "unknown" "UTF-8" "UTF-8" "unknown" "unknown" "unknown" "unknown" "UTF-8" "UTF-8" "UTF-8" "UTF-8"
[122] "UTF-8" "UTF-8" "UTF-8" "unknown" "UTF-8" "UTF-8" "UTF-8" "unknown" "unknown" "unknown" "unknown"
[133] "UTF-8" "UTF-8" "UTF-8" "UTF-8" "UTF-8" "UTF-8" "UTF-8" "unknown" "UTF-8"

scrapedTalk[73] == "By Elder Brook P. Hales" # FALSE, which is unexpected.

charToRaw(scrapedTalk[73]) # for reference
 [1] 42 79 20 45 6c 64 65 72 20 42 72 6f 6f 6b c2 a0 50 2e 20 48 61 6c 65 73

# Can I create the troubled encoding by pasting the charToRaw result above?
# Note: There may be an unintentional newline character ("\n") introduced in there due to the length of the string and the StackOverflow formatting. It should be removed.
troubleString <-  "42 79 20 45 6c 64 65 72 20 42 72 6f 6f 6b c2 a0 50 2e 20 48 61 6c 65 73" %>%
                   strsplit(. ,split="") %>%  # so far so good
                   unlist %>%                  # no troubles
                   as.raw %>%                  # NA's and 0's introduced
                   rawToChar                   # failure!
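
The attempt above likely fails because strsplit with split = "" breaks the text into single characters (including the spaces), and as.raw does not parse hexadecimal text. A hedged sketch of one way to rebuild the string from the hex dump, splitting on spaces and parsing each token as base 16 with strtoi; the names `hex` and `bytes` are introduced here for illustration:

```r
hex <- "42 79 20 45 6c 64 65 72 20 42 72 6f 6f 6b c2 a0 50 2e 20 48 61 6c 65 73"

# Split on spaces (not on every character), then parse each token as base 16.
bytes <- as.raw(strtoi(strsplit(hex, " ", fixed = TRUE)[[1]], base = 16L))

troubleString <- rawToChar(bytes)
Encoding(troubleString) <- "UTF-8"   # declare that the bytes are UTF-8

# The rebuilt string carries the same bytes as the scraped one.
identical(charToRaw(troubleString), bytes)
```

This gives a copy-and-paste way to reproduce the mismatch without the .rds file, assuming the hex dump is complete and contains no stray line breaks.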

Edit #3

Because the problem appears to be encoding, I am including a discussion of RStudio's encoding options. RStudio's File >> Save With Encoding menu offers multiple encoding options, and I do not know what the differences between them are. First, why doesn't Encoding() reveal all of these options? Surely the "unknown" bucket covers most of them. Second, because of the encoding difficulties, I toggled between encoding options, and it is highly likely that some of the scraped material was saved with one of these other options. I don't recall which ones I tried with which portions of the scraped material, however, and I recognize the ambiguity this introduces into the problem. I would like to know why I can't recover the proper encoding or convert to another encoding, but mostly why I can't get grepl to work.
