I have PDF text that I need to convert into "tidy" format, but I'm unsure how to read it in without compromising the information I need. For example:
# install pacman package if you require it
if (!require("pacman")) install.packages("pacman")
# p_load installs and loads packages
pacman::p_load(tidyverse, pdftools, tabulizer)
pdf_txt_raw <- pdf_text(
  "https://www.statcan.gc.ca/eng/statistical-programs/document/5027_D1_V10-eng.pdf"
) %>%
  read_lines()
pdf_txt_raw
Using read_lines() seems to cause a problem: whenever an entry in the "legal name" column spans two lines, it breaks the tidy format I'm looking for. For example, the Loblaw Inc [4] entry should be fine to clean up because each operating name is separated by a comma and they all fall within the Loblaws line, giving me a clean category.
But the very first legal name category is wrong due to a line break in the PDF - i.e., "Buy-Low Foods Limited Partnership" should be the legal name and the operating names within that category should be "AG Foods, Buy-Low Foods, Buy & Save Foods, Fine Foods, G&H Shop N' Save, Nesters Market".
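To make the target concrete, here is a minimal sketch of the tidy shape I have in mind for that first entry. The column names legal_name and operating_name are my own placeholders, and the one-row input is typed by hand to show what the entry should look like once the line break is repaired; it is not actual pdf_text() output.

```r
library(tibble)
library(tidyr)

# Hypothetical one-row input: the legal name and its operating names as they
# should read after the PDF line break has been repaired (typed by hand).
wide <- tibble(
  legal_name = "Buy-Low Foods Limited Partnership",
  operating_name = "AG Foods, Buy-Low Foods, Buy & Save Foods, Fine Foods, G&H Shop N' Save, Nesters Market"
)

# Split on commas so each operating name gets its own row,
# with the legal name repeated on every row.
tidy <- separate_rows(wide, operating_name, sep = ",\\s*")
tidy
```

So each legal name would be repeated once per operating name, one pair per row.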
Any tips on how to clean this properly and get the tidy format I'm looking for?