Quantcast
Channel: Active questions tagged r - Stack Overflow
Viewing all articles
Browse latest Browse all 201867

How to extract unique string in between string pattern in full text in R?

$
0
0

I'm looking to extract names and professions of those who testified in front of Congress from the following text:

text <- c(("FULL COMMITTEE HEARINGS\\", \\" 2017\\",\n\\" April 6, 2017—‘‘The 2017 Tax Filing Season: Internal Revenue\\", \", \"\\"\nService Operations and the Taxpayer Experience.’’ This hearing\\", \\" examined\nissues related to the 2017 tax filing season, including\\", \\" IRS performance,\ncustomer service challenges, and information\\", \\" technology. Testimony was\nheard from the Honorable John\\", \\" Koskinen, Commissioner, Internal Revenue\nService, Washington,\\", \", \"\\" DC.\\", \\" May 25, 2017—‘‘Fiscal Year 2018 Budget\nProposals for the Depart-\\", \\" ment of Treasury and Tax Reform.’’ The hearing\ncovered the\\", \\" President’s 2018 Budget and touched on operations of the De-\n\\", \\" partment of Treasury and Tax Reform. Testimony was heard\\", \\" from the\nHonorable Steven Mnuchin, Secretary of the Treasury,\\", \", \"\\" United States\nDepartment of the Treasury, Washington, DC.\\", \\" July 18, 2017—‘‘Comprehensive\nTax Reform: Prospects and Chal-\\", \\" lenges.’’ The hearing covered issues\nsurrounding potential tax re-\\", \\" form plans including individual, business,\nand international pro-\\", \\" posals. Testimony was heard from the Honorable\nJonathan Talis-\\", \", \"\\" man, former Assistant Secretary for Tax Policy 2000–\n2001,\\", \\" United States Department of the Treasury, Washington, DC; the\\",\n\\" Honorable Pamela F. Olson, former Assistant Secretary for Tax\\", \\" Policy\n2002–2004, United States Department of the Treasury,\\", \\" Washington, DC; the\nHonorable Eric Solomon, former Assistant\\", \", \"\\" Secretary for Tax Policy\n2006–2009, United States Department of\\", \\" the Treasury, Washington, DC; and\nthe Honorable Mark J.\\", \\" Mazur, former Assistant Secretary for Tax Policy\n2012–2017,\\", \\" United States Department of the Treasury, Washington, DC.\\",\n\\" (5)\\", \\"VerDate Sep 11 2014 14:16 Mar 28, 2019 Jkt 000000 PO 00000 Frm 00013\nFmt 6601 Sfmt 6601 R:\\\\DOCS\\\\115ACT.000 TIM\\"\", \")\")" )

The full text is available here: https://www.congress.gov/116/crpt/srpt19/CRPT-116srpt19.pdf

It seems that the names are in between "Testimony was heard from" until the next ".". So, how can I extract the names between these two patterns? The text is much longer (50 page document), but I figured that if I can do it one, I'll do it for the rest of the text.

I know I can't use NLP for name extraction because they are names of persons that didn't testify, for example.


Viewing all articles
Browse latest Browse all 201867

Trending Articles



<script src="https://jsc.adskeeper.com/r/s/rssing.com.1596347.js" async> </script>