Problem: Hi all, I have this sample data frame containing institution names that I need to extract:
library(stringr)
mydf <- data.frame(ID = c('1', '2', '3'), Institution = c('Univ of Space, TX, US', '[Bloggs, J., Smith, T.] Univ of Time, CA, US', '[Windz, P., Lol, D.] College of the World, CA, US'))
I need to extract the institution names only, such that it would appear like this:
1 Univ of Space
2 Univ of Time
3 College of the World
I don't care about any of the other characters in the institution string, only the institution name itself, which runs up to the first comma that follows it. The issue is that the institution name is sometimes preceded by a bracketed list of authors (which contains commas of its own) and sometimes appears on its own at the start of the string (as in the first row).
I've written the following to extract these two instances separately:
ex_inst <- str_extract_all(mydf$Institution, "(?<=])(.+?)(?=,)", simplify = TRUE)
ex_inst2 <- str_extract_all(mydf$Institution, "^(.+?)(?=,)", simplify = TRUE)
I'm struggling to combine them. I have looked into alternation and tried this:
ex_inst3 <- str_extract_all(mydf$Institution, "^(.+?)(?=,)|(?<=])(.+?)(?=,)", simplify = TRUE)
But I'm not experienced with regex and am confused by what it's outputting:
     [,1]            [,2]
[1,] "Univ of Space" ""
[2,] "[Bloggs"       " Univ of Time"
[3,] "[Windz"        " College of the World"
What's the best way to combine these with stringr? Can I use some sort of if/else statement? Thanks.
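For reference, here is a minimal sketch of one way the two cases might be handled without alternation. It assumes that the bracketed author list, when present, always sits at the very start of the string and contains no nested brackets. The names `no_authors` and `Inst` are illustrative only and not part of my original code.

```r
library(stringr)

mydf <- data.frame(
  ID = c("1", "2", "3"),
  Institution = c(
    "Univ of Space, TX, US",
    "[Bloggs, J., Smith, T.] Univ of Time, CA, US",
    "[Windz, P., Lol, D.] College of the World, CA, US"
  )
)

# Step 1: drop an optional leading "[...]" block and any whitespace after it
# (assumption: the block, if present, is at the start and has no nested brackets).
no_authors <- str_remove(mydf$Institution, "^\\[[^\\]]*\\]\\s*")

# Step 2: keep everything before the first remaining comma.
mydf$Inst <- str_extract(no_authors, "^[^,]+")

print(mydf[, c("ID", "Inst")])
```

Removing the optional prefix first means a single "up to the first comma" pattern covers both kinds of row, so no if/else branch is needed under these assumptions.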