Hey, I would like to extract names from a text. My identifying pattern is that names always start with a capital letter, and there will be two or three capitalized words in a row. Furthermore, I account for the fact that there could be an author called "Jack Jr. Bones", so I make the "." (in fact, any punctuation character) optional after the second word. The last case is an institution in the text with an article, e.g. "the Robert Brown theater", so I would like to exclude all cases where the two or three capitalized words are preceded by "the". I do this with a negative lookbehind:
library(stringr)
test <- "A beautiful day for Jack Bones ended in the Robert Brown theater"
str_extract(test, "(?<!the\\s)(([A-Z][\\w]+\\s[A-Z][\\w]+[[:punct:]]?\\s[A-Z][\\w]+)|([A-Z][\\w]+\\s[A-Z][\\w]+))")
[1] "Jack Bones"
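For readability, here is a sketch of the same pattern spelled out piece by piece. It assumes stringr's `regex()` helper with `comments = TRUE`, which ignores whitespace and `#` comments inside the pattern. The name `name_pattern` is only introduced here for illustration, and `\\w+` is used as an equivalent of `[\\w]+`.

```r
library(stringr)

# Same pattern as above, written with comments = TRUE so that whitespace
# and "#" comments inside the pattern are ignored by the regex engine.
name_pattern <- regex("
  (?<!the\\s)                # not directly preceded by the word the plus one whitespace
  (
    [A-Z]\\w+ \\s            # first capitalized word
    [A-Z]\\w+ [[:punct:]]?   # second capitalized word, optional punctuation
    \\s [A-Z]\\w+            # third capitalized word
  |
    [A-Z]\\w+ \\s [A-Z]\\w+  # otherwise: just two capitalized words
  )
", comments = TRUE)

test <- "A beautiful day for Jack Bones ended in the Robert Brown theater"
str_extract(test, name_pattern)
# [1] "Jack Bones"
```

Note that each capitalized word needs at least two characters (`[A-Z]` followed by `\\w+`), which is why the leading "A" is not picked up.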
But now I am facing the following problem: if a sentence starts with "The Robert Brown theater", then I match this pattern too. I thought I could be clever and just add "(?i)" inside the negative lookbehind, but it turns out that this does not work:
test <- "The Robert Brown theater was nice, but Jack Bones did not enjoy his time there"
str_extract(test, "(?<!(?i)the\\s)(([A-Z][\\w]+\\s[A-Z][\\w]+[[:punct:]]?\\s[A-Z][\\w]+)|([A-Z][\\w]+\\s[A-Z][\\w]+))")
[1] "The Robert Brown"
Another idea was to add an alternation (an "or" condition) inside the lookbehind:
str_extract(test, "(?<!(the\\s|The\\s))(([A-Z][\\w]+\\s[A-Z][\\w]+[[:punct:]]?\\s[A-Z][\\w]+)|([A-Z][\\w]+\\s[A-Z][\\w]+))")
[1] "The Robert Brown"
Then I checked whether it would work if I use only "The" in the negative lookbehind, and I discovered that even this does not work:
str_extract(test, "(?<!The\\s)(([A-Z][\\w]+\\s[A-Z][\\w]+[[:punct:]]?\\s[A-Z][\\w]+)|([A-Z][\\w]+\\s[A-Z][\\w]+))")
[1] "The Robert Brown"
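A small diagnostic sketch that might help narrow this down (it is not one of the attempts above, and `pattern` is just a name introduced here): `str_locate()` reports where the first match starts and ends, and `str_extract_all()` lists every match instead of only the first one.

```r
library(stringr)

test <- "The Robert Brown theater was nice, but Jack Bones did not enjoy his time there"
pattern <- "(?<!The\\s)(([A-Z][\\w]+\\s[A-Z][\\w]+[[:punct:]]?\\s[A-Z][\\w]+)|([A-Z][\\w]+\\s[A-Z][\\w]+))"

# Start and end of the first match (1-based, inclusive)
str_locate(test, pattern)
#      start end
# [1,]     1  16

# All non-overlapping matches, not only the first one
str_extract_all(test, pattern)[[1]]
# [1] "The Robert Brown" "Jack Bones"
```

The reported start of 1 means the match begins at the very first character of the string, which may be worth keeping in mind when reasoning about what the lookbehind actually gets to see.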
Now I am a bit clueless. I do not understand why the negative lookbehind works with "the" but fails when I condition on "The". I would appreciate any help and insight!