I want to be able to use grepl() and gsub() only outside of given sets of delimiters. For instance, I want to be able to ignore text between quotes.
Here is my desired output:
grepl2("banana", "'banana' banana \"banana\"", escaped = c('""', "''"))
#> [1] TRUE
grepl2("banana", "'banana' apple \"banana\"", escaped = c('""', "''"))
#> [1] FALSE
grepl2("banana", "{banana} banana {banana}", escaped = "{}")
#> [1] TRUE
grepl2("banana", "{banana} apple {banana}", escaped = "{}")
#> [1] FALSE
gsub2("banana", "potatoe", "'banana' banana \"banana\"")
#> [1] "'banana' potatoe \"banana\""
gsub2("banana", "potatoe", "'banana' apple \"banana\"")
#> [1] "'banana' apple \"banana\""
gsub2("banana", "potatoe", "{banana} banana {banana}", escaped = "{}")
#> [1] "{banana} potatoe {banana}"
gsub2("banana", "potatoe", "{banana} apple {banana}", escaped = "{}")
#> [1] "{banana} apple {banana}"
Real cases may contain any number of quoted substrings, in any order.
I have written the following functions, which work for these cases, but they are clunky, and gsub2() is not robust at all: it temporarily replaces the delimited content with placeholders, and these placeholders might be affected by subsequent operations.
regex_escape <- function(string, n = 1) {
  for (i in seq_len(n)) {
    string <- gsub("([][{}().+*^$|\\?])", "\\\\\\1", string)
  }
  string
}
grepl2 <- function(pattern, x, ignore.case = FALSE, perl = FALSE, fixed = FALSE,
                   useBytes = FALSE, escaped = c('""', "''")) {
  # Split each delimiter pair into its opening and closing character
  escaped <- strsplit(escaped, "")
  # TODO check that "escaped" delimiters are balanced and don't cross each other
  for (i in seq_along(escaped)) {
    close <- regex_escape(escaped[[i]][[2]])
    open <- regex_escape(escaped[[i]][[1]])
    # Remove every delimited span (non-greedy) before searching
    pattern_i <- sprintf("%s.*?%s", open, close)
    x <- gsub(pattern_i, "", x)
  }
  grepl(pattern, x, ignore.case = ignore.case, perl = perl, fixed = fixed,
        useBytes = useBytes)
}
gsub2 <- function(pattern, replacement, x, ignore.case = FALSE, perl = FALSE,
                  fixed = FALSE, useBytes = FALSE, escaped = c('""', "''")) {
  # Split each delimiter pair into its opening and closing character
  escaped <- strsplit(escaped, "")
  # TODO check that "escaped" delimiters are balanced and don't cross each other
  matches <- character()
  for (i in seq_along(escaped)) {
    close <- regex_escape(escaped[[i]][[2]])
    open <- regex_escape(escaped[[i]][[1]])
    pattern_i <- sprintf("%s.*?%s", open, close)
    ind <- gregexpr(pattern_i, x)
    matches_i <- regmatches(x, ind)[[1]]
    # Swap each delimited span for a numbered placeholder such as "((1))"
    regmatches(x, ind)[[1]] <- paste0("((", length(matches) + seq_along(matches_i), "))")
    matches <- c(matches, matches_i)
  }
  # Apply the requested replacement to the text outside the delimiters
  x <- gsub(pattern, replacement, x, ignore.case = ignore.case, perl = perl,
            fixed = fixed, useBytes = useBytes)
  # Put the original delimited spans back in place of the placeholders
  for (i in seq_along(matches)) {
    pattern <- sprintf("\\(\\(%s\\)\\)", i)
    x <- gsub(pattern, matches[[i]], x)
  }
  x
}
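To make the fragility concrete, here is a minimal sketch that re-creates only the placeholder step of gsub2(), hard-coded to single quotes for brevity (that simplification and the sample string are assumptions of this sketch). When the pattern happens to match a character inside the placeholder, the placeholder is corrupted and the quoted text is never restored.

```r
# Re-creation of the placeholder mechanism, assuming single-quote delimiters only
x <- "'a1' 1"
ind <- gregexpr("'.*?'", x)
matches <- regmatches(x, ind)[[1]]   # the quoted span "'a1'"
regmatches(x, ind)[[1]] <- paste0("((", seq_along(matches), "))")
masked <- x                          # "((1)) 1"

# The user's replacement also hits the digit inside the placeholder
x <- gsub("1", "X", x)               # "((X)) X"

# Restoration looks for "((1))", which no longer exists
for (i in seq_along(matches)) {
  x <- gsub(sprintf("\\(\\(%s\\)\\)", i), matches[[i]], x)
}
x
#> [1] "((X)) X"
```

The result I would want here is "'a1' X".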
Is there a solution using regex and no placeholder? Note that my current functions support multiple pairs of delimiters, but I'll be satisfied with a solution that supports one pair only and, for instance, does not try to match substrings between single quotes.
It is also acceptable to impose different delimiters, for instance { and } rather than two " or two ', if it helps.
I am also fine with imposing perl = TRUE.
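One direction that seems to fit these constraints, given as a minimal sketch rather than a vetted solution: with perl = TRUE, PCRE's (*SKIP)(*FAIL) backtracking verbs can consume a delimited span and then discard it, so only text outside the delimiters remains eligible to match. The helper names grepl_outside and gsub_outside are hypothetical, and the sketch assumes a single pair of delimiters { and } with no nesting.

```r
# Hypothetical helpers (not from the code above): match `pattern` only outside {...}.
# Assumes one pair of one-character delimiters and no nested braces.
skip_braces <- "\\{[^}]*\\}(*SKIP)(*FAIL)|"

grepl_outside <- function(pattern, x, ...) {
  # The non-capturing group keeps any capture groups in `pattern` numbered as written
  grepl(paste0(skip_braces, "(?:", pattern, ")"), x, perl = TRUE, ...)
}

gsub_outside <- function(pattern, replacement, x, ...) {
  gsub(paste0(skip_braces, "(?:", pattern, ")"), replacement, x, perl = TRUE, ...)
}

grepl_outside("banana", "{banana} banana {banana}")
#> [1] TRUE
grepl_outside("banana", "{banana} apple {banana}")
#> [1] FALSE
gsub_outside("banana", "potatoe", "{banana} banana {banana}")
#> [1] "{banana} potatoe {banana}"
```

An unclosed { is not treated as a delimiter in this sketch, so text after it would still be matched.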