I want to apply text cleaning with regular expressions on a dataset. However, I want to keep the commas, because after cleaning I need to split the text on them (,). The problem is that I am not very familiar with regex (I generally use quanteda and treat words separately as uni-grams, but in this case I can't, because I need to treat each X-gram as a unit delimited by the commas).
the dataset looks like this:
ID Key
1 "Hello, dog_ food, This is it2, water"
2 "wow! nice, love, yes"
3 "1997"
4
5 "blabla, 34 l lol, @IceCream, #nice #wow d, seriously Not"
...
Among other things, I want to get rid of words of 2 characters or fewer, get rid of anything that is not alphanumeric, and stem words that are uni-grams.
I tried these commands to keep only lowercase alphanumerics and drop the short words, but I end up deleting the commas too, and I am not sure how to avoid that:
data$keys <- tolower(data$keys)                                 # was to_lower(), which does not exist
data$keys <- str_replace_all(data$keys, "[^[:alnum:]]", "")     # this also strips the commas and spaces
data$keys <- gsub(" *\\b[[:alpha:]]{1,2}\\b *", "", data$keys)  # remove 1-2 letter words
data$keys <- gsub("^ +| +$|( ) +", "\\1", data$keys)            # trim and collapse spaces
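From what I can tell, the commas disappear because the character class `[^[:alnum:]]` matches them along with everything else. A sketch of the fix I am considering is simply to exclude the comma (and the space) from the class, so only the unwanted symbols get replaced (base R):

```r
x <- "Hello, dog_ food, This is it2, water"
# replace everything that is NOT a letter, digit, comma, or space with a space
gsub("[^[:alnum:], ]", " ", x)
```

Replacing with a space rather than "" avoids accidentally gluing two words together when a symbol sits between them; the extra spaces can be collapsed afterwards.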
the expected output should be something like
ID Key
1 "hello, dog food, this, water"
2 "wow nice, love, yes"
3 "1997"
4
5 "blabla, lol, icecream, nice wow, seriously not"
...
So basically: everything lowercase, words of 2 characters or fewer removed, and any symbol that is not alphanumeric removed, with the commas preserved.
Thank you very much in advance for your help!