Applying regular expressions while keeping commas in R

I want to clean the text in a dataset with regular expressions. However, I need to keep the commas, because after cleaning I will split the text on them. The problem is that I am not very familiar with regex (I generally use quanteda and treat words separately as uni-grams, but here I can't, because I need to keep each comma-delimited X-gram intact).

The dataset looks like this:

   ID         Key
   1          "Hello, dog_ food, This is it2, water"
   2          "wow! nice, love, yes"
   3          "1997"
   4
   5          "blabla, 34 l lol, @IceCream, #nice #wow d, seriously Not"
   ...

Among other things, I want to get rid of words of two letters or fewer, remove anything that is not alphanumeric, and stem words that are uni-grams.

I tried these commands to keep only lowercase alphanumerics and drop the short words, but I end up deleting the commas too, and I am not sure how to avoid that:

library(stringr)

data$Key <- str_to_lower(data$Key)                            # lowercase everything
data$Key <- str_replace_all(data$Key, "[^[:alnum:]]", "")     # strip non-alphanumerics (this also deletes the commas and spaces)
data$Key <- gsub(" *\\b[[:alpha:]]{1,2}\\b *", "", data$Key)  # remove 1-2 letter words
data$Key <- gsub("^ +| +$|( ) +", "\\1", data$Key)            # trim leading/trailing spaces and collapse repeats
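
One possible direction, sketched below (assuming stringr and that the column is data$Key), would be to keep the comma and the space out of the characters being removed; I am not sure it is right, since it still leaves standalone numbers such as "34" and tokens like "it2" untouched:

library(stringr)

cleaned <- str_to_lower(data$Key)
cleaned <- str_replace_all(cleaned, "[^[:alnum:], ]", "")          # keep letters, digits, commas and spaces
cleaned <- str_replace_all(cleaned, "\\b[[:alpha:]]{1,2}\\b", "")  # drop 1-2 letter words
cleaned <- str_squish(cleaned)                                     # collapse repeated whitespace
cleaned <- str_replace_all(cleaned, " ?, ?", ", ")                 # tidy spacing around the commas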

The expected output should be something like:

   ID         Key
   1          "hello, dog food, this, water"
   2          "wow nice, love, yes"
   3          "1997"
   4
   5          "blabla, lol, icecream, nice wow, seriously not"
   ...

So basically: everything lowercase, words of two letters or fewer removed, and any symbol that is not alphanumeric removed, while keeping the commas.
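
After the cleaning, the plan is to split each entry on the commas; a minimal sketch of that step (again assuming stringr and the data$Key column) would be:

library(stringr)

# each element of `pieces` becomes a character vector of the comma-delimited chunks
pieces <- str_split(data$Key, ",\\s*")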

Thank you very much in advance for your help!

