Hi, I would like to split one column of a data.frame into multiple columns, with successive delimiters treated as one. My input was scraped from a text file, so it is a bit of a mess: it mixes different delimiters, and sometimes the same one is repeated several times. In my example below I use space, comma, "and" or dash as the delimiters, but I actually have more than six different ones, including several words ("and" and "incl").
I would normally use `tidyr::separate`, but it doesn't have an option to combine successive delimiters. Trying to write out an exhaustive list of the possible combinations in the pattern soon gets ridiculous, especially as I sometimes have 4 or 5 spaces or commas in a row.
I have provided a reprex and the desired output below. I made the desired output by manually editing the text, which is not feasible for my real data of thousands of lines.
Data:
```r
library(tidyr)
testdf <- data.frame(test = c("This string has single spaces",
                              "This  one  has  double  spaces",
                              "This, has, comma,or space, or ,both",
                              "This,one-, space,- comma -,and-dash"))
```
This is the code I have tried so far:
```r
separate(testdf, test, into = letters[1:12], sep = " |,|-|and", fill = "right")
#> Warning: Expected 12 pieces. Additional pieces discarded in 2 rows [3, 4].
#> a b c d e f g h i j k l
#> 1 This string has single spaces <NA> <NA> <NA> <NA> <NA> <NA> <NA>
#> 2 This one has double spaces <NA> <NA> <NA>
#> 3 This has comma or space
#> 4 This one space comma

# sort of starting to work, but gets very extensive very fast
separate(testdf, test, into = letters[1:12], sep = " |, |, | |and|,", fill = "right")
#> a b c d e f g h i j k l
#> 1 This string has single spaces <NA> <NA> <NA> <NA> <NA> <NA> <NA>
#> 2 This one has double spaces <NA> <NA> <NA> <NA> <NA> <NA> <NA>
#> 3 This has comma or space or both
#> 4 This one- space - comma - -dash <NA> <NA>
```
Based on Gregor's answer, which was posted before I specified that I also need word delimiters:
```r
separate(testdf, test, into = letters[1:12], sep = "[ ,-]+", fill = "right")
#> a b c d e f g h i j k l
#> 1 This string has single spaces <NA> <NA> <NA> <NA> <NA> <NA> <NA>
#> 2 This one has double spaces <NA> <NA> <NA> <NA> <NA> <NA> <NA>
#> 3 This has andcomma or and space or both <NA> <NA> <NA> <NA>
#> 4 This one space and comma and dash <NA> <NA> <NA> <NA> <NA>
```
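One idea that may be worth checking, since a character class like `[ ,-]` cannot hold whole words: put the alternatives in a group and let the group repeat, as in `( |,|-|and)+`. The sketch below uses base R's `strsplit` so that it runs without any packages. The sample vector `x`, the padding step and the names `pattern`, `pieces`, `padded` and `result` are illustrative assumptions, not part of my real data or code. Note that a bare `and` also matches inside longer words (for example "band"), which may or may not be acceptable.

```r
# Hypothetical sample data, with repeated delimiters written out explicitly
x <- c("This string has single spaces",
       "This  one  has  double  spaces",
       "This, has, comma,or space, or ,both",
       "This,one-, space,- comma -,and-dash")

# Group the alternatives and let the group repeat, so that any run of
# delimiters (in any mix) is consumed as a single separator
pattern <- "( |,|-|and)+"

pieces <- strsplit(x, pattern)

# Pad every row with NA to the length of the longest row, then bind the
# rows into a data.frame with columns a, b, c, ...
n <- max(lengths(pieces))
padded <- lapply(pieces, function(p) c(p, rep(NA_character_, n - length(p))))
result <- as.data.frame(do.call(rbind, padded), stringsAsFactors = FALSE)
names(result) <- letters[seq_len(n)]
result
```

Because `sep` in `tidyr::separate` is treated as a regular expression, the same pattern string should in principle be usable there as well, though I have not verified this against the full data.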
### *Desired output:*

```r
#>      a      b     c      d      e    f    g
#> 1 This string   has single spaces <NA> <NA>
#> 2 This    one   has double spaces <NA> <NA>
#> 3 This    has comma     or  space   or both
#> 4 This    one space  comma   dash <NA> <NA>
```

Created on 2019-10-30 by the reprex package (v0.3.0)