Channel: Active questions tagged r - Stack Overflow

Split into column treating successive delimiters as one

Hi, I would like to split one column of a data.frame into multiple columns, with successive delimiters treated as one. My input was scraped from a text file, so it is a bit of a mess: it has different delimiters, and sometimes the same one is repeated several times. In my example below I am using space, comma, "and" or dash as the delimiters, but I actually have more than 6 different ones, including several words ("and" and "incl").

I would normally use tidyr::separate, but it doesn't have an option to combine successive delimiters. Trying to make an exhaustive list of the possible combinations for the pattern soon gets ridiculous, especially as I sometimes have 4 or 5 spaces or commas in a row.

I have provided a reprex and the desired output below (the desired output was made by manually changing the text, which is not feasible in my real data of 1000s of lines).

Data:

library(tidyr)

testdf <- data.frame(test = c("This string has single spaces",
                              "This  one  has  double  spaces",
                              "This, has, comma,or space,   or ,both",
                              "This,one-, space,- comma -,and-dash"))

This is the code I have tried so far:

separate(testdf, test, into = letters[1:12], sep = " |,|-|and", fill = "right")

#> Warning: Expected 12 pieces. Additional pieces discarded in 2 rows [3, 4].
#>      a      b   c      d      e     f      g    h      i     j    k    l
#> 1 This string has single spaces  <NA>   <NA> <NA>   <NA>  <NA> <NA> <NA>
#> 2 This        one           has       double      spaces  <NA> <NA> <NA>
#> 3 This        has               comma     or             space          
#> 4 This        one               space                    comma

# sort of starting to work, but the pattern gets very long very fast
separate(testdf, test, into = letters[1:12], sep = "  |, |, | |and|,", fill = "right")

#>      a      b    c      d      e    f     g     h    i     j    k    l
#> 1 This string  has single spaces <NA>  <NA>  <NA> <NA>  <NA> <NA> <NA>
#> 2 This    one  has double spaces <NA>  <NA>  <NA> <NA>  <NA> <NA> <NA>
#> 3 This    has       comma     or            space         or      both
#> 4 This        one-  space      -      comma     -      -dash <NA> <NA>

Based on Gregor's answer, which was posted before I specified that I also need word delimiters:


separate(testdf, test, into = letters[1:12], sep = "[ ,-]+", fill = "right")
#>      a      b        c      d      e     f    g    h    i    j    k    l
#> 1 This string      has single spaces  <NA> <NA> <NA> <NA> <NA> <NA> <NA>
#> 2 This    one      has double spaces  <NA> <NA> <NA> <NA> <NA> <NA> <NA>
#> 3 This    has andcomma     or    and space   or both <NA> <NA> <NA> <NA>
#> 4 This    one    space    and  comma   and dash <NA> <NA> <NA> <NA> <NA>
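The character class `[ ,-]+` can only hold single characters, which is why "and" survives in the output above. One possible extension, sketched here rather than offered as a definitive answer, is to put all delimiters into a grouped alternation and apply `+` to the whole group, so any run of delimiters is consumed as one. The names `delims` and `sep_pattern` are illustrative, and the `\\b` word boundaries are an assumption that word delimiters always appear as standalone words.

```r
library(tidyr)

testdf <- data.frame(test = c("This string has single spaces",
                              "This  one  has  double  spaces",
                              "This, has, comma,or space,   or ,both",
                              "This,one-, space,- comma -,and-dash"),
                     stringsAsFactors = FALSE)

# Illustrative vector of delimiters: symbols and words alike.
# \\b is an assumption: it keeps "and" from matching inside a longer
# word such as "sand", but requires "and" to stand alone.
delims <- c(" ", ",", "-", "\\band\\b")

# Group the alternatives and quantify the group with +, so that any
# run of successive delimiters is treated as a single separator.
sep_pattern <- paste0("(", paste(delims, collapse = "|"), ")+")

result <- separate(testdf, test, into = letters[1:7],
                   sep = sep_pattern, fill = "right")
result
```

Further word delimiters such as "incl" could be appended to `delims` in the same way (for example `"\\bincl\\b"`). If a word delimiter can be glued directly onto other text, the `\\b` anchors would have to be dropped, at the cost of also splitting words that merely contain it.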


Desired output:

#>      a      b     c      d      e    f    g
#> 1 This string   has single spaces <NA> <NA>
#> 2 This    one   has double spaces <NA> <NA>
#> 3 This    has comma     or  space   or both
#> 4 This    one space  comma    dash <NA> <NA>

Created on 2019-10-30 by the reprex package (v0.3.0)

