I have the following dataframe. It has one column of text that I would like to split into multiple columns using the separate() function from tidyr.
df <- structure(list(CPT.Codes = structure(c(2L, 1L, 3L, 4L, 5L), .Label = c("28296 - CORRECTION OF BUNION...., 64445P - N BLOCK INJ, SCIATIC, SNG, 76942P - US GUIDE, NEEDLE PLACEMENT",
"36821 - AV FUSION DIRECT ANY SITE, 99100P - ANESTHESIA FOR PT OF EXTREME AGE",
"41899 - DENTAL SURGERY PROCEDURE", "50593 - PERC CRYO ABLATE RENAL TUM, 99100P - ANESTHESIA FOR PT OF EXTREME AGE",
"64721 - CARPAL TUNNEL SURGERY"), class = "factor")), class = "data.frame", row.names = c(NA,
-5L))
My desired output is the dataframe below. Each 5-digit number, optionally followed by a letter (for example 64445P), is a code, and the text after the dash is that code's description. Sometimes a code's description contains single-digit numbers and multiple commas, so the regular expression needs to treat only a 5-digit number as the start of a new code.
dfDesired <- structure(list(CPTcode1 = c(36821L, 28296L, 41899L, 50593L, 64721L
), CPTdescrip1 = structure(c(1L, 3L, 4L, 5L, 2L), .Label = c("AV FUSION DIRECT ANY SITE",
"CARPAL TUNNEL SURGERY", "CORRECTION OF BUNION....", "DENTAL SURGERY PROCEDURE",
"PERC CRYO ABLATE RENAL TUM"), class = "factor"), CPTcode2 = structure(c(2L,
1L, NA, 2L, NA), .Label = c("64445P", "99100P"), class = "factor"),
CPTdescrip2 = structure(c(1L, 2L, NA, 1L, NA), .Label = c("ANESTHESIA FOR PT OF EXTREME AGE",
"N BLOCK INJ"), class = "factor"), CPTcode3 = structure(c(NA,
1L, NA, NA, NA), .Label = "76942P", class = "factor"), CPTdescrip3 = structure(c(NA,
1L, NA, NA, NA), .Label = "US GUIDE NEEDLE PLACEMENT", class = "factor")), class = "data.frame", row.names = c(NA,
-5L))
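To make the boundary rule concrete: a new code should start only where a comma is followed by a 5-digit number, an optional capital letter, and a dash. Below is a rough base-R sketch of that rule on one row of the sample data. The pattern and the names `pieces`, `codes`, and `descrips` are only my illustration of the rule. It assumes every code has exactly this shape, and it does not yet produce the wide column layout I am after.

```r
# One row from the sample data, used only to illustrate the boundary rule.
x <- "28296 - CORRECTION OF BUNION...., 64445P - N BLOCK INJ, SCIATIC, SNG, 76942P - US GUIDE, NEEDLE PLACEMENT"

# Split on a comma only when it is followed by a 5-digit code,
# an optional capital letter, and " - " (lookahead requires perl = TRUE).
pieces <- strsplit(x, ",\\s*(?=\\d{5}[A-Z]?\\s+-\\s)", perl = TRUE)[[1]]

# Separate each piece into its code and its description at the first " - ".
codes    <- sub("\\s+-\\s+.*$", "", pieces, perl = TRUE)
descrips <- sub("^\\S+\\s+-\\s+", "", pieces, perl = TRUE)

print(data.frame(code = codes, description = descrips))
```

Commas inside a description (such as "N BLOCK INJ, SCIATIC, SNG") are left alone because they are not followed by a 5-digit code.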
I have tried variations of the code below with separate(), but none of them gives the desired output. I am new to regular expressions and have not been able to work this out from existing examples.
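library(dplyr)
library(tidyr)

# Attempted split; the sep pattern does not give the desired columns
df %>%
  separate(CPT.Codes,
           into = c("CPTcode1", "CPTdescrip1", "CPTcode2", "CPTdescrip2", "CPTcode3", "CPTdescrip3"),
           sep = "(?<=[A-Z]) ?(?=[0-9])", remove = FALSE) %>%
  glimpse()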
Thanks in advance.