Dear StackOverflow community,
I'm a biologist working with disease-associated genetic variants from the official ClinVar database. My aim is to extract all gene names, transcripts and variants from this file:
ftp://ftp.ncbi.nlm.nih.gov/pub/clinvar/xml/ClinVarFullRelease_2020-01.xml.gz
However, ClinVar provides the information I need in a single column called "Name". The table below shows a few example values, one for each format I want to handle:
  Name                                                            ClinicalSignificance
1 NG_012236.2:g.11027del                                          Pathogenic
2 NM_018077.3(RBM28):c.1052T>C (p.Leu351Pro)                      Pathogenic
3 NC_012920.1:m.7445A>G                                           Pathogenic
4 m.7510T>C                                                       Pathogenic
5 NC_000023.11:g.(134493178_134493182)_(134501172_134501176)del   Pathogenic
(There are other types of entries, but since they do not contain the information I need, I will treat them as garbage.)
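In case it helps to reproduce the problem, here is a minimal hand-built version of the five example rows. This is only a sketch: the real table comes from the ClinVar file and has many more rows and columns, and I am assuming the same object name, ClinVar_Clean, that my code uses.

```r
# Hand-built copy of the five example rows (not the full ClinVar table).
# Assumption: only the two columns shown in the example are needed here.
ClinVar_Clean <- data.frame(
  Name = c(
    "NG_012236.2:g.11027del",
    "NM_018077.3(RBM28):c.1052T>C (p.Leu351Pro)",
    "NC_012920.1:m.7445A>G",
    "m.7510T>C",
    "NC_000023.11:g.(134493178_134493182)_(134501172_134501176)del"
  ),
  ClinicalSignificance = rep("Pathogenic", 5),
  stringsAsFactors = FALSE  # keep Name as character so it can be split
)
```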
I am looking for a way to split the "Name" column into 3 new columns, using multiple separators. I've tried using "|" in my regex to combine several alternative separators. However, each time a separator matches, the data that has already been split off is shifted one column to the right. My code:
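library(tidyr)  # provides separate()

# Step 1: split "Name" into the transcript and the rest (gene + variant)
ClinVar_Clean <- separate(ClinVar_Clean, Name,
                          into = c("Transcript", "gene.var"),
                          sep = "(?<=\\.[0-9]{1,2})[(]|(?<=[0-9]{3,16}\\.[0-9]{1,2}):|(?=[cmpng]\\.)")
# Step 2: split the rest into gene and variant
ClinVar_Clean <- separate(ClinVar_Clean, gene.var,
                          into = c("Gene", "Variant"),
                          sep = "\\):|(?=[cmpng]\\.)")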
My result:
Transcript Gene Variant ClinicalSignificance
1 NG_012236.2 <NA> Pathogenic
2 NM_018077.3 RBM28 Pathogenic
3 NC_012920.1 <NA> Pathogenic
4 m.7510T>C Pathogenic
5 NC_000023.11 <NA> Pathogenic
What the result should look like:
  Transcript    Gene   Variant                                           ClinicalSignificance
1 NG_012236.2          g.11027del                                        Pathogenic
2 NM_018077.3   RBM28  c.1052T>C (p.Leu351Pro)                           Pathogenic
3 NC_012920.1          m.7445A>G                                         Pathogenic
4                      m.7510T>C                                         Pathogenic
5 NC_000023.11         g.(134493178_134493182)_(134501172_134501176)del  Pathogenic
I also tried applying each separator individually. Instead of shifting the data to the right, that overwrites the remaining data.
If anyone could help, I would really appreciate it!