Channel: Active questions tagged r - Stack Overflow

How to use separate() with multiple separators in R (handling ClinVar variant data)

Dear Stack Overflow community,

I'm a biologist working with disease-associated genetic variants from the official ClinVar database. My aim is to extract all gene names, transcripts, and variants from this list:

ftp://ftp.ncbi.nlm.nih.gov/pub/clinvar/xml/ClinVarFullRelease_2020-01.xml.gz

However, ClinVar provides the information I need in a single column called "Name". The table below shows a few example values, one for each of the formats I need to handle:

                                                           Name ClinicalSignificance
1                                        NG_012236.2:g.11027del           Pathogenic
2                    NM_018077.3(RBM28):c.1052T>C (p.Leu351Pro)           Pathogenic
3                                         NC_012920.1:m.7445A>G           Pathogenic
4                                                     m.7510T>C           Pathogenic
5 NC_000023.11:g.(134493178_134493182)_(134501172_134501176)del           Pathogenic

(There are other types of entries, but since they do not contain the information I need, I will treat them as garbage.)

I am looking for a way to split the "Name" column into three new columns using multiple separators. I've tried using "|" in my sep regex to match several alternative separators. However, each time a separator matches, the data that has already been separated is pushed one column to the right. My code:

library(tidyr)

ClinVar_Clean <- separate(ClinVar_Clean, Name, into = c("Transcript", "gene.var"), sep = "(?<=\\.[0-9]{1,2})[(]|(?<=[0-9]{3,16}\\.[0-9]{1,2}):|(?=[cmpng]\\.)")
ClinVar_Clean <- separate(ClinVar_Clean, gene.var, into = c("Gene", "Variant"), sep = "\\):|(?=[cmpng]\\.)")

My result:

    Transcript  Gene   Variant ClinicalSignificance
1  NG_012236.2            <NA>           Pathogenic
2  NM_018077.3 RBM28                     Pathogenic
3  NC_012920.1            <NA>           Pathogenic
4                    m.7510T>C           Pathogenic
5 NC_000023.11            <NA>           Pathogenic

What the result should look like:

    Transcript    Gene   Variant                                           ClinicalSignificance
1   NG_012236.2          g.11027del                                        Pathogenic
2   NM_018077.3   RBM28  c.1052T>C (p.Leu351Pro)                           Pathogenic
3   NC_012920.1          m.7445A>G                                         Pathogenic
4                        m.7510T>C                                         Pathogenic
5   NC_000023.11         g.(134493178_134493182)_(134501172_134501176)del  Pathogenic

I also tried applying each separator individually. That avoids shifting the data to the right, but it overwrites the remaining data instead.

If anyone could help, I would appreciate it!
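
For reference, one possible direction is to stop splitting on separators and instead capture the three pieces with a single pattern. The sketch below is only that, a sketch: it uses base R's sub() rather than separate(), the data frame name clinvar and the helper pick() are made up for illustration, and the pattern assumes that every accession looks like two capital letters, an underscore, digits, a dot and a version number, and that every variant starts with c., g., m., n. or p. Rows that do not fit the pattern are set to NA so they can be filtered out as garbage.

```r
# Example rows copied from the table in the question.
clinvar <- data.frame(
  Name = c(
    "NG_012236.2:g.11027del",
    "NM_018077.3(RBM28):c.1052T>C (p.Leu351Pro)",
    "NC_012920.1:m.7445A>G",
    "m.7510T>C",
    "NC_000023.11:g.(134493178_134493182)_(134501172_134501176)del"
  ),
  ClinicalSignificance = "Pathogenic",
  stringsAsFactors = FALSE
)

# Assumed layout: an optional "<accession>(<gene>):" prefix (the gene part is
# itself optional), followed by a variant that starts with c., g., m., n. or p.
# Group 1 = transcript/accession, group 2 = gene, group 3 = variant.
pattern <- "^(?:([A-Z]{2}_[0-9]+\\.[0-9]+)(?:\\(([^)]+)\\))?:)?([cgmnp]\\..*)$"

# Rows that do not fit the assumed layout are treated as garbage (NA).
matched <- grepl(pattern, clinvar$Name, perl = TRUE)

# Hypothetical helper: return capture group `group` for every row.
# An optional group that did not take part in the match comes back as "".
pick <- function(group) {
  out <- sub(pattern, paste0("\\", group), clinvar$Name, perl = TRUE)
  out[!matched] <- NA_character_
  out
}

clinvar$Transcript <- pick(1)
clinvar$Gene       <- pick(2)
clinvar$Variant    <- pick(3)

print(clinvar[, c("Transcript", "Gene", "Variant", "ClinicalSignificance")])
```

Because the whole string is matched at once, nothing is split a second time and nothing shifts to the right. The same pattern could probably be handed to tidyr::extract(), which fills one column per capture group, but I have not checked how it reports the optional groups that did not match.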

