I do not have much experience with regular expressions (regex), but I need to parse hundreds of file names to generate a 'metadata' data set. I have been able to generate text files that include the file paths and the file names. It is simple for me to parse out the complete file name, but I also need to parse out the "sample ID" from the file name.
The issue is that the syntax of the "sample IDs" is all over the place (see the attached CSV for example data; the goal is to go from the 'sample' column to the 'ID' column). I have tried a series of strsplit() commands, but this is very cumbersome and is not functional in nature. I have also tried writing a function with a number of IF statements based on the syntax structure. I feel this is still not a good solution, because it depends on me manually identifying the different syntaxes beforehand, and I could easily miss something since I have to do this by eye.
It seems to me that this is a regex problem, but I could use some resources to help me get started. I would like to be able to do this in either R or Python if possible. Thank you for any resources or packages/modules that may be useful.
dput(head(brain_ref, 25))
structure(list(file = c("/data/rn6/quantitation/brainTotalRNA/RI/batch1/ensembl_v96/BXH12_1_brain_total_RNA_cDNA_GTCCGC.genes.results",
"/data/rn6/quantitation/brainTotalRNA/RI/batch1/ensembl_v96/BXH12_2_brain_total_RNA_cDNA_CAGATC.genes.results",
"/data/rn6/quantitation/brainTotalRNA/RI/batch1/ensembl_v96/HXB13_1_brain_total_RNA_cDNA_ATGTCA.genes.results",
"/data/rn6/quantitation/brainTotalRNA/RI/batch1/ensembl_v96/HXB13_2_brain_total_RNA_cDNA_GTGAAA.genes.results",
"/data/rn6/quantitation/brainTotalRNA/RI/batch1/ensembl_v96/HXB17_1_brain_total_RNA_cDNA_CCGTCC.genes.results",
"/data/rn6/quantitation/brainTotalRNA/RI/batch1/ensembl_v96/HXB17_2_brain_total_RNA_cDNA_ATGTCA.genes.results",
"/data/rn6/quantitation/brainTotalRNA/RI/batch1/ensembl_v96/HXB2_1_brain_total_RNA_cDNA_GTCCGC.genes.results",
"/data/rn6/quantitation/brainTotalRNA/RI/batch1/ensembl_v96/HXB2_2_brain_total_RNA_cDNA_CTTGTA.genes.results",
"/data/rn6/quantitation/brainTotalRNA/RI/batch1/ensembl_v96/HXB25_1_brain_total_RNA_cDNA_AGTTCC.genes.results",
"/data/rn6/quantitation/brainTotalRNA/RI/batch1/ensembl_v96/HXB25_2_brain_total_RNA_cDNA_AGTCAA.genes.results",
"/data/rn6/quantitation/brainTotalRNA/RI/batch1/ensembl_v96/HXB27_1_brain_total_RNA_cDNA_CGATGT.genes.results",
"/data/rn6/quantitation/brainTotalRNA/RI/batch1/ensembl_v96/HXB27_2_brain_total_RNA_cDNA_AGTTCC.genes.results",
"/data/rn6/quantitation/brainTotalRNA/RI/batch1/ensembl_v96/HXB7_1_brain_total_RNA_cDNA_ACAGTG.genes.results",
"/data/rn6/quantitation/brainTotalRNA/RI/batch1/ensembl_v96/HXB7_2_brain_total_RNA_cDNA_AGTCAA.genes.results",
"/data/rn6/quantitation/brainTotalRNA/RI/batch1/ensembl_v96/SHR_1_brain_total_RNA_cDNA_GCCAAT.genes.results",
"/data/rn6/quantitation/brainTotalRNA/RI/batch1/ensembl_v96/SHR_2_brain_total_RNA_cDNA_TGACCA.genes.results",
"/data/rn6/quantitation/brainTotalRNA/RI/batch10/ensembl_v96/ACI-SegHsd-2-brain-total-RNA_S17.genes.results",
"/data/rn6/quantitation/brainTotalRNA/RI/batch10/ensembl_v96/BXH2-3-brain-total-RNA_S4.genes.results",
"/data/rn6/quantitation/brainTotalRNA/RI/batch10/ensembl_v96/BXH5-3-brain-total-RNA_S3.genes.results",
"/data/rn6/quantitation/brainTotalRNA/RI/batch10/ensembl_v96/BXH8-3-brain-total-RNA_S5.genes.results",
"/data/rn6/quantitation/brainTotalRNA/RI/batch10/ensembl_v96/Cop-CrCrl-2-brain-total-RNA_S10.genes.results",
"/data/rn6/quantitation/brainTotalRNA/RI/batch10/ensembl_v96/Dark-Agouti-1-brain-total-RNA_S16.genes.results",
"/data/rn6/quantitation/brainTotalRNA/RI/batch10/ensembl_v96/Dark-Agouti-2-brain-total-RNA_S13.genes.results",
"/data/rn6/quantitation/brainTotalRNA/RI/batch10/ensembl_v96/F344-NCI-1-brain-total-RNA_S18.genes.results",
"/data/rn6/quantitation/brainTotalRNA/RI/batch10/ensembl_v96/F344-NCI-2-brain-total-RNA_S15.genes.results"
), sample = c("BXH12_1_brain_total_RNA_cDNA_GTCCGC", "BXH12_2_brain_total_RNA_cDNA_CAGATC",
"HXB13_1_brain_total_RNA_cDNA_ATGTCA", "HXB13_2_brain_total_RNA_cDNA_GTGAAA",
"HXB17_1_brain_total_RNA_cDNA_CCGTCC", "HXB17_2_brain_total_RNA_cDNA_ATGTCA",
"HXB2_1_brain_total_RNA_cDNA_GTCCGC", "HXB2_2_brain_total_RNA_cDNA_CTTGTA",
"HXB25_1_brain_total_RNA_cDNA_AGTTCC", "HXB25_2_brain_total_RNA_cDNA_AGTCAA",
"HXB27_1_brain_total_RNA_cDNA_CGATGT", "HXB27_2_brain_total_RNA_cDNA_AGTTCC",
"HXB7_1_brain_total_RNA_cDNA_ACAGTG", "HXB7_2_brain_total_RNA_cDNA_AGTCAA",
"SHR_1_brain_total_RNA_cDNA_GCCAAT", "SHR_2_brain_total_RNA_cDNA_TGACCA",
"ACI-SegHsd-2-brain-total-RNA_S17", "BXH2-3-brain-total-RNA_S4",
"BXH5-3-brain-total-RNA_S3", "BXH8-3-brain-total-RNA_S5", "Cop-CrCrl-2-brain-total-RNA_S10",
"Dark-Agouti-1-brain-total-RNA_S16", "Dark-Agouti-2-brain-total-RNA_S13",
"F344-NCI-1-brain-total-RNA_S18", "F344-NCI-2-brain-total-RNA_S15"
), batch = c("batch1", "batch1", "batch1", "batch1", "batch1",
"batch1", "batch1", "batch1", "batch1", "batch1", "batch1", "batch1",
"batch1", "batch1", "batch1", "batch1", "batch10", "batch10",
"batch10", "batch10", "batch10", "batch10", "batch10", "batch10",
"batch10"), ID = c("BXH12_1", "BXH12_2", "HXB13_1", "HXB13_2",
"HXB17_1", "HXB17_2", "HXB2_1", "HXB2_2", "HXB25_1", "HXB25_2",
"HXB27_1", "HXB27_2", "HXB7_1", "HXB7_2", "SHR_1", "SHR_2", "ACI-SegHsd_2",
"BXH2_3", "BXH5_3", "BXH8_3", "Cop-CrCrl_2", "Dark-Agouti_1",
"Dark-Agouti_2", "F344-NCI_1", "F344-NCI_2")), row.names = c(NA,
25L), class = "data.frame")
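
For reference, here is a minimal Python sketch of the kind of regex that seems to fit the 25 rows above. It rests on assumptions inferred only from this sample: every name starts with a strain, followed by a replicate number, followed by the word "brain", with either "-" or "_" as the separator. The names `SAMPLE_ID` and `extract_id` are made up for illustration. Names that do not fit come back as `None`, so they can be listed and checked instead of being missed by eye.

```python
import re

# Assumed layout (inferred from the example rows only):
#   <strain><sep><replicate><sep>brain...   where <sep> is "-" or "_"
# The lazy ".+?" lets the strain itself contain "-" (e.g. "Dark-Agouti").
SAMPLE_ID = re.compile(r"^(?P<strain>.+?)[-_](?P<rep>\d+)[-_]brain", re.IGNORECASE)


def extract_id(sample):
    """Return '<strain>_<replicate>', or None if the name does not fit."""
    match = SAMPLE_ID.match(sample)
    if match is None:
        return None
    return f"{match.group('strain')}_{match.group('rep')}"


# A few values copied from the 'sample' column above.
samples = [
    "BXH12_1_brain_total_RNA_cDNA_GTCCGC",
    "HXB2_2_brain_total_RNA_cDNA_CTTGTA",
    "SHR_1_brain_total_RNA_cDNA_GCCAAT",
    "ACI-SegHsd-2-brain-total-RNA_S17",
    "Dark-Agouti-1-brain-total-RNA_S16",
    "F344-NCI-2-brain-total-RNA_S15",
]

ids = {s: extract_id(s) for s in samples}

# Anything the pattern could not handle is collected for manual review.
unmatched = [s for s, sample_id in ids.items() if sample_id is None]

for s, sample_id in ids.items():
    print(s, "->", sample_id)
print("unmatched:", unmatched)
```

The anchor on "brain" is specific to this data set; other tissues or naming schemes would need a different pattern. The same expression should carry over to R, for example via `sub()` with backreferences, though that is not shown here.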