I've been tasked with calculating the GC content of a FASTA file using base R (no packages). My problem is that I don't know how to programmatically iterate through each sequence while storing the sequence name along with the number of Cs and Gs.
Example FASTA file I can read in (as a .txt file):
>T7_promoter
ATTAGACGAG
>T3_promoter
TTTGCGCGAAATTTTTTTTT
*These lines are not quoted text; the > marks the start of a distinct sequence.
My output should look conceptually like this:
T7_promoter: 0.4 (GC ratio: number of Gs and Cs divided by the sequence length)
T3_promoter: 0.25
Any and all help is much appreciated. I am currently using readLines() to read the file in. I tried using unlist(strsplit()) on each element that strsplit() naturally produces, to store each sequence as an element in a list. Then I could iterate through each element to do the calculations, but my attempts have not been successful.
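
For reference, the approach described above could be sketched roughly as follows in base R. This is only a minimal sketch under some assumptions: gc_content() is a hypothetical helper name (not an existing function), the input starts with a > header line, every header is followed by at least one sequence line, and the sequences contain only nucleotide letters. The character vector below stands in for what readLines() would return for the example file.

```r
# Hypothetical helper (illustrative name, not part of base R).
# Assumes: first line is a ">" header, and every header has >= 1 sequence line.
gc_content <- function(lines) {
  lines <- trimws(lines)
  lines <- lines[nzchar(lines)]              # drop blank lines

  is_header <- startsWith(lines, ">")
  record    <- cumsum(is_header)             # record index for every line

  # Group the sequence lines by record and join them, so sequences that
  # wrap over several lines are handled too.
  seq_lines <- split(lines[!is_header], record[!is_header])
  seqs      <- vapply(seq_lines, paste, character(1), collapse = "")

  # Split each sequence into single characters and compute the G/C fraction.
  chars <- strsplit(toupper(seqs), "")
  gc    <- vapply(chars,
                  function(x) sum(x %in% c("G", "C")) / length(x),
                  numeric(1))

  names(gc) <- sub("^>", "", lines[is_header])  # header text without ">"
  gc
}

# Stand-in for readLines() on the example file from the question.
fasta <- c(">T7_promoter", "ATTAGACGAG",
           ">T3_promoter", "TTTGCGCGAAATTTTTTTTT")

res <- gc_content(fasta)
print(res)  # T7_promoter is 0.4, T3_promoter is 0.25
```

With a real file, the call would presumably be something like gc_content(readLines("promoters.txt")), where the file name is made up for illustration. The cumsum() on the header flags is what ties each sequence line back to its header, which avoids an explicit loop.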