Question
c++
For this lab, you will practice while loops, if statements, reading from an input file, writing to an output
file, and string operations. You are given a file containing protein sequences and tasked with finding
motifs. Motifs are certain patterns of amino acids that appear many times in protein sequences and
act as indicators or markers for special regions, genes, mutations, etc. An example of what this file will
look like is below:
>ENSG00000035141|ENST00000037869
MAELQQLRVQEAVESMVKSLERENIRKMQGLMFRCSASCCEDSQASMKQVHQCIERCHVP
LAQAQALVTSELEKFQDRLARCTMHCNDKAKDSIDAGSKELQVKQQLDSCVTKCVDDHMH
LIPTMTKKMKEALLSIGK*
>ENSG00000003137|ENST00000001146
MLFEGLDLVSALATLAACLVSVTLLLAVSQQLWQLRWAATRDKSCKLPIPKGSMGFPLIG
ETGHWLLQGSGFQSSRREKYGNVFKTHLLGRPLIRVTGAENVRKILMGEHHLVSTEWPRS
TRMLLGPNTVSNSIGDIHRNKRKVFSKIFSHEALESYLPKIQLVIQDTLRAWSSHPEAIN
VYQEAQKLTFRMAIRVLLGFSIPEEDLGHLFEVYQQFVDNVFSLPVDLPFSGYRRGIQAR
QILQKGLEKAIREKLQCTQGKDYLDALDLLIESSKEHGKEMTMQELKDGTLELIFAAYAT
TASASTSLIMQLLKHPTVLEKLRDELRAHGILHSGGCPCEGTLRLDTLSGLRYLDCVIKE
VMRLFTPISGGYRTVLQTFELDGFQIPKGWSVMYSIRDTHDTAPVFKDVNVFDPDRFSQA
RSEDKDGRFHYLPFGGGVRTCLGKHLAKLFLKVLAVELASTSRFELATRTFPRITLVPVL
HPVDGLSVKFFGLDSNQNEILPETEAMLSATV*
These lines:
>ENSG00000035141|ENST00000037869
>ENSG00000003137|ENST00000001146
are the sequence names. They start with '>'. What follows each name is the protein sequence itself.
Your task, should you choose to accept it, is to read through the given input file and:
- If the motif is present in a line of the sequence, output the sequence name to a file called
  motifSequences.txt, followed by the amino acids before the motif, a space, and the amino acids
  after the motif.
- Count the number of motifs that exist in the file
- Count the number of protein sequences
- Count the number of amino acid lines without motifs
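The string operations behind the first task can be sketched as follows. This is a minimal sketch, not the lab's reference solution: the helper names splitAroundMotif and isSequenceName are hypothetical, and it assumes that only the first occurrence of the motif in a line matters, which the handout does not state explicitly.

```cpp
#include <cassert>
#include <string>

// Hypothetical helper: splits `line` around the FIRST occurrence of `motif`
// (assumption: later occurrences on the same line are ignored).
// Returns true and fills `before`/`after` when the motif is found.
bool splitAroundMotif(const std::string& line, const std::string& motif,
                      std::string& before, std::string& after) {
    assert(!motif.empty());  // an empty motif would match everywhere
    const std::string::size_type pos = line.find(motif);
    if (pos == std::string::npos) {
        return false;  // motif not present; outputs are left untouched
    }
    before = line.substr(0, pos);            // amino acids before the motif
    after = line.substr(pos + motif.size()); // amino acids after the motif
    return true;
}

// Sequence names are the lines that start with '>'.
bool isSequenceName(const std::string& line) {
    return !line.empty() && line[0] == '>';
}
```

For a line such as "MAESLRKK" and the motif "SLR", this yields "MAE" before and "KK" after, which would be written to the output file as "MAE KK".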
Notes:
- This lab will involve creating an input file stream
- This lab will involve creating one output file stream for the proteins with the motif
- The only output to the screen will be the number of motifs that exist in the file, the number of
  protein sequences, and the number of amino acid lines without motifs
- Because the input file is very large (1.4 MB, 25003 lines), do not copy the input file to your C
  account. Instead, use this section of code in your program to open the input file stream:
ifstream proteinFile;
proteinFile.open("/nfshome/mw3n/human_aa_chr2_partial.txt");
- Use "SLR" for the motif to search for
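The two streams described in the notes could be opened and checked along these lines. This is only a sketch: the helper openLabStreams and the stream name motifFile are hypothetical and not part of the handout, and the check with is_open() is an addition the lab does not require.

```cpp
#include <cassert>
#include <fstream>
#include <string>

// Hypothetical helper: opens the input stream first and only creates the
// output file if the input opened successfully.
bool openLabStreams(const std::string& inputPath, const std::string& outputPath,
                    std::ifstream& proteinFile, std::ofstream& motifFile) {
    assert(!inputPath.empty() && !outputPath.empty());
    proteinFile.open(inputPath);
    if (!proteinFile.is_open()) {
        return false;  // nothing is written if the input is missing
    }
    motifFile.open(outputPath);
    return motifFile.is_open();
}
```

In the lab setting it would be called with "/nfshome/mw3n/human_aa_chr2_partial.txt" as the input path and "motifSequences.txt" as the output path, exiting with an error message if it returns false.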
Sample output:
To the terminal:
Total number of sequences: 2063
Total number of lines with motifs: 534
Total number of lines without motifs: 22406
First 10 lines in motifSequences.txt:
>ENSG00000040933|ENST00000074304
PPVTRSVDTVNGRMVLPVDESLTEALGIRSKYA KDTLLKSVFGGAICRMYRFPTTDG
>ENSG00000040933|ENST00000074304
NHLRILEQMAESVLSLHVPRQFVKLLLEEDAARVCELEELGELSPCWE RQIVTQYQT
>ENSG00000072080|ENST00000168148
MISRMEKMTMMMKILIMFALGMNYWSCSGFPVYDYDPS DALSASVVKVNSQSLSPYL
>ENSG00000015568|ENST00000016946
IIDDGDSNLSVVKKLPVPLESVKQMLNSVMQELEDYSEGGPLYKNG NADSEIKHSTP
>ENSG00000183091|ENST00000172853
SVRGKVAPTTKTVDLDRALHAYKLQSSNLYKT TLPTGYRLPGDTPHFKHIKDTRYMS
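The sample numbers add up to the stated file length (2063 + 534 + 22406 = 25003), which suggests that every line is counted exactly once: as a sequence name, a line with the motif, or a line without it. Under that reading, the whole pass can be sketched as below. The names MotifCounts, scanProteins, and formatCounts are hypothetical; skipping blank lines and splitting at the first occurrence of the motif are assumptions. The function takes generic streams so it can be tried on a small in-memory sample before running it on the large file.

```cpp
#include <cassert>
#include <iostream>
#include <sstream>
#include <string>

// Hypothetical container for the three totals printed to the terminal.
struct MotifCounts {
    int sequences = 0;
    int linesWithMotif = 0;
    int linesWithoutMotif = 0;
};

// Reads every line from `in`. For each amino acid line containing `motif`,
// writes the current sequence name and the split line to `out`.
MotifCounts scanProteins(std::istream& in, std::ostream& out,
                         const std::string& motif) {
    assert(!motif.empty());
    MotifCounts counts;
    std::string line;
    std::string currentName;  // most recent line that started with '>'
    while (std::getline(in, line)) {
        if (line.empty()) {
            continue;  // assumption: blank lines are not counted
        }
        if (line[0] == '>') {
            currentName = line;
            ++counts.sequences;
            continue;
        }
        const std::string::size_type pos = line.find(motif);
        if (pos != std::string::npos) {
            ++counts.linesWithMotif;
            out << currentName << '\n'
                << line.substr(0, pos) << ' '
                << line.substr(pos + motif.size()) << '\n';
        } else {
            ++counts.linesWithoutMotif;
        }
    }
    return counts;
}

// Builds the three-line terminal report in the format of the sample output.
std::string formatCounts(const MotifCounts& c) {
    std::ostringstream report;
    report << "Total number of sequences: " << c.sequences << '\n'
           << "Total number of lines with motifs: " << c.linesWithMotif << '\n'
           << "Total number of lines without motifs: " << c.linesWithoutMotif << '\n';
    return report.str();
}
```

A main function would pass the opened ifstream and ofstream to scanProteins with "SLR" as the motif and then send the result of formatCounts to cout. Because the sequence name is written once per matching line, a name can appear more than once in motifSequences.txt, as in the sample above.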