Question
The following should be done in a Python code editor and should run in your computer's terminal.
1. Your task for this exercise is to generate an amino acid usage report with counts only, and in no particular order. Do this by opening the file below, stripping the lines without amino acid text, and using a dictionary to store each of the 21 amino acids used along with their count as the value. Your output should look like this:
T: 69645 G: 95475 V: 91683 Y: 36836 H: 29255 .....
2. Modify your script from #1 to display only the top 5 most frequently used amino acids and add their percentage use. The output should be like this:
L: 139002 (10.7%) A: 123885 (9.6%) G: 95475 (7.4%) V: 91683 (7.1%) I: 77836 (6.0%)
A small version of the text we were given is shown below:
>gi|170079664|ref|YP_001728984.1| thr operon leader peptide [Escherichia coli st
r. K-12 substr. DH10B]
MKRISTTITTTITITTGNGAG
>gi|170079665|ref|YP_001728985.1| bifunctional aspartokinase I/homeserine dehydr
ogenase I [Escherichia coli str. K-12 substr. DH10B]
MRVLKFGGTSVANAERFLRVADILESNARQGQVATVLSAPAKITNHLVAMIEKTISGQDALPNISDAERI
FAELLTGLAAAQPGFPLAQLKTFVDQEFAQIKHVLHGISLLGQCPDSINAALICRGEKMSIAIMAGVLEA
RGHNVTVIDPVEKLLAVGHYLESTVDIAESTRRIAASRIPADHMVLMAGFTAGNEKGELVVLGRNGSDYS
AAVLAACLRADCCEIWTDVDGVYTCDPRQVPDARLLKSMSYQEAMELSYFGAKVLHPRTITPIAQFQIPC
LIKNTGNPQAPGTLIGASRDEDELPVKGISNLNNMAMFSVSGPGMKGMVGMAARVFAAMSRARISVVLIT
QSSSEYSISFCVPQSDCVRAERAMQEEFYLELKEGLLEPLAVTERLAIISVVGDGMRTLRGISAKFFAAL
ARANINIVAIAQGSSERSISVVVNNDDATTGVRVTHQMLFNTDQVIEVFVIGVGGVGGALLEQLKRQQSW
LKNKHIDLRVCGVANSKALLTNVHGLNLENWQEELAQAKEPFNLGRLIRLVKEYHLLNPVIVDCTSSQAV
ADQYADFLREGFHVVTPNKKANTSSMDYYHQLRYAAEKSRRKFLYDTNVGAGLPVIENLQNLLNAGDELM
KFSGILSGSLSYIFGKLDEGMSFSEATTLAREMGYTEPDPRDDLSGMDVARKLLILARETGRELELADIE
IEPVLPAEFNAEGDVAAFMANLSQLDDLFAARVAKARDEGKVLRYVGNIDEDGVCRVKIAEVDGNDPLFK
VKNGENALAFYSHYYQPLPLVLRGYGAGNDVTAAGVFADLLRTLSWKLGV
Explanation / Answer
1) Developing a program for this task involves the following steps:
a) Declare a list of the 21 one-letter amino acid codes
b) Declare a dictionary to store the usage count of each amino acid
c) Open the given file and read it line by line
d) For each line read, add the number of occurrences of each amino acid to its count. Lines starting with ">gi" are skipped because they are headers and do not contain amino acid text.
e) Finally, print the amino acid usage report for #1.
The Python program for generating the amino acid usage report is given below.
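As a small illustration of steps b) to d), the sketch below runs the counting logic on two lines held in memory instead of a file. The list name sample_lines and the shortened header text are stand-ins chosen for this demo, and only three codes are counted to keep it short.

```python
# Hypothetical in-memory stand-in for the input file: one header line
# (shortened) and one sequence line taken from the sample in the question.
sample_lines = [
    ">gi|170079664|ref|YP_001728984.1| thr operon leader peptide",
    "MKRISTTITTTITITTGNGAG",
]

counts = {}
for line in sample_lines:
    # Header lines carry no sequence data, so skip them (step d).
    if line.startswith(">gi"):
        continue
    # Count only three codes here to keep the demo short.
    for acid in "TIG":
        counts[acid] = counts.get(acid, 0) + line.count(acid)

print(counts)
```

The header line is skipped entirely, so only the 21-residue leader peptide contributes to the counts.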
File Name: generateAminoAcidReport-1.py
import sys

def main(args):
    if len(args) < 2:
        print('Usage: generateAminoAcidReport-1.py <file>')
        return

    # One-letter amino acid codes. Each code must appear only once,
    # because a repeated entry would be counted twice. 'U' (selenocysteine)
    # is assumed to be the 21st code; change it if your file uses another.
    aminoAcids = ['A', 'C', 'D', 'E', 'F',
                  'G', 'H', 'I', 'K', 'L',
                  'M', 'N', 'P', 'Q', 'R',
                  'S', 'T', 'V', 'W', 'Y',
                  'U']

    # Initialize amino acid usage counts to zero
    aminoAcidUsageCounts = {}
    for acid in aminoAcids:
        aminoAcidUsageCounts[acid] = 0

    # Compute amino acid usage counts
    with open(args[1]) as inputFile:
        for line in inputFile:
            # Skip lines starting with ">gi"
            if line.startswith(">gi"):
                continue
            for acid in aminoAcids:
                aminoAcidUsageCounts[acid] += line.count(acid)

    # Generate amino acid usage report
    report = ""
    for acid in aminoAcids:
        report = report + acid + ":" + str(aminoAcidUsageCounts[acid]) + " "
    print(report)

if __name__ == '__main__':
    main(sys.argv)
The script can be run as follows:
python generateAminoAcidReport-1.py inputFile.txt
Sample output is shown below:
A:92 C:12 D:46 E:54 F:30 G:66 H:16 I:51 K:37 L:89 M:24 N:39 P:29 Q:30 R:47 S:52 T:42 V:69 W:4 Y:20 U:0
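Since the exercise asks for the amino acids "used" in the file, another option is to let the data decide which letters exist instead of declaring them up front. The sketch below does this with collections.Counter from the standard library. The function name count_amino_acids and the demo lines are made up for illustration, and it assumes every header line starts with ">".

```python
from collections import Counter

def count_amino_acids(lines):
    """Count every character found on non-header lines (hypothetical helper)."""
    counts = Counter()
    for line in lines:
        # Assumption: header lines start with ">" and hold no sequence data.
        if line.startswith(">"):
            continue
        # strip() drops the trailing newline so it is not counted as a letter.
        counts.update(line.strip())
    return counts

# Demo input: a made-up header plus one sequence line from the sample.
demo = [">gi|demo| header text", "MKRISTTITTTITITTGNGAG\n"]
counts = count_amino_acids(demo)
print(" ".join("%s:%d" % (acid, n) for acid, n in counts.items()))
```

This version counts any character that appears on a non-header line, so stray text in the input would show up as extra keys in the report.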
2) Developing a program for this task involves the following additional steps:
Steps a) to d) are the same as above.
e) Compute the total usage count
f) Generate the report by iterating over only the five amino acids with the highest counts. Print the count along with its percentage of the total for each of them.
The Python program for generating the top-five usage report is given below.
File Name: generateAminoAcidReport-2.py
import sys
import operator

def main(args):
    if len(args) < 2:
        print('Usage: generateAminoAcidReport-2.py <file>')
        return

    # One-letter amino acid codes. Each code must appear only once,
    # because a repeated entry would be counted twice. 'U' (selenocysteine)
    # is assumed to be the 21st code; change it if your file uses another.
    aminoAcids = ['A', 'C', 'D', 'E', 'F',
                  'G', 'H', 'I', 'K', 'L',
                  'M', 'N', 'P', 'Q', 'R',
                  'S', 'T', 'V', 'W', 'Y',
                  'U']

    # Initialize amino acid usage counts to zero
    aminoAcidUsageCounts = {}
    for acid in aminoAcids:
        aminoAcidUsageCounts[acid] = 0

    # Compute amino acid usage counts
    with open(args[1]) as inputFile:
        for line in inputFile:
            # Skip lines starting with ">gi"
            if line.startswith(">gi"):
                continue
            for acid in aminoAcids:
                aminoAcidUsageCounts[acid] += line.count(acid)

    # Compute total usage count
    totalUsageCount = sum(aminoAcidUsageCounts.values())

    # Generate top five amino acid usage counts with percentages
    report = ""
    topFive = sorted(aminoAcidUsageCounts.items(),
                     key=operator.itemgetter(1), reverse=True)[:5]
    for acid, count in topFive:
        percent = count * 100.0 / totalUsageCount
        report = report + acid + ": " + str(count) + " (" + "%.1f" % percent + "%) "
    print(report)

if __name__ == '__main__':
    main(sys.argv)
The script can be run as follows:
python generateAminoAcidReport-2.py inputFile.txt
Sample output is shown below:
A: 92 (10.8%) L: 89 (10.5%) V: 69 (8.1%) G: 66 (7.8%) E: 54 (6.4%)
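If the counts are kept in a collections.Counter, the sorting step can be replaced by its most_common method. The sketch below shows the same top-five formatting on one short sequence from the sample. The function name top_five_report and the demo input are made up for illustration.

```python
from collections import Counter

def top_five_report(counts):
    """Format the five largest counts with their share of the total."""
    total = sum(counts.values())
    parts = []
    # most_common(5) returns (letter, count) pairs, largest count first.
    for acid, n in counts.most_common(5):
        parts.append("%s: %d (%.1f%%)" % (acid, n, n * 100.0 / total))
    return " ".join(parts)

# Demo input: the 21-residue leader peptide from the sample.
demo_counts = Counter("MKRISTTITTTITITTGNGAG")
report = top_five_report(demo_counts)
print(report)
```

Each percentage is taken against the total over all counted letters, not just the top five, which matches step e) above.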