Please help me solve this! Part 3 - Find best matching genome for a given sequen
Question
Part 3 - Find best matching genome for a given sequence

We have a random DNA sequence, and we want to find the closest species to it. Is the DNA sequence more similar to human, mouse, or unknown?

When could this kind of comparison be useful? Suppose that the emergency room of some hospital sees a sudden and drastic increase in patients presenting with a particular set of symptoms. Doctors determine the cause to be bacterial, but without knowing the specific species involved they are unable to treat patients effectively. One way of identifying the cause is to obtain a DNA sample and compare it against known bacterial genomes. With a set of similarity scores, doctors can then make more informed decisions regarding treatment, prevention, and tracking of the disease.

The goal of this part of the assignment is to write functions that can be useful to determine the identity of different species of bacteria, animals, etc. By simply using the similarity score routine you implemented, you can compare an unknown sequence to different genomes and figure out the identity of the unknown sample.

float findBestMatch(string genome, string seq)

The findBestMatch function should take two string arguments and return a floating point value of the highest similarity score found for the given sequence at any position within the genome. In other words, this function should traverse the entire genome and find the highest similarity score by using similarityScore() for the comparisons between seq and each sequential substring of genome.

Hint: this function is very similar in structure to the countMatches function.

int findBestGenome(string genome1, string genome2, string genome3, string seq)

The findBestGenome function should take four string arguments (unknown sequence, mouse_genome, human_genome and unknown_genome). Return an integer indicating which genome string, out of the 3 given, had the highest similarity score with the given sequence. For each genome, the function will find the highest similarity score of the sequence (at any position) within that genome (call function findBestMatch described above). The return value from this function will indicate which genome had the best match: 1, 2, or 3. In the case that two or more of the sequences have the same best similarity score, return 0.

COG will grade Part 3 based on both the value returned from findBestGenome and findBestMatch.

Note: DNA sequences for human, mouse and unknown genomes will be uploaded as a file on Moodle with this assignment for testing purposes.

Explanation / Answer
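Before the full listing, here is a minimal sketch of the sliding-window idea from the problem statement on a tiny input. It assumes similarityScore is the fraction of positions at which two equal-length strings agree (the assignment's own definition is not shown here, so treat that as an assumption), and the helper name windowScores is hypothetical, introduced only for illustration.

```cpp
#include <string>
#include <vector>

// Assumed definition: fraction of positions at which two equal-length
// strings agree; 0 when the lengths differ or the strings are empty.
float similarityScore(const std::string& a, const std::string& b) {
    if (a.length() != b.length() || a.empty()) return 0.0f;
    int matches = 0;
    for (std::size_t i = 0; i < a.length(); i++) {
        if (a[i] == b[i]) matches++;
    }
    return static_cast<float>(matches) / static_cast<float>(a.length());
}

// Hypothetical helper: the score of seq against every full-length
// window of genome, in order of starting position.
std::vector<float> windowScores(const std::string& genome, const std::string& seq) {
    std::vector<float> scores;
    if (seq.empty() || seq.length() > genome.length()) return scores;
    for (std::size_t pos = 0; pos + seq.length() <= genome.length(); pos++) {
        scores.push_back(similarityScore(genome.substr(pos, seq.length()), seq));
    }
    return scores;
}
```

For genome "CGATATGA" and seq "CAATC" there are four windows (CGATA, GATAT, ATATG, TATGA) with 3, 1, 2, and 1 matching positions, so the scores are 0.6, 0.2, 0.4, and 0.2. findBestMatch should return the largest of these, 0.6.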
main.cpp
----------------------------------------------------------------------
#include <iostream>
#include "genome.cpp"
using namespace std;
int main() {
string humanDNA = "CGCAAATTTGCCGGATTTCCTTTGCTGTTCCTGCATGTAGTTTAAACGAGATTGCCAGCACCGGGTATCATTCACCATTTTTCTTTTCGTTAACTTGCCGTCAGCCTTTTCTTTGACCTCTTCTTTCTGTTCATGTGTATTTGCTGTCTCTTAGCCCAGACTTCCCGTGTCCTTTCCACCGGGCCTTTGAGAGGTCACAGGGTCTTGATGCTGTGGTCTTCATCTGCAGGTGTCTGACTTCCAGCAACTGCTGGCCTGTGCCAGGGTGCAGCTGAGCACTGGAGTGGAGTTTTCCTGTGGAGAGGAGCCATGCCTAGAGTGGGATGGGCCATTGTTCATG";
string mouseDNA = "CGCAATTTTTACTTAATTCTTTTTCTTTTAATTCATATATTTTTAATATGTTTACTATTAATGGTTATCATTCACCATTTAACTATTTGTTATTTTGACGTCATTTTTTTCTATTTCCTCTTTTTTCAATTCATGTTTATTTTCTGTATTTTTGTTAAGTTTTCACAAGTCTAATATAATTGTCCTTTGAGAGGTTATTTGGTCTATATTTTTTTTTCTTCATCTGTATTTTTATGATTTCATTTAATTGATTTTCATTGACAGGGTTCTGCTGTGTTCTGGATTGTATTTTTCTTGTGGAGAGGAACTATTTCTTGAGTGGGATGTACCTTTGTTCTTG";
string unknownDNA = "CGCATTTTTGCCGGTTTTCCTTTGCTGTTTATTCATTTATTTTAAACGATATTTATATCATCGGGTTTCATTCACTATTTTTCTTTTCGATAAATTTTTGTCAGCATTTTCTTTTACCTCTTCTTTCTGTTTATGTTAATTTTCTGTTTCTTAACCCAGTCTTCTCGATTCTTATCTACCGGACCTATTATAGGTCACAGGGTCTTGATGCTTTGGTTTTCATCTGCAAGAGTCTGACTTCCTGCTAATGCTGTTCTGTGTCAGGGTGCATCTGAGCACTGATGTGGAGTTTTCTTGTGGATATGAGCCATTCATAGTGTGGGATGTGCCATAGTTCATG";
cout << similarityScore(humanDNA, mouseDNA) << endl;
cout << countMatches(humanDNA, "GCC", 0.5) << endl;
cout << findBestMatch(mouseDNA, "ACT") << endl;
cout << findBestGenome("GGAACACA", "CGATATGA", "GGAGTA", "CAATC") << endl;
}
--------------------------------------------------------------------------------
genome.cpp
-------------------------------------------------------------
#include <iostream>
#include <string>
using namespace std;
/*
Compares two strings together and calculates their similarity
Parameters: first sequence, second sequence
Returns: similarity score for the two sequences
*/
float similarityScore(string sequence1, string sequence2) {
    // Sequences of different lengths cannot be compared position by position.
    // The empty check avoids a division by zero below.
    if (sequence1.length() != sequence2.length() || sequence1.empty()) {
        return 0;
    }
    int matches = 0;
    for (size_t i = 0; i < sequence1.length(); i++) {
        if (sequence1[i] == sequence2[i]) {
            matches++;
        }
    }
    // Fraction of positions at which the two sequences agree.
    return static_cast<float>(matches) / sequence1.length();
}
/*
Counts all matches of a sequence in a genome that reach a minimum similarity score
Parameters: genome, sequence, minScore
Returns: all matches within the min score boundary
*/
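int countMatches(string genome, string sequence, float minScore) {
    int matches = 0;
    size_t sequenceLength = sequence.length();
    // No full-length window exists if the sequence is empty or longer than
    // the genome. This check also prevents unsigned underflow in the loop bound.
    if (sequenceLength == 0 || sequenceLength > genome.length()) {
        return 0;
    }
    for (size_t pos = 0; pos + sequenceLength <= genome.length(); pos++) {
        float score = similarityScore(genome.substr(pos, sequenceLength), sequence);
        if (score >= minScore) {
            matches++;
        }
    }
    return matches;
}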
/*
Finds the best match in terms of similarity for a genome and a sequence
Parameters: genome, sequence
Returns: highest similarity score found at any position in the genome
*/
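float findBestMatch(string genome, string seq) {
    float bestMatch = 0.0;
    size_t seqLength = seq.length();
    // Without at least one full-length window there is nothing to compare.
    if (seqLength == 0 || seqLength > genome.length()) {
        return bestMatch;
    }
    // Slide seq across the genome, stopping at the last full-length window.
    for (size_t pos = 0; pos + seqLength <= genome.length(); pos++) {
        float score = similarityScore(genome.substr(pos, seqLength), seq);
        if (score > bestMatch) {
            bestMatch = score;
        }
    }
    return bestMatch;
}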
/*
Compares three genomes and returns the index number of the one with the best match for a sequence
Parameters: first genome, second genome, third genome, sequence
Returns: 1, 2, or 3 for the genome with the single best score; 0 if the best score is tied
*/
int findBestGenome(string genome1, string genome2, string genome3, string seq) {
    float genome1Score = findBestMatch(genome1, seq);
    float genome2Score = findBestMatch(genome2, seq);
    float genome3Score = findBestMatch(genome3, seq);
    if (genome1Score > genome2Score && genome1Score > genome3Score) {
        return 1;
    }
    else if (genome2Score > genome1Score && genome2Score > genome3Score) {
        return 2;
    }
    else if (genome3Score > genome1Score && genome3Score > genome2Score) {
        return 3;
    }
    else {
        // Two or more genomes share the best score.
        return 0;
    }
}
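
As a cross-check on the listing above, the following is a sketch of an alternative that compares characters in place instead of building a substring for every window. It is not part of the assignment: the names findBestMatchInPlace and findBestGenomeInPlace are hypothetical, and the sketch assumes the same scoring rule (matching positions divided by sequence length) and the same tie rule as findBestGenome.

```cpp
#include <string>

// Sketch: should give the same score as findBestMatch, but counts matching
// characters directly in the genome rather than calling substr per window.
float findBestMatchInPlace(const std::string& genome, const std::string& seq) {
    const std::size_t n = genome.length();
    const std::size_t m = seq.length();
    if (m == 0 || m > n) return 0.0f;  // no full-length window exists
    std::size_t bestMatches = 0;
    for (std::size_t pos = 0; pos + m <= n; pos++) {
        std::size_t matches = 0;
        for (std::size_t i = 0; i < m; i++) {
            if (genome[pos + i] == seq[i]) matches++;
        }
        if (matches > bestMatches) bestMatches = matches;
    }
    return static_cast<float>(bestMatches) / static_cast<float>(m);
}

// Same tie rule as findBestGenome: 1, 2, or 3 for a unique winner, else 0.
int findBestGenomeInPlace(const std::string& g1, const std::string& g2,
                          const std::string& g3, const std::string& seq) {
    const float s1 = findBestMatchInPlace(g1, seq);
    const float s2 = findBestMatchInPlace(g2, seq);
    const float s3 = findBestMatchInPlace(g3, seq);
    if (s1 > s2 && s1 > s3) return 1;
    if (s2 > s1 && s2 > s3) return 2;
    if (s3 > s1 && s3 > s2) return 3;
    return 0;
}
```

For the sample call in main.cpp, findBestGenome("GGAACACA", "CGATATGA", "GGAGTA", "CAATC"), the best windows match 2, 3, and 2 of the 5 positions, giving scores 0.4, 0.6, and 0.4, so the expected result is 2. A call such as findBestGenomeInPlace("AAAA", "AAAA", "CCCC", "AA") exercises the tie rule and yields 0.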