

Question

In this assignment, you will develop a few functions for DNA analysis. These functions will calculate common measures of DNA similarity, such as the Hamming distance and the Best Match between two DNA sequences. Each of the DNA sequences you need for this assignment can be copied from this write-up and stored in a variable in your program. There is a sample DNA sequence for a mouse, human, and an unknown species. Your mission is to determine the identity of the unknown by comparing it to the human and the mouse. If the unknown species is more similar to the human than it is to the mouse, then you can conclude that the unknown sequence is from a human. Otherwise, you can conclude that the unknown is from a mouse.

Your assignment needs to include at least the following functions for full credit:

void calcSimilarity(double *similarity, string DNA1, string DNA2);

string calculateBestMatch(double *bestscore, int *index, string DNA1, string DNA2);

Hamming distance is one of the most common ways to measure the similarity between two strings of the same length. It is a position-by-position comparison that counts the number of positions at which the corresponding characters of the two strings differ. Two strings with a small Hamming distance are more similar than two strings with a larger Hamming distance.

Example: first string = “ACCT”, second string = “ACCG”

A C C T
| | | *
A C C G

In this example, there are three matching characters and one mismatch, so the Hamming distance is one.
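The position-by-position count described above can be sketched as follows. The function name hammingDistance is a hypothetical helper, not one of the functions the assignment requires, and returning -1 for strings of different lengths is an assumption made here because the write-up only defines the distance for equal-length strings.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper (not part of the required interface).
// Counts the positions at which two equal-length strings differ.
// Assumption: returns -1 when the lengths differ, because the
// Hamming distance is only defined for strings of the same length.
int hammingDistance(const std::string& a, const std::string& b) {
    if (a.size() != b.size()) {
        return -1;
    }
    int distance = 0;
    for (std::size_t i = 0; i < a.size(); ++i) {
        if (a[i] != b[i]) {
            ++distance;  // mismatch at position i
        }
    }
    return distance;
}
```

For the example strings, hammingDistance("ACCT", "ACCG") counts one mismatch, at the last position.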

The similarity score for two sequences is then calculated as follows:

similarity_score = (string length - Hamming distance) / string length

similarity_score = (4 - 1) / 4 = 3/4 = 0.75

Two sequences with a high similarity score are more similar than two sequences with a lower similarity score. The Best Match algorithm extends the Hamming distance calculation by finding the best overlap of the two strings. Given a string and a shorter substring, calculate the Hamming distance between the substring and the equal-length portion of the string starting at each position of the string; the position with the smallest distance is the best match.

calcSimilarity(double*, string, string)

The calcSimilarity() function should take two string arguments and a pointer to a double that stores the similarity between the strings. You can declare a pointer to a double just as you would a pointer to an integer:

double x;

double *dPtr = &x;

The function should calculate the similarity score for the two strings and update the similarity with that score.
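One way calcSimilarity() might look is sketched below, using the signature given in this write-up. Writing 0.0 for empty strings or strings of different lengths is an assumption; the assignment does not say how those cases should be handled.

```cpp
#include <cstddef>
#include <string>
using std::string;

// Sketch of calcSimilarity() with the signature from the write-up.
// Assumption: empty or unequal-length inputs yield a similarity of 0.0.
void calcSimilarity(double *similarity, string DNA1, string DNA2) {
    if (DNA1.empty() || DNA1.size() != DNA2.size()) {
        *similarity = 0.0;
        return;
    }
    std::size_t mismatches = 0;  // this is the Hamming distance
    for (std::size_t i = 0; i < DNA1.size(); ++i) {
        if (DNA1[i] != DNA2[i]) {
            ++mismatches;
        }
    }
    // (string length - Hamming distance) / string length
    *similarity = static_cast<double>(DNA1.size() - mismatches) / DNA1.size();
}
```

Calling it with a small hand-checked pair first, such as "ACCT" and "ACCG" (expected score 0.75), follows the testing advice in the note below.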

Note: when you test calcSimilarity(), pass in strings where you can calculate the similarity by hand before passing it real data. That will help you identify errors in your algorithm.

calculateBestMatch(double*, int*, string, string)

The calculateBestMatch() function should take four arguments: a double pointer, an integer pointer, and two strings. The double pointer stores the similarity score and the integer pointer stores the index in the string where the best match starts. The first string argument is the string you are searching; the second string argument is the substring to search for. The function returns a string: the portion of the mouse or human DNA sequence that best matches the user-entered sequence, that is, the one with the highest similarity score. Note: you will need to be aware of the end of each string to make sure that you don’t loop off the end of either string.
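A minimal sketch of the sliding comparison follows, using the signature given in this write-up. Several choices here are assumptions rather than requirements: the first position wins when scores tie, and when the substring is empty, longer than the searched string, or matches no character anywhere, the function returns an empty string with an index of -1 and a score of 0.0.

```cpp
#include <cstddef>
#include <string>
using std::string;

// Sketch of calculateBestMatch() with the signature from the write-up.
// DNA1 is the string being searched; DNA2 is the substring to look for.
// Assumptions: ties keep the earliest position; if nothing matches,
// *index is -1, *bestscore is 0.0, and the result is an empty string.
string calculateBestMatch(double *bestscore, int *index, string DNA1, string DNA2) {
    *bestscore = 0.0;
    *index = -1;
    const std::size_t n = DNA1.size();
    const std::size_t m = DNA2.size();
    if (m == 0 || m > n) {
        return "";
    }
    // start + m <= n keeps the window from running off the end of DNA1.
    for (std::size_t start = 0; start + m <= n; ++start) {
        std::size_t matches = 0;
        for (std::size_t j = 0; j < m; ++j) {
            if (DNA1[start + j] == DNA2[j]) {
                ++matches;
            }
        }
        const double score = static_cast<double>(matches) / m;
        if (score > *bestscore) {  // strictly greater: earliest best wins
            *bestscore = score;
            *index = static_cast<int>(start);
        }
    }
    return (*index >= 0) ? DNA1.substr(*index, m) : string();
}
```

Searching "ACCTGA" for "CTG", for example, compares the windows "ACC", "CCT", "CTG", and "TGA" in turn and settles on index 2 with a score of 1.0.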

Functionality in main()

In your main() function, you will need to call the other functions you have written. Use the mouse and human DNA samples shown below in this write-up; the unknown DNA sample is provided just for testing your program. Your first task is to ask the user to enter the unknown DNA sequence and store it in a variable. You should output the results of the function calls in the main() function. After calling calcSimilarity(), you need to output the identity of the unknown DNA sequence.

if the unknownDNA is more similar to the humanDNA

print “Human”

else if the unknownDNA is more similar to the mouseDNA

print “Mouse”

else (the unknownDNA is equally similar to both mouse and human)

print “Identity cannot be determined.”
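The pseudocode above could be expressed as a small function like the one below. The name identify and its two score parameters are hypothetical; the sketch assumes both scores were computed against the same unknown sequence, so they share a denominator and can be compared directly.

```cpp
#include <string>

// Hypothetical helper: maps the two similarity scores to the message
// that main() should print. Not part of the required interface.
std::string identify(double humanScore, double mouseScore) {
    if (humanScore > mouseScore) {
        return "Human";
    }
    if (mouseScore > humanScore) {
        return "Mouse";
    }
    return "Identity cannot be determined.";
}
```

In main(), the two arguments would come from two calls to calcSimilarity(), one with the human DNA and one with the mouse DNA.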

Before calling calculateBestMatch(), you need to prompt the user for a search string. To compare the search string to the mouse DNA and the human DNA, you would do something like the following:

cout << "Enter a substring: ";

getline(cin, subStr);

calculateBestMatch(&similarityscore, &index, mouseDNA, subStr);

calculateBestMatch(&similarityscore, &index, humanDNA, subStr);

After calling calculateBestMatch(), you need to display the DNA sequence that is the best match as well as the best similarity score. If there isn’t a match of any character, print “Match not found.”

Explanation / Answer

#include <iostream>
#include <cstring>
#include <vector>
#include <string>

using namespace std;

// Sample sequence to index.
static const char* DNAdata = "ACTGCGACGGTACGCTTCGACGTAG";
static const size_t len = strlen(DNAdata);

// uniqueKeys[n-1] holds the distinct substrings of length n;
// locations[n-1][i] holds the start positions of uniqueKeys[n-1][i].
vector< vector< string > > uniqueKeys(len);
vector< vector< vector<size_t> > > locations(len);

// Record that the length-n substring str starts at position loc.
void saveInfo(const char* str, size_t n, size_t loc) {
   vector<string>& keys = uniqueKeys[n-1];
   vector<vector<size_t> >& locs = locations[n-1];
   bool found = false;
   for (size_t i=0; i<keys.size(); ++i) {
      if (keys[i] == str) {
         locs[i].push_back(loc);
         found = true;
         break;
      }
   }
   if (!found) {
      vector<size_t> newcont;
      newcont.push_back(loc);
      keys.push_back(str);
      locs.push_back(newcont);
   }
}

// Print str, then every position at which it occurs, one per line.
void printInfo(const char* str) {
   cout << str << endl;
   size_t n = strlen(str);
   if (n == 0 || n > len) return;   // no index exists for this length
   vector<string>& keys = uniqueKeys[n-1];
   vector<vector<size_t> >& locs = locations[n-1];
   for (size_t i=0; i<keys.size(); ++i) {
      if (keys[i] == str) {
         vector<size_t>& l = locs[i];
         vector<size_t>::iterator iter = l.begin();
         for (; iter != l.end(); ++iter) {
            cout << *iter << endl;
         }
         break;
      }
   }
}

int main() {
   // Work on a writable copy so each window can be terminated in place.
   char* DNA = new char[len+1];
   strcpy(DNA, DNAdata);
   char* end = DNA+len;
   char* start = DNA;
   // For each length n >= 3, slide a window of n characters across the sequence.
   for (size_t n=3; n<=len; ++n) {
      size_t loc = 0;
      char* p = start;
      char* e = p+n;
      while (e <= end) {
         char save = *e;   // temporarily null-terminate the window
         *e = 0;
         saveInfo(p++, n, loc++);
         *e = save;        // restore the overwritten character
         ++e;
      }
   }
   delete[] DNA;
   printInfo("GTA");
   printInfo("ACTGCGACGGTACGCTTCGACGTA");
   return 0;
}

To print every indexed substring and its positions:

// Print, for each length n >= 3, every distinct substring and its start positions.
void printAll() {
   for (size_t n=3; n<=len; ++n) {
      cout << "--> " << n << " <--" << endl;
      vector<string>& keys = uniqueKeys[n-1];
      vector<vector<size_t> >& locs = locations[n-1];
      for (size_t i=0; i<keys.size(); ++i) {
         cout << keys[i] << endl;
         vector<size_t>& l = locs[i];
         vector<size_t>::iterator iter = l.begin();
         for (; iter != l.end(); ++iter) {
            cout << *iter << endl;
         }
      }
   }
}