Introduction

This application presupposes that the user has some basic knowledgement about DNA and its behaviour, once this requirement is fulfilled we will start explaining the application's objective.

Actually, the Shotgun is a well-known DNA sequencing method, this method has three stages: the first one consists in replicate the ADN strand you want to sequence, the second stage breaks up the strands randomly (the original and the replicated ones) and the last stage tries to regenerate the initial strand using the information obtained from the matching between the strand's pieces. This last stage is the most difficult and is affected by the errors that appear in the two previous stages.

Well, the main problem is: when the original strand is assembled from the strand's pieces, every pair of pieces must be compared to obtain the best match between them. Every pair of strand's pieces must be compared and that problem has a maximum cost of O(n^2) (where n is the number of pieces), this must be multiplied by the cost of every comparation. The algorithm that returns the best match for two given pieces has a maximum cost of O(m^2) (where m is the number of bases of the largest piece), so the comparation process has a maximum cost of O(n^2) * O(m^2), considering that the piece's number of bases is similar to the number of pieces, we will have a maximum cost of O(n^4), this cost is computationally unacceptable if we are using a n with a magnitude near or above thousand.

As you have seen, the computational cost of this algorithm is unaffordable, that's why the approach to the problem must be changed: we will not use brute force, instead of that we firstly will use an algorithm with pair by pair selection, that way we will only store the candidate pairs of pieces, those where is possible to find a match; that way the initial number of comparations will be reduced in an order of magnitude. This method's cost will be O(n^2) * O(m) = O(n^3) (supposing m and n with similar magnitude), this cost is, opposite to the first one, computationally affordable.

This didactic program will try to explain the behaviour of the selection algorithm, this algorithm is based in the search of matching words inside the strands, the minimum size of these groups will be selected by the user. This way the parameters of the algorithm will be two ADN strands (in fact they are two pieces of the original strands) and a numeric selection parameter that will define the accuracy of the algorithm.