This application presupposes that the user has some basic knowledgement
about DNA and its behaviour, once this requirement is fulfilled we will start explaining
the application's objective.
Actually, the Shotgun is a well-known DNA sequencing method, this method has three
stages: the first one consists in replicate the ADN strand you want to sequence, the second stage
breaks up the strands randomly (the original and the replicated ones) and the last stage
tries to regenerate the initial strand using the information obtained from the matching between
the strand's pieces. This last stage is the most difficult and is affected by the errors that appear in the two previous stages.
Well, the main problem is: when the original strand is assembled from the strand's pieces, every pair of pieces must be compared to obtain the best match between them. Every pair of strand's pieces
must be compared and that problem has a maximum cost of O(n^2) (where n is the number of pieces),
this must be multiplied by the cost of every comparation. The algorithm that returns the best
match for two given pieces has a maximum cost of O(m^2) (where m is the number of bases of the
largest piece), so the comparation process has a maximum cost of O(n^2) * O(m^2), considering
that the piece's number of bases is similar to the number of pieces, we will have a maximum
cost of O(n^4), this cost is computationally unacceptable if we are using a n with a magnitude near
or above thousand.
As you have seen, the computational cost of this algorithm is unaffordable, that's why
the approach to the problem must be changed: we will not use brute force, instead of that we
firstly will use an algorithm with pair by pair selection, that way we will only store the
candidate pairs of pieces, those where is possible to find a match; that way the
initial number of comparations will be reduced in an order of magnitude.
This method's cost will be O(n^2) * O(m) = O(n^3) (supposing m and n with similar magnitude),
this cost is, opposite to the first one, computationally affordable.
This didactic program will try to explain the behaviour of the selection algorithm, this
algorithm is based in the search of matching words inside the strands, the minimum size of
these groups will be selected by the user. This way the parameters of the algorithm will
be two ADN strands (in fact they are two pieces of the original strands) and a numeric selection
parameter that will define the accuracy of the algorithm.