Mauve, multiple genome alignment
Maintained by Aaron Darling, Todd Treangen
About procrastAligner


procrastAligner is an efficient local multiple alignment heuristic for identification of conserved regions in one or more DNA sequences. More specifically, procrastAligner has been designed for local multiple alignment of interspersed DNA repeats. The algorithm consists of seven main steps:


(1) palindromic spaced seed patterns to match both DNA strands simultaneously,
(2) seed extension (chaining) in order of decreasing multiplicity,
(3) procrastination when low multiplicity matches are encountered,
(4) gapped extension of seed chains,
(5) detection of unrelated regions using a hidden Markov model,
(6) application of transitive homology relationships, and
(7) removal of any unrelated sequence from the final local multiple alignment.
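
Steps (1) to (3) can be illustrated with a toy sketch. The code below is not the procrastAligner implementation; it only shows, under simplifying assumptions, how a palindromic spaced seed groups matching windows from both strands and how the resulting seed matches can be ordered by decreasing multiplicity. The function names, the rmin default, and the example sequence are hypothetical.

```python
from collections import defaultdict

COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse complement of an upper-case DNA string."""
    return seq.translate(COMPLEMENT)[::-1]


def masked_word(window, pattern):
    """Keep bases under '1' positions of the seed; mask '*' positions."""
    return "".join(b if p == "1" else "*" for b, p in zip(window, pattern))


def seed_matches(seq, pattern):
    """Group window start positions by a strand-independent masked word.

    The key is the smaller of the forward word and the word read from
    the reverse-complemented window, so copies of a repeat on either
    strand fall into the same group.
    """
    groups = defaultdict(list)
    k = len(pattern)
    for i in range(len(seq) - k + 1):
        window = seq[i:i + k]
        fwd = masked_word(window, pattern)
        rev = masked_word(reverse_complement(window), pattern)
        groups[min(fwd, rev)].append(i)
    return groups


def by_decreasing_multiplicity(groups, rmin=2):
    """Return position lists with at least rmin copies, largest first."""
    matches = [pos for pos in groups.values() if len(pos) >= rmin]
    return sorted(matches, key=len, reverse=True)


# Two forward copies and one reverse-strand copy of a 7 bp motif.
motif = "AAGCTCC"
seq = motif + "GT" + motif + "CA" + reverse_complement(motif)
for positions in by_decreasing_multiplicity(seed_matches(seq, "11*1*11")):
    print(len(positions), positions)
```

In this sketch the highest-multiplicity groups are simply listed first; the actual algorithm additionally chains seeds and defers ("procrastinates") work on low-multiplicity matches, which is not modeled here.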

The emission probabilities for each possible pair of aligned nucleotides in our HMM for detecting unrelated regions in the local multiple alignments were extracted from the HOXD substitution matrix presented by Chiaromonte et al. (2002), "Scoring Pairwise Genomic Sequence Alignments". Further details on how we extracted and implemented these values can be found here.

For further details on the procrastAlign algorithm:

Gapped Extension for Local Multiple Alignment of Interspersed DNA Repeats. Todd J Treangen, Aaron E Darling, Mark A. Ragan, Xavier Messeguer. Lecture Notes in Bioinformatics 4983. pp. 74–86, 2008.
© Springer-Verlag Berlin Heidelberg 2008. [pdf]

Procrastination leads to efficient filtration for local multiple alignment. Aaron E Darling, Todd J Treangen, Louxin Zhang, Carla Kuiken, Xavier Messeguer, Nicole T. Perna. Lecture Notes in Bioinformatics 4175. pp. 126–137, 2006.
© Springer-Verlag Berlin Heidelberg 2006. [pdf]

 

Download

The procrastAligner source can currently be downloaded as part of the mauveAligner/libMems/libGenome source tree snapshot here: http://gel.ahabs.wisc.edu/mauve/source/snapshots/

A universal Mac OS X binary is here: procrastAligner-osx.gz
A Linux 32-bit binary is here: procrastAligner-linux.gz
A Windows 32-bit binary is here: procrastAligner-win.zip

procrastAligner takes a FastA formatted sequence as input, and outputs a list of local multiple alignments in eXtended Multi-FastA (XMFA) format and XML, sorted by multiplicity and Sum-of-Pairs score. procrastAligner can be run simply with: procrastAligner --sequence=fasta_sequence_file.fna. Running procrastAligner --help provides a complete list of command-line options. Some of the most commonly configured options are:


--w = max gap width (distance between two chained match components)
--z = seed weight (--z=15 or --z=17 is a good starting point for 1-10 Mb genomes)
--l = minimum repeat length, after chaining and gapped extension
--rmin = the minimum seed multiplicity (copy number)
--rmax = the maximum seed multiplicity (copy number)
--output = where to store the program output
--xmfa = the local multiple alignments in XMFA format
--xml = the local multiple alignments in XML format
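
As a minimal sketch of how these options combine, the following Python helper assembles a procrastAligner command line from the options listed above. The helper name build_command, its keyword arguments, and the file names are hypothetical; only the option names and the --name=value form are taken from this page, and the value types accepted by each option are assumptions.

```python
import shlex


def build_command(sequence, z=None, w=None, l=None, rmin=None, rmax=None,
                  output=None, xmfa=None, xml=None, exe="procrastAligner"):
    """Build an argument list using the --name=value form shown above."""
    cmd = [exe, f"--sequence={sequence}"]
    options = {"z": z, "w": w, "l": l, "rmin": rmin, "rmax": rmax,
               "output": output, "xmfa": xmfa, "xml": xml}
    for name, value in options.items():
        if value is not None:  # only pass options the caller set
            cmd.append(f"--{name}={value}")
    return cmd


# Seed weight 15 is suggested above as a starting point for 1-10 Mb genomes.
cmd = build_command("fasta_sequence_file.fna", z=15, xmfa="repeats.xmfa")
print(shlex.join(cmd))
```

The resulting list can be passed to subprocess.run(cmd, check=True) once the procrastAligner binary is on the PATH.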


Palindromic Spaced Seeds

The table below lists the palindromic spaced seeds used by procrastAligner. The sensitivity ranking of a seed at various levels of sequence identity is given in the columns at right. A seed with rank 1 is the most sensitive seed pattern for a given weight and percent sequence identity. The first listed seed of a given weight is used by default unless otherwise specified by the user.

Palindromic Spaced Seed Table

Weight  Pattern                        Seed Rank by Sequence Identity
                                       65%  70%  75%  80%  85%  90%
5       11*1*11                        1    1    1    1    1    1
        1**111**1                      2    2    2    2    2    7
        11**1**11                      3    3    3    3    3    2
6       1*11***11*1                    1    1    1    1    1    1
        11**1*1**11                    2    2    2    2    2    3
        11*1*1*11                      3    3    3    3    3    1
7       11**1*1*1**11                  1    1    1    1    1    1
        1*11***1***11*1                2    2    2    2    2    2
        11*1***1***1*11                3    3    3    3    3    3
8       111**1**1**111                 1    1    1    1    1    1
        111**1*1**111                  2    2    2    2    2    2
        11**1*1*1*1**11                4    4    3    4    4    4
9       111*1**1**1*111                1    1    1    1    1    1
        111**1**1**1**111              3    2    2    2    2    2
        111**1*1*1**111                2    3    3    3    3    3
10      111*1**1*1**1*111              1    1    1    1    1    1
        111*1**1**1**1*111             5    3    2    2    2    2
        111*1**11**1*111               2    2    3    3    3    3
11      1111**1*1*1**1111              1    1    1    1    1    2
        111*1*1**1**1*1*111            3    2    2    2    2    2
        111**1**1*1*1**1**111          9    6    3    3    3    3
12      1111**1*1*1*1**1111            5    3    1    1    1    2
        1111*1**11**1*1111             1    1    3    3    2    3
        111*11*1***1*11*111            3    2    2    2    3    6
13      1111**1**1*1*1**1**1111        >10  5    1    1    1    2
        111*1*11**1**11*1*111          2    1    2    2    2    2
        111*1**11*1*11**1*111          5    3    4    3    4    6
14      1111**11*1*1*11**1111          2    2    1    1    1    1
        1111*1*11**11*1*1111           1    1    2    2    2    2
        1111*1*1**11**1*1*1111         4    4    3    3    4    5
15      1111*1*11**1**11*1*1111        1    1    1    1    1    1
        1111*11**1*1*1**11*1111        3    2    2    2    3    4
        1111**11*1*1*1*11**1111        5    3    3    3    2    2
16      1111*1*11**11**11*1*1111       2    1    1    1    1    1
        111*111**1*11*1**111*111       5    4    2    2    2    2
        11111**11*1*1*11**11111        4    3    4    3    3    4
18      11111**11*1*11*1*11**11111     1    1    1    1    1    1
        11111*1*11**11**11*1*11111     2    2    2    2    2    2
        1111*11**11*1*1*11**11*1111    >10  6    3    3    3    3
19      1111*111**1*111*1**111*1111    5    2    1    1    1    1
        11111*1*11**111**11*1*11111    6    4    2    3    4    6
        1111*11*111*1*111*11*1111      1    1    4    10   >10  >10
20      11111*1*11**11*11**11*1*11111  >10  >10  3    1    1    1
        11111*1*11**111**11*1*11111    1    1    8    >10  >10  >10
        1111*11*111*1*111*11*1111      >10  >10  1    2    3    3
21      11111*111*11*1*11*111*11111    1    1    1    3    3    2
        111111**11*1*111*1*11**111111  >10  3    2    1    1    1
        111111*1*11*111*11*1*111111    3    2    4    10   >10  7
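
The two properties the seed table relies on, seed weight and palindromic shape, can be checked with a short Python sketch. It is an illustration only, not code from procrastAligner: the function names are hypothetical, and the three patterns are copied from the first row of weights 5, 6, and 15 in the table. The last function shows why a palindromic shape lets one pattern serve both strands: sampling the reverse complement of a window gives the same word as reverse-complementing the sampled window.

```python
COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A", "*": "*"}


def seed_weight(pattern):
    """Weight of a spaced seed: the number of '1' (match) positions."""
    return pattern.count("1")


def is_palindromic(pattern):
    """A seed is palindromic when it reads the same in both directions."""
    return pattern == pattern[::-1]


def sample(window, pattern):
    """Keep bases at '1' positions and mask the '*' positions."""
    return "".join(b if p == "1" else "*" for b, p in zip(window, pattern))


def reverse_complement(word):
    """Reverse complement; masked '*' positions stay masked."""
    return "".join(COMPLEMENT[c] for c in reversed(word))


def strand_symmetric(window, pattern):
    """True if sampling commutes with reverse complementation."""
    return (sample(reverse_complement(window), pattern)
            == reverse_complement(sample(window, pattern)))


# First-listed seeds for weights 5, 6 and 15, copied from the table.
DEFAULT_SEEDS = {
    5: "11*1*11",
    6: "1*11***11*1",
    15: "1111*1*11**1**11*1*1111",
}
for weight, pattern in DEFAULT_SEEDS.items():
    assert seed_weight(pattern) == weight
    assert is_palindromic(pattern)

print(strand_symmetric("ACGGTCA", "11*1*11"))  # True: palindromic seed
print(strand_symmetric("ACGT", "11*1"))        # False: asymmetric seed
```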

Acknowledgments

We would like to thank Webb Miller for providing the HOXD training data. We are grateful to Guillaume Achaz for helpful discussions on the gapped extension algorithm. Accuracy evaluations utilized a compute resource grant from the Australian Partnership for Advanced Computing. AED was supported by NSF grant DBI-0630765. TJT was supported by Spanish Ministry MECD Grant TIN2004-03382 and AGAUR Training Grant FI-IQUC-2005.

Copyright 2008
Last Updated: May 1st, 2008