procrastAligner: An Efficient Local Multiple Alignment Filtration Tool


Maintained by Aaron Darling, Todd Treangen

About procrastAligner

procrastAligner is an efficient local multiple alignment heuristic for identification of conserved regions in one or more DNA sequences. More specifically, procrastAligner has been designed for local multiple alignment of interspersed DNA repeats. The algorithm consists of seven main steps:

(1) palindromic spaced seed patterns to match both DNA strands simultaneously,
(2) seed extension (chaining) in order of decreasing multiplicity,
(3) procrastination when low multiplicity matches are encountered,
(4) gapped extension of seed chains,
(5) detection of unrelated regions using a hidden Markov model,
(6) application of transitive homology relationships, and
(7) removal of any unrelated sequence from the final local multiple alignment.
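
As a rough illustration of steps (1) and (2), the following Python sketch shows why a palindromic pattern lets a single lookup cover both DNA strands: because the pattern reads the same in both directions, the masked word taken from the reverse strand is the reverse complement of the masked word taken from the forward strand, so both can be stored under one canonical key. This is not procrastAligner's implementation. The function names, the toy sequence and the rmin default are assumptions made for illustration, and the chaining, procrastination and gapped extension steps are not shown.

```python
from collections import defaultdict

COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Reverse complement of an upper-case ACGT string."""
    return seq.translate(COMPLEMENT)[::-1]


def seed_matches(seq, pattern):
    """Group window start positions by canonical seed word.

    `pattern` uses '1' for positions that must match and '*' for
    don't-care positions. Hypothetical helper, not part of procrastAligner.
    """
    if pattern != pattern[::-1]:
        raise ValueError("pattern must be palindromic")
    groups = defaultdict(list)
    for start in range(len(seq) - len(pattern) + 1):
        window = seq[start:start + len(pattern)]
        word = "".join(b for b, p in zip(window, pattern) if p == "1")
        if set(word) - set("ACGT"):
            continue  # skip windows containing ambiguous bases
        rc = reverse_complement(word)
        # The lexicographically smaller of word / reverse complement is the key,
        # so forward and reverse-strand occurrences land in the same group.
        if word <= rc:
            groups[word].append((start, "+"))
        else:
            groups[rc].append((start, "-"))
    return groups


def by_decreasing_multiplicity(groups, rmin=2):
    """Seed matches with at least `rmin` occurrences, highest multiplicity first."""
    kept = [(key, hits) for key, hits in groups.items() if len(hits) >= rmin]
    return sorted(kept, key=lambda item: (-len(item[1]), item[0]))


# Toy sequence: two forward copies of a repeat (differing only at the
# don't-care position) and one reverse-complemented copy.
toy = "ACGAGTC" + "TTTTT" + "ACGTGTC" + "AAAAA" + "GACTCGT"
for key, hits in by_decreasing_multiplicity(seed_matches(toy, "111*111")):
    print(key, len(hits), hits)
```

In the toy sequence, the copies at positions 0 and 12 are reported on the "+" strand and the copy at position 24 on the "-" strand, all under the same key.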

The emission probabilities for each possible pair of aligned nucleotides in our HMM for detecting unrelated regions in the local multiple alignments were extracted from the HOXD substitution matrix presented by Chiaromonte et al. (2002), "Scoring Pairwise Genomic Sequence Alignments". Further details on how we extracted and implemented these values can be found here.

For further details on the procrastAlign algorithm:

Gapped Extension for Local Multiple Alignment of Interspersed DNA Repeats. Todd J. Treangen, Aaron E. Darling, Mark A. Ragan, Xavier Messeguer. Lecture Notes in Bioinformatics 4983, pp. 74–86, 2008. © Springer-Verlag Berlin Heidelberg 2008. [pdf]

Procrastination leads to efficient filtration for local multiple alignment. Aaron E. Darling, Todd J. Treangen, Louxin Zhang, Carla Kuiken, Xavier Messeguer, Nicole T. Perna. Lecture Notes in Bioinformatics 4175, pp. 126–137, 2006. © Springer-Verlag Berlin Heidelberg 2006. [pdf]

Download

The procrastAligner source can currently be downloaded as part of the mauveAligner/libMems/libGenome source tree snapshot here: https://gel.ahabs.wisc.edu/mauve/source/snapshots/

A universal Mac OS X binary is here: procrastAligner-osx.gz
A Linux 32-bit binary is here: procrastAligner-linux.gz
A Windows 32-bit binary is here: procrastAligner-win.zip

procrastAligner takes a FastA-formatted sequence as input and outputs a list of local multiple alignments in eXtended Multi-FastA (XMFA) format and XML, sorted by multiplicity and Sum-of-Pairs score. In its simplest form, procrastAligner can be run as: procrastAligner --sequence=fasta_sequence_file.fna. Running procrastAligner --help prints a complete list of command-line options. Some of the most commonly configured options are:

--w = max gap width (distance between two chained match components)
--z = seed weight (--z=15 or --z=17 is a good starting point for 1-10 Mb genomes)
--l = minimum repeat length, after chaining and gapped extension
--rmin = the minimum seed multiplicity (copy number)
--rmax = the maximum seed multiplicity (copy number)
--output = where to store the program output
--xmfa = the local multiple alignments in XMFA format
--xml = the local multiple alignments in XML format
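
For scripted runs, a small wrapper along the following lines may be convenient. It is only a sketch: it assembles a command line from the options listed above in the --name=value form shown for --sequence, and it assumes that a procrastAligner binary is on the PATH. The helper name build_command, the output file name and the rmin value are illustrative assumptions, not recommendations from the authors.

```python
import shlex

# Options described on this page; anything else is rejected so that
# typos are caught before the program is launched.
DOCUMENTED_OPTIONS = ("w", "z", "l", "rmin", "rmax", "output", "xmfa", "xml")


def build_command(sequence, executable="procrastAligner", **options):
    """Return the argument list for one procrastAligner run (hypothetical helper)."""
    cmd = [executable, f"--sequence={sequence}"]
    for name, value in options.items():
        if name not in DOCUMENTED_OPTIONS:
            raise ValueError(f"unknown option: --{name}")
        cmd.append(f"--{name}={value}")
    return cmd


cmd = build_command("fasta_sequence_file.fna", z=15, rmin=2, xmfa="repeats.xmfa")
print(shlex.join(cmd))
# To actually launch the run, pass the list to the standard library, e.g.:
#   import subprocess; subprocess.run(cmd, check=True)
```

Building the command as a list rather than a single string avoids shell quoting problems when file names contain spaces.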

Palindromic Spaced Seeds

procrastAligner uses the palindromic spaced seeds listed below. For each pattern, the columns at right give its sensitivity rank at various levels of sequence identity. A seed with rank 1 is the most sensitive seed pattern for a given weight and percent sequence identity. The first listed seed of a given weight is used by default unless otherwise specified by the user. A table of all of the seeds follows:

*Palindromic Spaced Seed Table*
Weight	Pattern	Seed Rank by Sequence Identity
Weight	Pattern	65%	70%	75%	80%	85%	90%
5	11111	1	1	1	1	1	1
	11111	2	2	2	2	2	7
	11111	3	3	3	3	3	2
6	111*111	1	1	1	1	1	1
	11*11**11	2	2	2	2	2	3
	1111*11	3	3	3	3	3	1
7	11*111*11	1	1	1	1	1	1
	1111111	2	2	2	2	2	2
	1111111	3	3	3	3	3	3
8	11111**111	1	1	1	1	1	1
	111*11**111	2	2	2	2	2	2
	11*1111**11	4	4	3	4	4	4
9	111111111	1	1	1	1	1	1
	111111111	3	2	2	2	2	2
	111*111*111	2	3	3	3	3	3
10	111111*1111	1	1	1	1	1	1
	1111111111	5	3	2	2	2	2
	1111111111	2	2	3	3	3	3
11	1111*111*1111	1	1	1	1	1	2
	11111111111	3	2	2	2	2	2
	11111111111	9	6	3	3	3	3
12	1111*1111**1111	5	3	1	1	1	2
	111111111111	1	1	3	3	2	3
	111111**111*111	3	2	2	2	3	6
13	1111111111111	>10	5	1	1	1	2
	1111111111111	2	1	2	2	2	2
	1111111111111	5	3	4	3	4	6
14	1111*111111**1111	2	2	1	1	1	1
	1111111*111*1111	1	1	2	2	2	2
	11111111111111	4	4	3	3	4	5
15	111111111111111	1	1	1	1	1	1
	111111111111111	3	2	2	2	3	4
	1111*1111111*1111	5	3	3	3	2	2
16	1111111111111111	2	1	1	1	1	1
	1111111111111111	5	4	2	2	2	2
	11111*111111**11111	4	3	4	3	3	4
18	11111*11111111*11111	1	1	1	1	1	1
	111111111111111111	2	2	2	2	2	2
	111111111111*111111	>10	6	3	3	3	3
19	1111111111111111111	5	2	1	1	1	1
	1111111111111111111	6	4	2	3	4	6
	1111111111111111111	1	1	4	10	>10	>10
20	11111111*1111*111*11111	>10	>10	3	1	1	1
	1111111111111111111	1	1	8	>10	>10	>10
	1111111111111111111	>10	>10	1	2	3	3
21	111111111111111111111	1	1	1	3	3	2
	111111*111111111*111111	>10	3	2	1	1	1
	111111111111111111111	3	2	4	10	>10	7
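
To give a feel for what a sensitivity ranking measures, the sketch below computes the probability that a seed pattern matches at least once inside an ungapped homologous region. It rests on assumptions that are not taken from this page: alignment columns match independently with probability equal to the sequence identity, and the region is 64 columns long. The ranks in the table may have been derived under a different model, so the numbers produced here are not expected to reproduce them. The function name hit_probability is hypothetical.

```python
from collections import defaultdict


def hit_probability(pattern, identity, region_length):
    """P(at least one seed hit) in a region of `region_length` columns,
    each matching independently with probability `identity`."""
    if (not pattern or set(pattern) - {"1", "*"}
            or pattern[0] != "1" or pattern[-1] != "1"):
        raise ValueError("pattern must use only '1'/'*' and start and end with '1'")
    span = len(pattern)
    # Bit (span - 1 - i) of a window holds the match state at pattern offset i.
    required = 0
    for i, symbol in enumerate(pattern):
        if symbol == "1":
            required |= 1 << (span - 1 - i)
    full_mask = (1 << span) - 1
    hist_mask = (1 << (span - 1)) - 1
    # no_hit maps "last span-1 match bits" -> probability of no hit so far.
    no_hit = {0: 1.0}
    for t in range(1, region_length + 1):
        nxt = defaultdict(float)
        for hist, prob in no_hit.items():
            for bit, p_bit in ((1, identity), (0, 1.0 - identity)):
                window = ((hist << 1) | bit) & full_mask
                if t >= span and (window & required) == required:
                    continue  # seed hit: this probability mass is absorbed
                nxt[window & hist_mask] += prob * p_bit
        no_hit = nxt
    return 1.0 - sum(no_hit.values())


# Two patterns taken from the table, evaluated at the table's identity levels.
for pattern in ("111*111", "11*111*11"):
    row = [hit_probability(pattern, pid / 100, 64) for pid in (65, 70, 75, 80, 85, 90)]
    print(pattern, " ".join(f"{p:.3f}" for p in row))
```

Ranking all candidate patterns of one weight by this probability, separately for each identity level, would yield a table of the same shape as the one above.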

Acknowledgments

We would like to thank Webb Miller for providing the HOXD training data. We are grateful to Guillaume Achaz for helpful discussions on the gapped extension algorithm. Accuracy evaluations utilized a compute resource grant from the Australian Partnership for Advanced Computing. AED was supported by NSF grant DBI-0630765. TJT was supported by Spanish Ministry MECD Grant TIN2004-03382 and AGAUR Training Grant FI-IQUC-2005.

Copyright 2008
Last Updated: May 1st, 2008