ALGGEN - PROMO help

July 2001
Contact: D. Farre

Contents:

1. Introduction
2. Selection of species and construction of weight matrices
3. Search for motifs in a DNA sequence
4. Output of the program
5. Specificity of weight matrices
6. Multiple search for motifs in a set of DNA sequences

Introduction

Many binding sites for transcription factors have been experimentally identified, and this information can be used in computational searches to characterise new gene regulatory regions. PROMO is a program that predicts potential transcription factor binding sites in DNA sequences, restricted to the kinds of sites that have already been experimentally identified. The TRANSFAC database (Wingender et al., 2001) is used as the source of known binding sites and transcription factors. Weight matrices representing the binding sites are constructed dynamically from factor-specific collections of sites. Two features distinguish PROMO from other programs: the user can reduce or expand the set of species used as a source of known sites and factors, and the output reports other genes known to contain the sites, or combinations of sites, predicted in the query sequence.

Use. PROMO can be used to guide experiments in the lab and to identify regulatory sites shared by different DNA sequences. Different taxonomic levels can be explored. It is important to bear in mind that many of the predicted hits may not represent true sites. Low expectation values and the presence of the predicted sites in functionally-related genes may enhance the confidence in particular predictions.

Selection of species and construction of weight matrices

The intrinsic sequence variability of factor binding sites can be represented as IUPAC consensus sequences or as weight matrices. The latter store the frequency of each nucleotide at each position of the site, are generally more specific (fewer false positives) and allow matches to be rated. PROMO constructs weight matrices that represent factor-specific binding sites experimentally identified in a particular species or group of species. The source of site sequences and transcription factors is TRANSFAC, which contains known transcription regulatory sites, mostly from eukaryotes.

Selection of species. The user can select the taxonomic level for both binding site and transcription factor, using a taxonomic tree automatically derived from the organism annotations in TRANSFAC entries. This is done with the module 'SelectSpecies'. The 'site' and 'factor' fields are shown separately because some experiments on DNA recognition proteins have been performed in hybrid systems (for example, binding of a human factor to a yeast DNA sequence). In most cases selecting at the 'site' level alone will be appropriate. Alternatively, one can select the same taxonomic group for both 'site' and 'factor'; the results should be similar.

Weight matrix construction. Once the species has been selected, weight matrices are constructed for every transcription factor that has at least three different binding sites. Matrix construction is performed on the fly, which may take up to a few minutes. However, in many cases the matrices have already been pre-calculated (for example for humans or for all organisms), so there is no need to wait. The algorithm searches for the longest completely conserved sub-sequence (core) to anchor the alignment of the binding sites. A subset of the sequences may be discarded if this leads to an increase in the number of completely conserved sites, provided that at least half of the original sequences are maintained in the final alignment. Any terminal sequences containing gaps are eliminated. Weight matrices containing the number of occurrences of each nucleotide at each sequence position are derived from the alignments. They can be inspected with the module 'ViewMatrices', which also shows expectation values at different levels of similarity and a list of genes that contain the sites represented in the matrix.
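The construction steps described above can be illustrated with a minimal Python sketch. It is not PROMO's implementation: the function names are hypothetical, the core is located with a brute-force search, the first occurrence of the core in each site is used as the anchor, and the optional step of discarding a subset of sequences is left out.

```python
from collections import Counter

NUCLEOTIDES = "ACGT"

def longest_common_core(sites):
    """Longest substring present in every site (brute force, longest first)."""
    shortest = min(sites, key=len)
    for size in range(len(shortest), 0, -1):
        for start in range(len(shortest) - size + 1):
            core = shortest[start:start + size]
            if all(core in s for s in sites):
                return core
    return ""

def align_on_core(sites):
    """Anchor the sites on the core and trim ends not covered by every site."""
    core = longest_common_core(sites)
    if not core:
        raise ValueError("no completely conserved core found")
    offsets = [s.find(core) for s in sites]      # assumption: first occurrence
    left = min(offsets)                          # bases kept before the core
    right = min(len(s) - off for s, off in zip(sites, offsets))
    return [s[off - left:off + right] for s, off in zip(sites, offsets)]

def count_matrix(aligned_sites):
    """One {nucleotide: count} dictionary per alignment column."""
    matrix = []
    for column in zip(*aligned_sites):
        counts = Counter(column)
        matrix.append({n: counts.get(n, 0) for n in NUCLEOTIDES})
    return matrix

# Hypothetical binding sites, used only to exercise the functions.
sites = ["AATGACTCAG", "TGAGTCAGC", "CTGACTCA"]
aligned = align_on_core(sites)
matrix = count_matrix(aligned)
```

With these three example sites the conserved core is TGA, and the trimmed alignment is seven columns wide.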

Search for motifs in a DNA sequence

Input query sequence. Using the module 'SearchSite' the user can input a sequence to be inspected for matches to factor-specific weight matrices. The sequence can be copied and pasted into a sequence window or, alternatively, loaded from a file. The dissimilarity threshold is the parameter that controls how similar a sequence must be to the matrix to be reported as a hit. The default is 15% (85% similarity), but it can be modified by the user.

Similarity of a sequence to the matrix. For each weight matrix, sequences of the same length as the matrix are rated by PROMO for their similarity to the matrix using the method of Quandt et al. (1995), which is based on the Shannon entropy. All possible sequences that score above the similarity threshold for the different matrices are stored in an automaton. Exact matches of the query sequence to the automaton represent the predictions. Both strands are searched for matches.
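The following Python sketch illustrates an entropy-weighted similarity score of this kind. It assumes the conservation-index weighting usually associated with Quandt et al. (1995); whether PROMO applies exactly this formula is an assumption. It also scans the query with a sliding window rather than with an automaton, and the matrix and all function names are hypothetical.

```python
import math

# Hypothetical count matrix (one dictionary per position).
MATRIX = [
    {"A": 0, "C": 0, "G": 0, "T": 4},
    {"A": 0, "C": 0, "G": 3, "T": 1},
    {"A": 4, "C": 0, "G": 0, "T": 0},
    {"A": 0, "C": 3, "G": 1, "T": 0},
    {"A": 0, "C": 0, "G": 0, "T": 4},
    {"A": 0, "C": 4, "G": 0, "T": 0},
    {"A": 4, "C": 0, "G": 0, "T": 0},
]

def conservation(column):
    """Entropy-based weight: 100 for a fully conserved column, lower otherwise."""
    total = sum(column.values())
    term = sum((c / total) * math.log(c / total)
               for c in column.values() if c > 0)
    return 100.0 / math.log(5) * (term + math.log(5))

def similarity(matrix, sequence):
    """Weighted score of the sequence divided by the best attainable score."""
    if len(sequence) != len(matrix):
        raise ValueError("sequence and matrix must have the same length")
    score = best = 0.0
    for column, base in zip(matrix, sequence):
        weight = conservation(column)
        score += weight * column.get(base, 0)
        best += weight * max(column.values())
    return score / best

def reverse_complement(seq):
    return seq.translate(str.maketrans("ACGT", "TGCA"))[::-1]

def scan(matrix, query, dissimilarity=0.15):
    """Report (strand, start, similarity) for windows above the threshold.

    Start positions on the '-' strand refer to the reverse-complemented query.
    """
    width = len(matrix)
    hits = []
    for strand, seq in (("+", query), ("-", reverse_complement(query))):
        for start in range(len(seq) - width + 1):
            value = similarity(matrix, seq[start:start + width])
            if value >= 1.0 - dissimilarity:
                hits.append((strand, start, round(value, 3)))
    return hits

hits = scan(MATRIX, "CCTGACTCAGG")
```

The sliding window is used here only for brevity; precomputing all sequences above the threshold, as the text describes, trades memory for faster exact matching.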

Output of the program

Matches. The predictions are shown on the corresponding position of the query sequence, including the name of the factor that binds to the motif and the similarity index. One should be aware that many of the hits are unlikely to be real sites.

Expectation values. To give a measure of the reliability of the different predictions, the expected number of occurrences of each match in a random sequence of 1000 nucleotides is calculated under two models: one in which the four nucleotides are equiprobable, and one with the same nucleotide frequencies as the query sequence. The E values depend on the similarity of the sequence to the matrix, so several E values are provided for different similarity levels. The E value corresponding to the similarity of the match indicates how reliable the hit is. For example, a value of 0.001 means that the hit is expected to occur by chance only once in 1 Mb of sequence, while a value of 1 means that it is expected once in 1 kb of sequence.
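As an illustration of how such an expectation can be obtained, the Python sketch below enumerates every sequence of the matrix width, sums the probability of those that reach the similarity level, and multiplies by the number of windows in 1000 nucleotides. The unweighted `simple_similarity` is only a stand-in for the real scoring method, a single strand is counted, and the names are hypothetical; how PROMO actually computes its E values is not described here. Exhaustive enumeration is practical only for short matrices.

```python
from itertools import product

NUCLEOTIDES = "ACGT"

def simple_similarity(matrix, seq):
    """Stand-in score: summed counts divided by the best attainable sum."""
    best = sum(max(col.values()) for col in matrix)
    return sum(col[b] for col, b in zip(matrix, seq)) / best

def base_frequencies(query):
    """Nucleotide frequencies of the query sequence (second background model)."""
    return {n: query.count(n) / len(query) for n in NUCLEOTIDES}

def expected_hits(matrix, threshold, freqs=None, length=1000):
    """Expected number of windows scoring >= threshold in a random sequence."""
    if freqs is None:
        freqs = {n: 0.25 for n in NUCLEOTIDES}   # equiprobable model
    width = len(matrix)
    p_hit = 0.0
    for seq in product(NUCLEOTIDES, repeat=width):
        if simple_similarity(matrix, seq) >= threshold:
            p = 1.0
            for base in seq:
                p *= freqs[base]
            p_hit += p
    return p_hit * (length - width + 1)

# Hypothetical fully conserved three-position matrix (AAA).
toy = [{"A": 4, "C": 0, "G": 0, "T": 0}] * 3
e_uniform = expected_hits(toy, 1.0)
e_query = expected_hits(toy, 1.0, base_frequencies("AACG"))
```

For this toy matrix only AAA reaches a similarity of 1, so the equiprobable model gives 998/64 expected occurrences in 1000 nucleotides.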

Links to other genes. In addition, for each of the matches and combinations of matches, a list of genes that are known to contain the regulatory motifs is shown. The distribution of motifs in these genes can be visualised through a graphical representation, and more detailed information for each gene, extracted from TRANSFAC, can be consulted through hyperlinks. Functional relationships between genes may point to particular biologically meaningful sites.

Specificity of weight matrices

We have introduced a new measure that computes the specificity of a matrix. Each column of the weight matrix can easily be transformed into a vector containing the probability of each nucleotide. We define the specificity of a column as the Euclidean distance between this vector and a vector in which all nucleotides are equiprobable, and the specificity of the matrix as a whole as the normalised average distance over all columns. This definition of specificity is highly correlated with definitions based on the average Shannon entropy of all columns, but it allows different methods of computing the distances to be introduced, at a lower computational cost.
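The definition above can be written as a short Python sketch. The normalisation constant is an assumption: the average distance is divided by the largest possible column distance, sqrt(0.75), reached by a fully conserved column, so that the result lies between 0 and 1. Function names are hypothetical.

```python
import math

NUCLEOTIDES = "ACGT"
# Distance between (1, 0, 0, 0) and (0.25, 0.25, 0.25, 0.25).
MAX_DISTANCE = math.sqrt(0.75)

def column_specificity(column):
    """Euclidean distance between the column frequencies and the uniform vector."""
    total = sum(column.values())
    return math.sqrt(sum((column.get(n, 0) / total - 0.25) ** 2
                         for n in NUCLEOTIDES))

def matrix_specificity(matrix):
    """Average column distance, normalised to the range 0..1 (assumed scaling)."""
    mean = sum(column_specificity(c) for c in matrix) / len(matrix)
    return mean / MAX_DISTANCE

conserved = [{"A": 4, "C": 0, "G": 0, "T": 0}] * 5
uniform = [{"A": 1, "C": 1, "G": 1, "T": 1}] * 5
```

A matrix whose columns are all fully conserved scores 1, and a matrix with equiprobable nucleotides in every column scores 0.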

The module 'MatrixSpecificity' pictures the specificity distribution of the matrices. This can be done for a given taxon (species or group of species), or in a comparative manner for two different taxa, taking those matrices that correspond to the same factor (a single TRANSFAC entry may include homologues). In the application, once the "BROWSE" button has been pressed, the list of matrix sets stored on our server appears. When the user selects an item from this list (by clicking with the mouse), the data are loaded from the server and the item is printed in the window below. The user can then select the item from this second window and see the specificity of the matrices by pushing the "DRAW" button; clicking on a pixel shows the corresponding factor. The applet pictures the specificity of the human_human matrices (factor species_site species). The white pixels over the diagonal represent the specificity of the matrices, and the values range from 0 (lower-left corner) to 1 (upper-right corner).

If the user wishes to compare the specificity of two sets of matrices, two items must be selected from the second window before pressing the "DRAW" button. For instance, assume that the user compares the specificity of human_human and vertebrate_vertebrate matrices that correspond to the same transcription factor. As expected, there are many pixels on the diagonal and more pixels above the diagonal than below it. The first fact means that the factor's matrix has the same specificity in both sets, while the second fact suggests that the human_human matrices are more specific.

Multiple search for motifs in a set of DNA sequences

The increasingly widespread use of microarray-based expression profiling in recent years is generating an enormous amount of data. However, information about transcriptional regulation, which is responsible to a large extent for the observed profiles, is not contained in the arrays, which are built from the cDNA sequences of the genes. Most of the transcriptional regulatory signals are likely to be in the promoter and enhancer/silencer regions of genes. Promoter sequences corresponding to the cDNAs can be discovered in genome sequences by exon mapping or with promoter prediction tools (Hannenhalli et al. 2001; Liu et al. 2002). The complete sequencing of some genomes now makes this possible. A comparative promoter analysis of the co-expressed genes can then be applied, with the aim of identifying common transcription regulatory signals.

We have developed a new version of PROMO that includes an option for comparing two or more promoter (or enhancer/silencer) sequences. Using the data contained in TRANSFAC, the 'MultiSearchSites' option of PROMO provides information on the binding site predictions that appear in at least a minimum number (determined by the user) of the input sequences. This new option can be applied to analyse co-expressed genes. It can also be used to compare promoter sequences of orthologous genes from different genomes.
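The filtering step of such a multiple search can be sketched in Python as follows. This is only an illustration of the idea of keeping factors predicted in a minimum number of sequences; the function name, the input layout and the example predictions are hypothetical and do not come from PROMO.

```python
from collections import Counter

def shared_predictions(predictions_by_sequence, min_sequences):
    """Factors predicted in at least `min_sequences` of the input sequences.

    `predictions_by_sequence` maps a sequence name to the factors predicted
    in it; a factor is counted once per sequence, however many hits it has.
    """
    counts = Counter()
    for factors in predictions_by_sequence.values():
        counts.update(set(factors))
    return {factor: n for factor, n in counts.items() if n >= min_sequences}

# Hypothetical predictions for three promoter sequences.
predictions = {
    "promoter_1": ["AP-1", "Sp1", "Sp1"],
    "promoter_2": ["Sp1", "GATA-1"],
    "promoter_3": ["AP-1", "Sp1"],
}
common = shared_predictions(predictions, min_sequences=2)
```

Raising the minimum to the total number of sequences keeps only the factors predicted in every input sequence.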