Help Fasta Format



In bioinformatics, FASTA format is a file format used to exchange information between genetic sequence databases. A FASTA file looks like this:

homo.fna

>AB000263 |acc=AB000263|descr=Homo sapiens mRNA for prepro cortistatin like peptide, complete cds.|len=368
ACAAGATGCCATTGTCCCCCGGCCTCCTGCTGCTGCTGCTCTCCGGGGCC
ACGGCCACCGCTGCCCTGCCCCTGGAGGGTGGCCCCACCGGCCGAGACAG
CGAGCATATGCAGGAAGCGGCAGGAATAAGGAAAAGCAGCCTCCTGACTT
TCCTCGCTTGGTGGTTTGAGTGGACCTCCCAGGCCAGTGCCGGGCCCCTC
ATAGGAGAGGAAGCTCGGGAGGTGGCCAGGCGGCAGGAAGGCGCACCCCC
CCAGCAATCCGCGCGCCGGGACAGAATGCCCTGCAGGAACTTCTTCTGGA
AGACCTTCTCCTCCTGCAAATAAAACCTCACCCATGAATGCTCACGCAAGTTTAATTACAGACCTGAA

A record consists of a header line (beginning with a '>') which gives a name and/or a unique identifier for the sequence, and often a lot of other information too. Many sequence databases use standardized headers, which helps when extracting information from the header automatically. After the header line and any comment lines, one or more sequence lines follow; each sequence line should have fewer than 80 characters.
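The layout described above (a '>' header followed by one or more sequence lines) can be read with a few lines of code. The following is only a minimal sketch, not how FREQ itself reads files: the function name parse_fasta and the toy records are hypothetical, comment lines are not handled, and a repeated header would overwrite the earlier record.

```python
from typing import Dict, Iterable, List, Optional


def parse_fasta(lines: Iterable[str]) -> Dict[str, str]:
    """Map each header (without the leading '>') to its joined sequence."""
    chunks: Dict[str, List[str]] = {}
    header: Optional[str] = None
    for raw in lines:
        line = raw.strip()
        if not line:
            continue  # ignore blank lines between records
        if line.startswith(">"):
            # A new record starts; everything after '>' is the header text.
            header = line[1:].strip()
            chunks[header] = []
        elif header is not None:
            # Sequence line: drop embedded spaces and keep it for joining.
            chunks[header].append(line.replace(" ", ""))
    return {name: "".join(parts) for name, parts in chunks.items()}


# Toy input (made up for illustration, not real database records).
toy = [">seq1 toy record", "ACAAGATGCC", "ATTGTCCCCC", "", ">seq2", "GGCC TTAA"]
records = parse_fasta(toy)
```

Joining the sequence lines means the result does not depend on where the file happened to wrap its lines.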

In general, FASTA sequences may be protein sequences or DNA sequences, and they can contain gaps or alignment characters. In FREQ, however, the sequences must be DNA sequences made up of the characters 'A' (adenine), 'C' (cytosine), 'G' (guanine) and 'T' (thymine).
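A quick check for characters outside the allowed alphabet might look like the sketch below. The function name invalid_bases is hypothetical, and the comparison is deliberately strict (upper case only); whether the tool itself accepts lower-case letters is not stated here, so that is an assumption. The allowed set is a parameter so that further symbols can be permitted where appropriate.

```python
from typing import Set


def invalid_bases(sequence: str, allowed: str = "ACGT") -> Set[str]:
    """Return the set of characters in `sequence` that are not in `allowed`."""
    # Strict comparison: lower-case letters count as invalid (an assumption).
    return set(sequence) - set(allowed)


ok = invalid_bases("ACAAGATGCC")      # empty set: only A, C, G, T present
bad = invalid_bases("ACGU-ACG")       # 'U' and '-' are not allowed
```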

Sequences sometimes contain unknown or irrelevant regions. The standard way to represent such bases is with the letter 'N'. In searches, FREQ treats each 'N' as any of the four possible bases (A, C, G or T). As a result, a search may report more matches in those regions than it would if the actual bases were known.
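The effect of 'N' on search results can be illustrated with a small sketch. This is not FREQ's actual search code, only an assumed reading of the behaviour described above: an 'N' in the sequence is allowed to stand for A, C, G or T. The function name count_matches is hypothetical.

```python
def count_matches(sequence: str, pattern: str) -> int:
    """Count start positions where `pattern` occurs in `sequence`,
    letting an 'N' in the sequence match any base of the pattern."""
    if not pattern:
        return 0
    hits = 0
    for start in range(len(sequence) - len(pattern) + 1):
        window = sequence[start:start + len(pattern)]
        # Each position matches if the bases are equal or the sequence has 'N'.
        if all(s == p or s == "N" for s, p in zip(window, pattern)):
            hits += 1
    return hits


with_unknown = count_matches("ANGT", "AC")  # 'AN' can be read as 'AC'
with_known = count_matches("AGGT", "AC")    # same position known to be 'G'
```

Here the sequence containing 'N' yields one match for "AC", while the same sequence with the base known to be 'G' yields none, which is the kind of extra result the paragraph above describes.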

FASTA format files often have file extensions like .fasta, .fa, .mpfa, .fna, or .fsa (and probably many more!).