Genomalysis

Genomalysis User Guide




FASTA Files

FASTA is a text-only format for sequence information. Files in this format can be given virtually any file extension, for example ".fsa", ".fasta", ".faa" or ".fna". Irrespective of the extension, FASTA formatted files are simply text files and can be opened in any text editor. Each sequence is written as DNA bases or as amino acids in single-letter code, preceded by a header line describing the sequence. An example of FASTA format:
>lcl|hmm4680 Gene predicted by Gnomon on Homo sapiens Celera genomic contig HsCraAADB02_1 [1|Celera]
MAECGASGSGSSGDSLDKSITLPPDEIFRNLENAKRFAIDIGGSLTKLAY
YSTVQHKVAKVRSFDHSGKVSLHCGHPGQGSRFSVVLDLALVSQSSLCCC
RPRL*
The description line is a single line that begins with the greater-than symbol (">"). The text following the symbol is optional, so the line can contain any information or none. Everything below this line is treated as one contiguous sequence until another greater-than symbol is encountered, at which point a new sequence begins. Sequence lines are typically no longer than 80 characters, a convention that probably dates from an era when computer displays were far more limited than they are today. Amino acid sequences typically end with an asterisk, which indicates a stop codon. An example of a multiple-sequence FASTA file is as follows:
>lcl|hmm234 Gene predicted by Gnomon on Homo sapiens Celera genomic contig HsCraAADB02_1 [1|Celera]
MVIGHEITHGFDDNGRNFDKNGNMMDWWSNFSTQHFREQSECMIYQYGNY
SWDLADEQNVNGFNTLGENIADNGGVRQAYKAYLKWMAEGGKDQQLPGLD
LTHEQLFFINYAQVAAAVLVPPSPCFPTHLWRAHSGAPPGTRAQHGRPLG
GKA*
>lcl|hmm1170 Gene predicted by Gnomon on Homo sapiens Celera genomic contig HsCraAADB02_1 [1|Celera]
MSTVDLARVGACILKHAVTGEAVELRSLWREHACVVAGLRRFGCVVCRWI
AQDLSSLAGLLDQHGVRLVGVGPEALGLQEFLDGDYFAGELYLDESKQLY
KELGFKRYNSLSILPAALGKPVRDVAAKAKAVGIQGNLSGDLLQSGGLLV
VSKEVPRRLRPQGAHPAGPGHLCGGLCQRPASV*
>lcl|hmm819 Gene predicted by Gnomon on Homo sapiens Celera genomic contig HsCraAADB02_1 [1|Celera]
MRTLPLRFAGDLGTVAEGLPRTWEEGGSAFQSPGAPLRPAAQRGHPQNAR
PGPRRLHAQNPPRASHASCTAAPEARSPWRSQNERRAPGWACGPGGN*
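The rules above (a ">" line starts a record, and every following line belongs to that record until the next ">") can be sketched in a few lines of Python. This is only an illustration of the format, not the code Genomalysis uses; the function name read_fasta and the short example records are hypothetical.

```python
from typing import Iterable, Iterator, Tuple


def read_fasta(lines: Iterable[str]) -> Iterator[Tuple[str, str]]:
    """Yield (description, sequence) pairs from FASTA-formatted lines."""
    description = None
    chunks = []
    for raw in lines:
        line = raw.strip()
        if not line:
            continue  # blank lines carry no sequence data
        if line.startswith(">"):
            # A new description line closes the previous record, if any.
            if description is not None:
                yield description, "".join(chunks)
            description = line[1:].strip()
            chunks = []
        elif description is not None:
            # Sequence lines are joined into one contiguous sequence.
            chunks.append(line)
        # Text that appears before the first ">" line is ignored.
    if description is not None:
        yield description, "".join(chunks)


# Hypothetical two-record input, shortened for readability.
example = """\
>seq1 first hypothetical record
MAECGASG
SGSSGD*
>seq2
MRTLPLRF*
"""

records = list(read_fasta(example.splitlines()))
for description, sequence in records:
    print(description, len(sequence))
```

To read a real file, an open file object can be passed in place of the list of lines, since iterating over it also yields one line at a time.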
If you are interested in a more detailed discussion of the FASTA format, then take a look at the following web sites:

https://en.wikipedia.org/wiki/FASTA_format

http://zhanglab.ccmb.med.umich.edu/FASTA/

Genomalysis FASTA Usage Notes. There are two aspects of how Genomalysis handles FASTA files that the user should note. First, when Genomalysis executes a series of filters on the sequences in an input file, it deletes the asterisk from each sequence as it is fed through the filter set. This does not alter the input file, but the asterisks are not written back to the output file. This is done to keep the asterisks from causing problems with the filter algorithms. When Genomalysis encounters a sequence that has an asterisk within the sequence instead of at the end (this does happen), it deletes the asterisk and treats the sequence like a normal sequence. This is important to be aware of because such sequences are artifacts of automated sequence mining protocols, yet Genomalysis treats them like any other sequence. If they pass all the filters, they are written to the output file with no asterisk, so nothing in the output marks them as spurious. Thankfully, these artifacts are rare.
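Because the asterisks are gone by the time sequences reach the output file, one option is to screen sequences for internal asterisks before filtering. The following is a minimal sketch of such a check, not part of Genomalysis; the function name strip_stop_markers is hypothetical.

```python
from typing import Tuple


def strip_stop_markers(sequence: str) -> Tuple[str, bool]:
    """Return the sequence without asterisks, plus a flag that is True
    when an asterisk occurred somewhere other than the end."""
    body = sequence.rstrip("*")      # drop only the trailing asterisk(s)
    has_internal_stop = "*" in body  # anything left is an internal asterisk
    return sequence.replace("*", ""), has_internal_stop


# Hypothetical sequences: a normal one and one with an internal asterisk.
for seq in ("MAECGASG*", "MAEC*GASG*"):
    cleaned, suspicious = strip_stop_markers(seq)
    print(cleaned, "internal asterisk" if suspicious else "ok")
```

Both example sequences yield the same cleaned string, which is why the flag has to be computed before the asterisks are removed.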

Second, lowercase letters are technically acceptable in FASTA format, where by convention they are mapped to uppercase. Genomalysis, however, does not do this. It reads sequences "as is" and writes them out in the same case as the input. This matters because the user input fields in Genomalysis are case sensitive. If you input a FASTA file that contains lowercase sequences and enter uppercase filter parameters, they will not match, even where the underlying sequences are the same. Make sure that any filter inputs that are sequences or sequence elements match the case of the FASTA file you are filtering.
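The effect of case-sensitive matching can be illustrated with a short Python sketch. It is not Genomalysis code; the helper sequence_case and the example motif are hypothetical, and the sketch only shows why a filter input should be written in the same case as the file.

```python
def sequence_case(sequence: str) -> str:
    """Classify the letters of a sequence as 'upper', 'lower', 'mixed' or 'empty'."""
    letters = [c for c in sequence if c.isalpha()]  # ignore '*' and similar symbols
    if not letters:
        return "empty"
    if all(c.isupper() for c in letters):
        return "upper"
    if all(c.islower() for c in letters):
        return "lower"
    return "mixed"


lowercase_sequence = "maecgasgsgssgd"  # hypothetical lowercase FASTA sequence
motif = "GSGS"                         # hypothetical uppercase filter input

print(sequence_case(lowercase_sequence))    # case used by the file
print(motif in lowercase_sequence)          # case-sensitive comparison fails
print(motif.lower() in lowercase_sequence)  # succeeds once the case agrees
```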

Obtaining FASTA Files for Analysis. FASTA formatted text files are a very common way of storing large amounts of sequence data. For example, NCBI genomic and proteomic data are extensively archived in FASTA format. Comprehensive archives of NCBI sequence data and metadata can be obtained from their FTP site: http://www.ncbi.nlm.nih.gov/Ftp/. Various projects on this FTP site contain FASTA formatted genomes and proteomes, for example, the Genome Assembly/Annotation Projects and RefSeq. In these projects, FASTA formatted sequence files are often designated with a ".faa" or ".fna" extension. It is a good idea to look around this site to see what types of data the various projects contain.

Genomalysis does not care what the extension of a FASTA file is: it will input and output files of any extension. If you attempt to filter a file that is not FASTA formatted, the filtration process will simply not start, and the command line shell that Genomalysis runs from will report numerous errors.
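Since the extension says nothing about the contents, a quick check of the file itself can save a failed run. The sketch below is a simple heuristic, assuming that a FASTA file's first non-blank line is a description line beginning with ">". It does not reproduce whatever validation Genomalysis performs, and the function name looks_like_fasta is hypothetical.

```python
from typing import Iterable


def looks_like_fasta(lines: Iterable[str]) -> bool:
    """Return True if the first non-blank line starts with '>'."""
    for raw in lines:
        line = raw.strip()
        if line:
            return line.startswith(">")
    return False  # empty input is not FASTA


print(looks_like_fasta([">seq1 hypothetical record", "MAECGASG*"]))
print(looks_like_fasta(["This is an ordinary text file."]))
```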

If you would simply like some examples of FASTA formatted sequence files without having to explore the extensive archives of NCBI (or some other database), you can use the example proteomes included with Genomalysis. They are located in a subfolder titled "Example Proteomes" inside the folder where you installed Genomalysis. You can also download the same examples from the download page of the Genomalysis web site: http://genomalysis.org/Download.html
