Introduction
Genomalysis is a Java application, currently implemented for Windows, that lets users perform data mining and viewing operations on the proteomes and genomes of various species. The project aims to provide a rich graphical user interface with which end users can mine for and analyze sequences of interest, together with an extensible mechanism for bulk processing of sequence data, in order to facilitate gene discovery and characterization efforts. Currently, Genomalysis can open, parse, and perform mining functions on files containing genomic and proteomic data in FASTA format. Genomalysis was conceived by Benjamin Patterson, a master's level biologist who studied at Humboldt State University. Some of the research that Genomalysis facilitated is mentioned in his master's thesis (PDF, 1 MB). The original Java/Swing implementation was written by Wolfgang Meyers.
Data mining
Currently in Genomalysis, data mining consists of selecting and configuring a set of sequence filters that will be applied to sequences in an input file. When the filters are executed, sequences are tested against each filter in turn, and receive a pass/fail response at each step. Sequences that receive a pass response from all filters are written to an output file. Here is a list of sequence filters that are currently implemented in Genomalysis:
- Secretion Signal Filter: This filter tests for predicted secretion signals and their associated cleavage sites in protein sequences using the PrediSi algorithm. There are three implementations of this filter in Genomalysis: one for sequences from Gram-negative bacteria, one for sequences from Gram-positive bacteria, and one for sequences from eukaryotic cells. The PrediSi algorithm was developed by Hiller et al.
- Clustal Omega Filter: This allows the user to filter protein or DNA sequences based on various parameters of alignment to a known sequence, parameters such as total number of identities, strong groups and weak groups. Additional information about Clustal can be found at their web page: http://www.clustal.org/
- Regex Filter: This filter allows the user to apply regular expressions to test protein or DNA sequences. If you are unfamiliar with regular expressions, then read "A Primer on Regex" at the end of the filters section of the Genomalysis user guide.
- Transmembrane Prediction Filter: This filter tests protein sequences for predicted transmembrane segments using the single sequence version of the TMAP algorithm. The filter can be configured to find sequences that have a minimum and maximum number of transmembrane segments. The TMAP algorithm was developed by Persson and Argos.
- Sequence Length Filter: This filter tests protein or DNA sequences based on the number of monomers they contain. Sequences pass the filter if their length falls between a user-designated minimum and maximum.
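The pass/fail chain described above can be illustrated with a minimal Java sketch. This is not the Genomalysis source: the class name FilterPipelineSketch and the methods lengthFilter, regexFilter, and applyFilters are hypothetical names chosen for illustration, sequences are treated as plain strings, the length bounds are assumed to be inclusive, and the passing sequences are returned as a list rather than written to an output file.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Predicate;
import java.util.regex.Pattern;

// Hypothetical sketch; none of these names come from the Genomalysis code base.
class FilterPipelineSketch {

    // Passes sequences whose length lies within [min, max] (inclusive by assumption).
    static Predicate<String> lengthFilter(int min, int max) {
        return seq -> seq.length() >= min && seq.length() <= max;
    }

    // Passes sequences in which the regular expression matches somewhere.
    static Predicate<String> regexFilter(String regex) {
        Pattern pattern = Pattern.compile(regex);
        return seq -> pattern.matcher(seq).find();
    }

    // Tests each sequence against the filters in order and keeps it only if
    // every filter passes; testing stops at the first failing filter.
    static List<String> applyFilters(List<String> sequences,
                                     List<Predicate<String>> filters) {
        List<String> passed = new ArrayList<>();
        for (String seq : sequences) {
            boolean ok = true;
            for (Predicate<String> filter : filters) {
                if (!filter.test(seq)) {
                    ok = false;
                    break;
                }
            }
            if (ok) {
                passed.add(seq);
            }
        }
        return passed;
    }

    public static void main(String[] args) {
        List<String> input = List.of("MKTAYIAKQR", "MK", "GGGGGGGGGG");
        List<Predicate<String>> filters =
                List.of(lengthFilter(5, 20), regexFilter("K.A"));
        // Only the first sequence satisfies both filters.
        System.out.println(applyFilters(input, filters));
    }
}
```

Modeling each filter as a Predicate keeps the chain open to new filter types: anything that can answer pass or fail for a sequence can be added to the list without changing applyFilters.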
Sequence Viewing
Currently, Genomalysis can be used to view sequences that are contained in a FASTA-formatted file. When such a file is opened, Genomalysis displays the number of sequences that the file contains and gives you a list of those sequences, from which you can select individual sequences for viewing.
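As a rough illustration of what opening a FASTA file involves, the following Java sketch splits FASTA text into entries, counts them, and lists their headers. It is an assumption-laden sketch rather than the Genomalysis parser: the names FastaViewSketch, Entry, and parse are hypothetical, the input is taken as an in-memory string instead of a file, and the only FASTA rules applied are that a header line starts with ">" and that the following lines, up to the next header, make up the sequence.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch; not the parser used by Genomalysis.
class FastaViewSketch {

    // One FASTA record: the header text (without '>') and the joined sequence.
    static final class Entry {
        final String header;
        final String sequence;

        Entry(String header, String sequence) {
            this.header = header;
            this.sequence = sequence;
        }
    }

    // Splits FASTA text into entries. Blank lines are skipped, and any
    // sequence lines that appear before the first header are ignored.
    static List<Entry> parse(String fastaText) {
        List<Entry> entries = new ArrayList<>();
        String header = null;
        StringBuilder sequence = new StringBuilder();
        for (String rawLine : fastaText.split("\\R")) {
            String line = rawLine.trim();
            if (line.isEmpty()) {
                continue;
            }
            if (line.startsWith(">")) {
                if (header != null) {
                    entries.add(new Entry(header, sequence.toString()));
                }
                header = line.substring(1);
                sequence.setLength(0);
            } else if (header != null) {
                sequence.append(line);
            }
        }
        if (header != null) {
            entries.add(new Entry(header, sequence.toString()));
        }
        return entries;
    }

    public static void main(String[] args) {
        String text = ">seq1 example protein\nMKTAYI\nAKQR\n>seq2\nGGGG\n";
        List<Entry> entries = parse(text);
        System.out.println("Sequences in file: " + entries.size());
        for (Entry entry : entries) {
            System.out.println(entry.header + " (" + entry.sequence.length() + " residues)");
        }
    }
}
```

A viewer built on this idea would show the entry count and the header list first, then display the sequence of whichever entry the user selects.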