Gene Finding
Outline
- Overview
- Gene Structure Prediction
- Gene Naming
- Gene Locus Numbers
- Structure Prediction Validation
- Minimum ORF Length
Overview
This document describes some of the details of the methodology used to produce the automated gene calls for the genome of Fusarium graminearum. Automated gene calls were produced in essentially a two-step procedure:
- Gene locations and structures were predicted using a combination of FGENESH, FGENESH+, and GENEWISE. This process is described in section Gene Structure Prediction.
- Gene "names" were assigned to predicted gene structures based on homology to previously annotated genes. This process is described in section Gene Naming.
Gene Structure Prediction
Gene structures were predicted using a combination of FGENESH, FGENESH+, and GENEWISE. FGENESH and FGENESH+ are gene prediction programs acquired from Softberry.com. GENEWISE is part of the WISE2 package developed by Ewan Birney and is available from the Sanger Center.
Both FGENESH and FGENESH+ utilize a statistical model of gene structure that requires training on each organism for accurate prediction. FGENESH+ additionally combines a protein sequence with the statistical model to improve accuracy. We acquired these programs already trained by Softberry on Fusarium graminearum sequences.
GENEWISE, as we ran it, splices and aligns a protein sequence with genomic sequence to predict a gene structure. Although GENEWISE does utilize some species-specific parameters, most notably for intron nucleotide statistics and splice site consensus sequences, these can be set to non-species-specific defaults. In this case, GENEWISE essentially produces the best local alignment of a protein, assuming that introns start at GT and end at AG most of the time. In some cases this results in a full alignment of the protein to the genome; in others the alignment is incomplete. Since we are interested in predicting complete gene structures, we post-processed incomplete GENEWISE protein alignments by extending the first exon upstream to the nearest start codon and the last exon downstream to the nearest stop codon. If a stop codon was encountered upstream of a gene before a start codon could be found, the gene call was not used.
An assessment of the accuracy of GENEWISE, FGENESH, and FGENESH+ is described below in section Structure Prediction Validation.
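The post-processing of incomplete GENEWISE alignments can be illustrated with a minimal sketch. It is not the pipeline's actual code: the function names are hypothetical, and it assumes a forward-strand, uppercase sequence, 0-based half-open coordinates, an alignment that begins and ends on codon boundaries, and an aligned region that does not already include the stop codon. Rejecting a call when no downstream stop codon is found is also an assumption; the text above only specifies rejection when a stop codon is met upstream before a start codon.

```python
from typing import Optional, Tuple

START_CODON = "ATG"
STOP_CODONS = {"TAA", "TAG", "TGA"}


def extend_to_start(seq: str, cds_start: int) -> Optional[int]:
    """Walk upstream, codon by codon, from cds_start (inclusive).

    Returns the index of the nearest in-frame ATG, or None if a stop
    codon or the beginning of the sequence is reached first.
    """
    pos = cds_start
    while pos >= 0:
        codon = seq[pos:pos + 3]
        if codon == START_CODON:
            return pos
        if codon in STOP_CODONS:
            return None  # stop found before a start: call is rejected
        pos -= 3
    return None


def extend_to_stop(seq: str, cds_end: int) -> Optional[int]:
    """Walk downstream, codon by codon, from cds_end (exclusive end).

    Returns the exclusive end coordinate that includes the nearest
    in-frame stop codon, or None if the sequence ends first.
    """
    pos = cds_end
    while pos + 3 <= len(seq):
        if seq[pos:pos + 3] in STOP_CODONS:
            return pos + 3
        pos += 3
    return None


def complete_alignment(seq: str, cds_start: int,
                       cds_end: int) -> Optional[Tuple[int, int]]:
    """Return extended (start, end) coordinates, or None to discard."""
    new_start = extend_to_start(seq, cds_start)
    new_end = extend_to_stop(seq, cds_end)
    if new_start is None or new_end is None:
        return None
    return new_start, new_end
```

A real implementation would also need to handle the reverse strand and terminal exons whose aligned boundary falls mid-codon.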
Briefly, these three gene callers were combined in the following manner:
- FGENESH was run on the entire genomic sequence to provide an initial set of predicted genes. Each FGENESH prediction was put into a set of EVIDENCE_GENES.
- The genome was also searched against the non-redundant protein database using BLASTX.
- Regions of the genome with blastx homology spanning over 80% of a protein (when sub-alignments are stitched together in a consistent fashion) were considered "Homologous Gene Regions" (HGRs).
- HGRs were clustered into groups of HGRs that all implicated the same gene structure (most often representing groups of essentially orthologous proteins).
- For each cluster of HGRs, the protein showing the most sequence similarity to the genome was passed to both FGENESH+ and GENEWISE to produce two gene predictions, provided the protein had >80% amino acid identity to the translated genome (cumulative across sub-alignments).
- If the protein used in the previous step had >90% amino acid identity to the translated genome (cumulative across sub-alignments), then the GENEWISE call, if valid, was favored over the FGENESH+ call, and was used as the EVIDENCE_GENE for the HGR (see Structure Prediction Validation for the reason why) and added to the set of EVIDENCE_GENES. If this protein had >80% but less than 90% amino acid identity to the translated genome (cumulative across sub-alignments), then the FGENESH+ call, if valid, was favored over the GENEWISE call, and was used as the EVIDENCE_GENE for the HGR (see Structure Prediction Validation for the reason why) and added to the set of EVIDENCE_GENES.
- When EVIDENCE_GENES overlapped in their exons, the EVIDENCE_GENE with the least amount of homology support (as measured by the sequence similarity of the protein used to make the call or zero for FGENESH calls) was removed from the set of EVIDENCE_GENES.
- All remaining EVIDENCE_GENES were then called as our official ANNOTATED_GENES and passed to the next step of gene calling for Gene Naming.
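The selection and overlap-resolution steps above can be sketched as follows. This is an illustration under stated assumptions rather than the pipeline's actual code: the `GeneCall` type and function names are hypothetical, falling back to the other caller when the favored call is invalid is an interpretation of "if valid", a protein at exactly 90% identity is placed in the lower tier because the text does not cover that case, strand is ignored, and overlaps are resolved greedily in order of decreasing homology support.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class GeneCall:
    source: str                     # "FGENESH", "FGENESH+", or "GENEWISE"
    exons: List[Tuple[int, int]]    # half-open (start, end) intervals
    support: float = 0.0            # protein similarity; 0 for FGENESH calls
    valid: bool = True


def choose_hgr_call(identity: float, genewise: GeneCall,
                    fgenesh_plus: GeneCall) -> Optional[GeneCall]:
    """Pick the EVIDENCE_GENE for one HGR from the two protein-based calls."""
    if identity > 90:
        preferred, fallback = genewise, fgenesh_plus
    elif identity > 80:
        # Assumption: exactly 90% identity is treated as the lower tier.
        preferred, fallback = fgenesh_plus, genewise
    else:
        return None  # protein too divergent to seed a call
    if preferred.valid:
        return preferred
    # Assumption: use the other caller when the favored call is invalid.
    return fallback if fallback.valid else None


def exons_overlap(a: GeneCall, b: GeneCall) -> bool:
    """True if any exon of a shares at least one base with an exon of b."""
    return any(s1 < e2 and s2 < e1
               for s1, e1 in a.exons
               for s2, e2 in b.exons)


def resolve_overlaps(calls: List[GeneCall]) -> List[GeneCall]:
    """Keep the best-supported call among calls whose exons overlap."""
    kept: List[GeneCall] = []
    for call in sorted(calls, key=lambda c: c.support, reverse=True):
        if not any(exons_overlap(call, k) for k in kept):
            kept.append(call)
    return kept
```

Because plain FGENESH calls carry a support of zero, any homology-based call that overlaps one of them takes its place in the final set.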
Gene Naming
Genes are assigned names VERY CONSERVATIVELY. Because this is a purely automated gene prediction process, we do not want to propagate misinformation by transferring unverified functional names for genes in one species to predicted genes in another species.
We hope to improve the gene naming process in the future based on Gene Ontology categories.
There are currently 5 types of gene names, which fall into 3 categories:
- NAME, or
  hypothetical protein similar to NAME, or
  conserved hypothetical protein
  Assigned to gene predictions where there is excellent homology to a known NR protein. The criteria for this category are:
  - Top BlastP hit to a known NR protein (complexity filtering off -F F, expect <= 1e-5), with
  - >=80% identity and >=80% coverage of both the query and subject sequences.
  The exact name is assigned:
  - NAME if the homologous protein is from the curated SwissProt gene set (i.e. we trust the gene name), otherwise
  - conserved hypothetical protein if the homologous protein NAME contains a word in the set {hypothetical, homolog, probable, putative, similar to, predicted, unnamed, unknown} (i.e. we do not want to transfer suspect names), otherwise
  - hypothetical protein similar to NAME
- Hypothetical protein
  Assigned to gene predictions that show significant BlastP homology to a protein in NCBI's non-redundant protein set (NR) or that overlap an EST alignment. The criteria for this category are:
  - BlastP hit to NR (complexity filtering off -F F, expect <= 1e-5), or
  - EST hit (>=300 nt, >=98% identity, >95% coverage) which overlaps the gene
- Predicted protein
  Assigned to gene predictions that have neither an EST alignment nor significant BlastP homology to any protein in NCBI's non-redundant set of proteins (NR) at the time that the complete BlastP analysis was performed on the gene set. The criteria for this category are:
  - No BlastP hit to NR (complexity filtering off -F F, expect <= 1e-5), and
  - No EST hit (>=300 nt, >=98% identity, >95% coverage) which overlaps the gene
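The naming rules above amount to a small decision procedure, sketched here for illustration. The `BlastHit` type, its field names, and `assign_name` are hypothetical, and the sketch assumes the EST criteria (length, identity, coverage, overlap with the gene) have already been evaluated and reduced to a single boolean.

```python
from dataclasses import dataclass
from typing import Optional

SUSPECT_WORDS = ("hypothetical", "homolog", "probable", "putative",
                 "similar to", "predicted", "unnamed", "unknown")


@dataclass
class BlastHit:
    name: str             # description of the top NR protein hit
    identity: float       # percent identity
    query_cov: float      # percent of the query covered
    subject_cov: float    # percent of the subject covered
    evalue: float
    is_swissprot: bool    # hit comes from the curated SwissProt set


def assign_name(top_hit: Optional[BlastHit], has_est_hit: bool) -> str:
    """Return the gene name for one prediction, following the three categories."""
    if top_hit is not None and top_hit.evalue <= 1e-5:
        excellent = (top_hit.identity >= 80
                     and top_hit.query_cov >= 80
                     and top_hit.subject_cov >= 80)
        if excellent:
            if top_hit.is_swissprot:
                return top_hit.name
            lowered = top_hit.name.lower()
            if any(word in lowered for word in SUSPECT_WORDS):
                return "conserved hypothetical protein"
            return f"hypothetical protein similar to {top_hit.name}"
        return "hypothetical protein"
    if has_est_hit:
        return "hypothetical protein"
    return "predicted protein"
```

Testing the SwissProt condition before the suspect-word check follows the order in which the rules are listed.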
Gene Locus Numbers
Every annotated gene is given a Locus Number of the form FG##### that should be considered the only guaranteed way to identify a gene uniquely. Each locus number is guaranteed to identify a unique gene even across different assemblies. Loci are simply identifiers and are not guaranteed to have any particular order or internal structure. We feel that it is a bad idea to encode attributes of an object, such as position, in its identifier. Position is an attribute of a gene that can be retrieved using the locus.
With each new assembly, we do our best to map all genes from the previous assembly and thus preserve loci. Any loci that cannot be mapped will be retired. New genes will receive new loci. Each gene also has a version attribute (so loci are in fact displayed as FG#####.version). When genes are mapped from one assembly to another or if a gene call is altered, we will increment this version. All the loci in a particular release will have the same version number so that we can ensure consistency.
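For readers handling these identifiers programmatically, the following is a minimal sketch of splitting a displayed locus into its stable part and its version. It assumes exactly five digits after the FG prefix (as the FG##### pattern suggests) and an integer version; both are assumptions, and the function names are hypothetical.

```python
import re
from typing import Tuple

# Assumed display format: "FG", five digits, ".", integer version.
LOCUS_PATTERN = re.compile(r"(FG\d{5})\.(\d+)")


def parse_locus(display_id: str) -> Tuple[str, int]:
    """Split a displayed identifier into (locus, version)."""
    match = LOCUS_PATTERN.fullmatch(display_id)
    if match is None:
        raise ValueError(f"not a versioned locus: {display_id!r}")
    return match.group(1), int(match.group(2))


def format_locus(locus: str, version: int) -> str:
    """Build the displayed form from a locus and a release version."""
    return f"{locus}.{version}"
```

Only the locus part should be used as a key when tracking a gene across releases, since the version changes with each release.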
Structure Prediction Validation
For a description of the structure prediction validation, please refer to the Neurospora crassa Automated Gene Calling page.
Minimum ORF Length
The gene prediction programs produced a large number of very short open reading frames.
A final filter was applied to the gene set to discard open reading frames shorter than 100 amino acids if there was no other evidence supporting the gene prediction. Any short genes that showed BLAST homology or BLAT alignment to ESTs were retained.
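The filter reduces to a single predicate, sketched below. The function and parameter names are hypothetical, and the two evidence flags are assumed to have been computed beforehand.

```python
MIN_ORF_LENGTH = 100  # amino acids


def keep_gene(protein_length: int, has_blast_homology: bool,
              has_est_alignment: bool) -> bool:
    """Retain a gene if it is long enough or has supporting evidence."""
    if protein_length >= MIN_ORF_LENGTH:
        return True
    return has_blast_homology or has_est_alignment
```

Under this reading, a prediction of exactly 100 amino acids is kept without further evidence, since only open reading frames shorter than 100 amino acids are subject to the filter.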
