Gene Finding Methods
Outline
- Overview
- Gene Structure Prediction
- Gene Naming
- Summary of automated gene naming
- Gene Locus Numbers
- Structure Prediction Validation
Overview
This document describes some of the details of the method used to produce the automated gene calls for the Rhizopus oryzae genome. Automated gene calls were produced in a three-step procedure:
- Gene location and structures were predicted using a combination of FGENESH and GENEID. This process is described in section Gene Structure Prediction.
- Gene names were assigned to predicted gene structures based on homology to previously annotated genes. This process is described in section Gene Naming.
- The newly created genes were compared against EST data to evaluate accuracy. This process is described in section Structure Prediction Validation.
Gene Structure Prediction
Gene structures were predicted using a combination of FGENESH and GENEID. FGENESH is a commercial gene prediction program sold by Softberry, while GENEID, by Enrique Blanco and Roderic Guigó, is available under the GPL. The version of GENEID used for these calls was 1.2a. FGENESH is unversioned.
FGENESH uses a statistical model of gene structure that requires training on each organism for accurate prediction. Softberry performed the training on Rhizopus oryzae sequences. GENEID is an ab initio gene caller and was run with the default parameters after being trained on a training set of 500 manually-curated genes.
Briefly, the results from these two gene callers were combined in the following manner:
- Both FGENESH and GENEID were run on the entire genomic sequence to provide an initial set of predicted genes. This resulted in a set containing 18,285 FGENESH predictions and 14,554 GENEID predictions.
- Next, short predictions were filtered out. Any gene prediction less than 30aa (90nt) long was dropped. In addition, any gene prediction less than 50aa (150nt) long was dropped, unless it was overlapped by another gene prediction, BLAST evidence, a hmmer hit, or EST evidence. Applying these criteria removed 1515 genes from consideration.
- 806 manually-annotated genes were transferred onto RO3 from the 919 manually annotated genes on RO1. Due to changes in the assembly from release 1 to 3, the remaining 103 manually-annotated genes could not be cleanly mapped to one region of the genome and were discarded.
-
Wherever remaining FGENESH and GENEID predictions had overlapping exons, on identical or opposing strands, we clustered them into separate groups for further analysis. A set of non-overlapping genes was chosen from each cluster by ordering the genes from "best" to "worst", picking the "best" gene, then going down the list and adding any lower-ranked genes that did not overlap ones already selected. Genes were ranked according to the following criteria:
- A gene was considered to have "good BLAST coverage" if it overlapped at least three BLAST hits from different taxa, each with ≥ 60% average identity and ≥ 80% query coverage.
- Predictions with good BLAST coverage were chosen above predictions that did not have good BLAST coverage.
- If two predictions both had good BLAST coverage, we computed the average overlapping BLAST length for each gene; namely, the average length of the three BLAST hits as defined above. (When there were more than three, the average was computed from the three with the highest scores.) When both genes were longer than their respective average overlapping BLAST lengths, the gene closer in length to the average overlapping BLAST length was chosen first.
- Otherwise, GENEID was preferred to FGENESH.
- After sorting and selecting the highest-ranked predictions, the resultant gene set contained 17,467 genes.
Gene Naming
Genes are assigned names very conservatively. As this is a purely automated gene prediction process, we do not want to propagate misinformation by transferring unverified functional names for genes in one species to predicted genes in another species.
We hope to improve the gene naming process in the future based on Gene Ontology categories.
There are currently 5 types of gene name that fall into 3 categories:
- NAME, or
hypothetical protein similar to NAME, or
conserved hypothetical proteinAssigned to gene predictions where there is excellent homology to an known NR protein. The criteria for this category are:
- At least one BLASTP hit to a known NR protein (complexity filtering off, -F F, expect ≤ 1e-5), with
- ≥ 80% identity and ≥ 80% coverage of both the query and subject sequence.
The name will follow one of these three formats:
- conserved hypothetical protein if the homologous protein NAME contains a word indicating the name has not been verified: {fragment, homolog, hypothetical, like, predicted, probable, putative, related, similar, synthetic, unknown, unnamed}, otherwise
- NAME if the homologous protein is from the curated Swiss-Prot gene set, otherwise:
- hypothetical protein similar to NAME
Where there is more than one suitable name for a BLAST hit, we prefer Swiss-Prot names to non-Swiss-Prot names. If there are multiple distinct BLAST hits we choose the one with the highest average identity × the amount of overlap to the target gene.
In all cases we take the NR protein name and filter out the species name, GIs, parenthetical comments, extra whitespace, etc.
- Hypothetical protein
Assigned to gene predictions that show significant BLASTP homology to a protein in NCBI's protein set NR. The criteria for this category are:
- BLASTP hit to NR (complexity filtering off, -F F, expect ≤ 1e-5)
- Predicted protein
Assigned to gene predictions that do not show significant BLASTP homology to any proteins in NCBI's non-redundant set of proteins (NR) at the time that the complete BLASTP analysis was performed on the gene set.
Summary of automated gene naming
| "conserved hypothetical protein" | 381 |
| "hypothetical protein" | 9728 |
| "predicted protein" | 7279 |
| other non-empty name | 79 |
Gene Locus Numbers
Every annotated gene is given a Locus Number of the form RO3_G##### that should be considered the only guaranteed way to identify a gene uniquely. Each locus number is guaranteed to identify a unique gene even over different assemblies. Loci are simply identifiers and are not guaranteed to have any particular order or internal structure. We feel that it is a bad idea to encoding attributes of an object, such as position, in its identifier. Position is an attribute of a gene that can be retrieved by the locus.
Structure Prediction Validation
To evaluate the accuracy of our gene predictions for Rhizopus oryzae assembly 3, we created a set of reference gene models exclusively from EST data. We then compared the two gene sets using a variety of metrics. In the tables below, we refer to the final, published gene set as the query and the EST-based gene set as the reference.
The Feature comparisons and Splice analysis sections only report on the subset of query genes that overlap reference genes. Although a substantial number of predicted genes overlap EST alignments, the majority do not. Because we use EST data to improve our gene calls, we expect lower accuracy in regions that lack supporting EST evidence, on the order of 5–10%. Therefore, while they are a useful measure of gene prediction accuracy, the numbers reported in these two sections do not apply evenly to all predicted genes.
Overview of query genes
17467 genes: 17467 transcripts,
13540 spliced, 3927 unspliced
57981 exons, 40514 introns
| min | median | mean | max | |
|---|---|---|---|---|
| overall length (incl. UTR) | 90 | 771 | 1028 | 17823 |
| coding length | 90 | 771 | 1028 | 17823 |
| exons per transcript | 1 | 3 | 3.32 | 31 |
| exons per spliced transcript | 2 | 3 | 3.99 | 31 |
| bp per exon | 1 | 166 | 309 | 9317 |
| bp per intron | 27 | 57 | 79 | 1451 |
len |
%cov |
%gc |
%at |
|
| exonic | 17963867 | 39.69 | 40.60 | 59.40 |
| intronic | 3205485 | 7.08 | 29.97 | 69.93 |
| intergenic | 24093125 | 53.23 | 32.62 | 67.39 |
Overview of reference genes
4027 genes: 5315 transcripts,
4291 spliced, 1024 unspliced
16872 exons, 11557 introns
| min | median | mean | max | |
|---|---|---|---|---|
| overall length (incl. UTR) | 80 | 717 | 763 | 3312 |
| coding length | 80 | 717 | 763 | 3312 |
| exons per transcript | 1 | 3 | 3.17 | 14 |
| exons per spliced transcript | 2 | 3 | 3.69 | 14 |
| bp per exon | 1 | 154 | 240 | 3274 |
| bp per intron | 4 | 54 | 62 | 1239 |
len |
%cov |
%gc |
%at |
|
| exonic | 3069565 | 6.78 | 39.79 | 60.21 |
| intronic | 500130 | 1.10 | 28.23 | 71.73 |
| intergenic | 41692782 | 92.11 | 35.38 | 64.62 |
Feature mapping
Searched for mappings from 17467 query features to 4027 reference features.
Reference genes farther than 200bp from any query gene: 84
Containing 88 transcripts:
58 unspliced,
10 with non-canonical splices,
20 with canonical splices.
These may indicate missed genes.
113 query genes (0.6%) hit a reference
on the opposite strand.
50 reference genes (1.2%) hit a query on
the opposite strand.
Gene hits on opposite strands may be orientation problems in the reference set and are considered misses in the following comparisions. The table below shows how query genes (rows, roman type) map to reference genes (columns, italic type). We cluster genes into overlapping groups on the same strand and record eight types of relationship. To the left of each arrow is the number of query genes in each type of cluster; the number of reference genes is to the right.
| none | reference one |
many | ||||||||
|---|---|---|---|---|---|---|---|---|---|---|
| none | - | 0 | ↔ | 180 | 0 | ↔ | 0 | |||
| query | one | 13991 | ↔ | 0 | 3069 | ↔ | 3069 | 367 | ↔ | 752 |
| many | 0 | ↔ | 0 | 30 | ↔ | 15 | 10 | ↔ | 11 | |
| counts per cluster type | ||||||||||
| none | reference one |
many | ||||||||
| none | - | 0 | ↔ | 4.5 | 0 | ↔ | 0 | |||
| query | one | 80.1 | ↔ | 0 | 17.6 | ↔ | 76.2 | 2.1 | ↔ | 18.7 |
| many | 0 | ↔ | 0 | 0.2 | ↔ | 0.4 | 0.1 | ↔ | 0.3 | |
| percentages per cluster type | ||||||||||
one↔one clear, unambiguous
mappings from one query gene to one reference gene.
none↔one reference genes that do not map to
any query gene (misses).
none↔many reference genes that
overlap other reference genes but do not map to query
genes. None↔X mappings may represent
underpredictions.
one↔none query
genes that do not map to any reference gene.
many↔none query genes that overlap other
query genes but do not map to reference genes.
X↔none mappings may represent
overpredictions or deficiencies in the
reference gene set.
one↔many single
query genes overlapped by multiple
reference genes (merges). These may occur if the
reference set contains partial genes.
many↔one single reference genes
spanning multiple query genes (splits). These are often
annotation errors.
many↔many complex clusters where
multiple query genes map to multiple reference genes.
Feature comparisons
Predicted genes that missed a reference gene, or hit one on the wrong strand, are ignored in the following calculations. We select the single best transcript↔transcript match when comparing genes with multiple transcripts. For clusters in the one↔many, many↔one, and many↔many categories above, one query↔reference gene comparison is chosen per cluster.
| #comp | TP | TN | FP | FN | ||
|---|---|---|---|---|---|---|
| nucleotides | 3456 | 2195167 | 373626 | 9943 | 47513 | |
| splice sites | 3240 | 11184 | - | 932 | 1015 | |
sn |
sp |
smc |
acp |
ac |
cc |
|
| nucleotides | 0.9788 | 0.9955 | 0.9781 | 0.9589 | 0.9178 | 0.9171 |
| splice sites | 0.9168 | 0.9231 | 0.8517 | - | - | - |
#comp number of feature:feature comparisons. The
number of comparisons on the splice level may be lower
because our splice comparison algorithm ignores
reference genes with non-canonical splices.
TP true positives: nucleotides
predicted as exonic in both query and reference; or,
splice junctions with exact agreement, in position and
type (donor:donor, acceptor:acceptor), between query and
reference.
TN true negatives:
nucleotides predicted as intronic in both query and
reference; not defined for splice sites.
FP false positives (overpredictions):
nucleotides marked as exonic in the query but intronic in the
reference; or, splice junctions identified only in the
query.*
FN false negatives
(underpredictions): nucleotides marked as
intronic in the query but exonic in the reference; or, splice
junctions identified only in the reference.*
sn sensitivity,
sp specificity,
smc simple matching
coefficient,
acp average
conditional probability.
ac approximate correlation.
cc correlation coefficient.
These last three terms are described in detail in the
citation below, and are not well defined for splices.
*In the very rare cases where the query predicts a donor site exactly
where the reference has an acceptor, or vice versa,
we count it as ½ FP and ½ FN. Terminology is
from Burset M and Guigo R, "Evaluation of gene
structure prediction programs." Genomics, 1996 Jun
15;34(3):353-67.
The table below briefly summarizes the relative coverage of overlapping query and reference genes. The first column shows the amount of sequence (in bp and percent) in each gene set that is overlapped by sequence in the other gene set. The other columns show the overhanging sequence on both the 5′ and 3′ sides. As above, query genes that miss the reference, or reference genes missed by the query, are not considered in this table.
| overlap | overhang, 5′ | overhang, 3′ | |||||||
|---|---|---|---|---|---|---|---|---|---|
| query | 2626249 bp | / | 50.2% | 1410859 bp | / | 27.0% | 1191513 bp | / | 22.8% | reference | 2626249 bp | / | 86.8% | 179120 bp | / | 5.9% | 221753 bp | / | 7.3% |
Splice analysis
11184 splice agreements, 1947 disagreements,
6067 ignored at ends.
perfect exon:exon/intron:intron matches: 3194/5309
1 query transcripts contained noncanonical splices.
216 reference transcripts contained noncanonical splices
and were ignored in the following comparisions.
| transcripts with no splice problems | 2537 | 78.3% |
| explainable by alternate splicing | 370 | 11.4% |
| clashes | 333 | 10.3% |
Transcripts that have a splice site disagreement with an overlapping reference gene are placed into two categories, depending on the severity of the clash. If all splice disagreements could be explained by well-known types of alternate splicing, we call the transcript a "possible alternate splice." If the two transcripts cannot be reconciled in this way, we label the query a "clash." In partitioning splice disagreements into two categories, we are not asserting that 11.4% of this genome shows alternate splicing. We do this as a form of triage: genes in the "clash" category are manually inspected before release.
| cassette exons (cryptic/skipped): | 20/139 |
| retained introns (cryptic/skipped): | 8/13 |
| modified 5′ splice (query exon longer/shorter): | 40/169 |
| modified 3′ splice (query exon longer/shorter): | 69/115 |
cassette exon an exon in one transcript found within an
alternate transcript's intron.
retained
intron an intron in one transcript found within an
alternate transcript's exon.
cryptic present in the query but not in the
reference.
skipped present in the
reference but not in the query.
5′, 3′ splice sites are labeled by their
position relative to introns: e.g., a
modified 3′ splice site is on the leading edge of an
exon.
Terminology is from Matlin AJ, et. al.
Understanding alternative splicing:
towards a cellular code. Nat Rev Mol Cell
Biol. 2005 May;6(5):386-98.
Possible problems
| RO3_GENETRANSFER_19 | 781 | ||||||
|---|---|---|---|---|---|---|---|
| RO3_FGENESH_2 | 4166 | ||||||
| RO3_MANUAL_1 | 38 | ||||||
| RO3_GENEID_1 | 12482 | ||||||
| short proteins < 50aa | 345 | 6 | 62 | 2 | 275 | ← not tallied in problems | |
| shorter proteins < 30aa | 0 | - | - | - | - | ||
| very short proteins < 10aa | 0 | - | - | - | - | ||
| exon-less transcripts | 0 | - | - | - | - | ||
| initial exon ≤ 6bp | 462 | 41 | 127 | 2 | 292 | ||
| internal exon ≤ 6bp | 53 | 1 | 52 | 0 | 0 | ||
| terminal exon ≤ 6bp | 254 | 6 | 69 | 0 | 179 | ||
| intron ≥ 1000bp | 1 | 0 | 1 | 0 | 0 | ||
| coding length not mod 3 | 15 | 11 | 4 | 0 | 0 | ||
| first codon not START | 17 | 2 | 1 | 0 | 14 | ||
| last codon not STOP | 30 | 4 | 5 | 0 | 21 | ||
| contains 1+ N in exon | 0 | - | - | - | - | ||
| non-canonical splicing | 1 | 0 | 0 | 1 | 0 | ||
| overlapping | 0 | - | - | - | - | ||
| spanning contigs | 20 | 0 | 11 | 0 | 9 | ||
| one or more problems | 834 | 60 | 261 | 3 | 510 | ||
