Gene Finding Methods

Outline

Overview

This document describes some of the details of the method used to produce the automated gene calls for the Rhizopus oryzae genome. Automated gene calls were produced in a three-step procedure:

  • Gene location and structures were predicted using a combination of FGENESH and GENEID. This process is described in section Gene Structure Prediction.
  • Gene names were assigned to predicted gene structures based on homology to previously annotated genes. This process is described in section Gene Naming.
  • The newly created genes were compared against EST data to evaluate accuracy. This process is described in section Structure Prediction Validation.

back to top

Gene Structure Prediction

Gene structures were predicted using a combination of FGENESH and GENEID. FGENESH is a commercial gene prediction program sold by Softberry, while GENEID, by Enrique Blanco and Roderic Guigó, is available under the GPL. The version of GENEID used for these calls was 1.2a. FGENESH is unversioned.

FGENESH uses a statistical model of gene structure that requires training on each organism for accurate prediction. Softberry performed the training on Rhizopus oryzae sequences. GENEID is an ab initio gene caller and was run with the default parameters after being trained on a training set of 500 manually-curated genes.

Briefly, the results from these two gene callers were combined in the following manner:

  1. Both FGENESH and GENEID were run on the entire genomic sequence to provide an initial set of predicted genes. This resulted in a set containing 18,285 FGENESH predictions and 14,554 GENEID predictions.
  2. Next, short predictions were filtered out. Any gene prediction less than 30aa (90nt) long was dropped. In addition, any gene prediction less than 50aa (150nt) long was dropped, unless it was overlapped by another gene prediction, BLAST evidence, a hmmer hit, or EST evidence. Applying these criteria removed 1515 genes from consideration.
  3. 806 manually-annotated genes were transferred onto RO3 from the 919 manually annotated genes on RO1. Due to changes in the assembly from release 1 to 3, the remaining 103 manually-annotated genes could not be cleanly mapped to one region of the genome and were discarded.
  4. Wherever remaining FGENESH and GENEID predictions had overlapping exons, on identical or opposing strands, we clustered them into separate groups for further analysis. A set of non-overlapping genes was chosen from each cluster by ordering the genes from "best" to "worst", picking the "best" gene, then going down the list and adding any lower-ranked genes that did not overlap ones already selected. Genes were ranked according to the following criteria:

    • A gene was considered to have "good BLAST coverage" if it overlapped at least three BLAST hits from different taxa, each with ≥ 60% average identity and ≥ 80% query coverage.
    • Predictions with good BLAST coverage were chosen above predictions that did not have good BLAST coverage.
    • If two predictions both had good BLAST coverage, we computed the average overlapping BLAST length for each gene; namely, the average length of the three BLAST hits as defined above. (When there were more than three, the average was computed from the three with the highest scores.) When both genes were longer than their respective average overlapping BLAST lengths, the gene closer in length to the average overlapping BLAST length was chosen first.
    • Otherwise, GENEID was preferred to FGENESH.

  5. After sorting and selecting the highest-ranked predictions, the resultant gene set contained 17,467 genes.

back to top

Gene Naming

Genes are assigned names very conservatively. As this is a purely automated gene prediction process, we do not want to propagate misinformation by transferring unverified functional names for genes in one species to predicted genes in another species.

We hope to improve the gene naming process in the future based on Gene Ontology categories.

There are currently 5 types of gene name that fall into 3 categories:

  1. NAME, or
    hypothetical protein similar to NAME, or
    conserved hypothetical protein

    Assigned to gene predictions where there is excellent homology to an known NR protein. The criteria for this category are:

    • At least one BLASTP hit to a known NR protein (complexity filtering off, -F F, expect ≤ 1e-5), with
    • ≥ 80% identity and ≥ 80% coverage of both the query and subject sequence.

    The name will follow one of these three formats:

    • conserved hypothetical protein if the homologous protein NAME contains a word indicating the name has not been verified: {fragment, homolog, hypothetical, like, predicted, probable, putative, related, similar, synthetic, unknown, unnamed}, otherwise
    • NAME if the homologous protein is from the curated Swiss-Prot gene set, otherwise:
    • hypothetical protein similar to NAME

    Where there is more than one suitable name for a BLAST hit, we prefer Swiss-Prot names to non-Swiss-Prot names. If there are multiple distinct BLAST hits we choose the one with the highest average identity × the amount of overlap to the target gene.

    In all cases we take the NR protein name and filter out the species name, GIs, parenthetical comments, extra whitespace, etc.

  2. Hypothetical protein

    Assigned to gene predictions that show significant BLASTP homology to a protein in NCBI's protein set NR. The criteria for this category are:

    • BLASTP hit to NR (complexity filtering off, -F F, expect ≤ 1e-5)

  3. Predicted protein

    Assigned to gene predictions that do not show significant BLASTP homology to any proteins in NCBI's non-redundant set of proteins (NR) at the time that the complete BLASTP analysis was performed on the gene set.

back to top

Summary of automated gene naming

"conserved hypothetical protein" 381
"hypothetical protein" 9728
"predicted protein" 7279
 other non-empty name 79

back to top

Gene Locus Numbers

Every annotated gene is given a Locus Number of the form RO3_G##### that should be considered the only guaranteed way to identify a gene uniquely. Each locus number is guaranteed to identify a unique gene even over different assemblies. Loci are simply identifiers and are not guaranteed to have any particular order or internal structure. We feel that it is a bad idea to encoding attributes of an object, such as position, in its identifier. Position is an attribute of a gene that can be retrieved by the locus.

back to top

Structure Prediction Validation

To evaluate the accuracy of our gene predictions for Rhizopus oryzae assembly 3, we created a set of reference gene models exclusively from EST data. We then compared the two gene sets using a variety of metrics. In the tables below, we refer to the final, published gene set as the query and the EST-based gene set as the reference.

The Feature comparisons and Splice analysis sections only report on the subset of query genes that overlap reference genes. Although a substantial number of predicted genes overlap EST alignments, the majority do not. Because we use EST data to improve our gene calls, we expect lower accuracy in regions that lack supporting EST evidence, on the order of 5–10%. Therefore, while they are a useful measure of gene prediction accuracy, the numbers reported in these two sections do not apply evenly to all predicted genes.

Overview of query genes

17467 genes: 17467 transcripts, 13540 spliced, 3927 unspliced
57981 exons, 40514 introns

min median mean max
overall length (incl. UTR) 90 771 1028 17823
coding length 90 771 1028 17823
exons per transcript 1 3 3.32 31
exons per spliced transcript 2 3 3.99 31
bp per exon 1 166 309 9317
bp per intron 27 57 79 1451

len

%cov

%gc

%at
exonic 17963867 39.69 40.60 59.40
intronic 3205485 7.08 29.97 69.93
intergenic 24093125 53.23 32.62 67.39

Overview of reference genes

4027 genes: 5315 transcripts, 4291 spliced, 1024 unspliced
16872 exons, 11557 introns

min median mean max
overall length (incl. UTR) 80 717 763 3312
coding length 80 717 763 3312
exons per transcript 1 3 3.17 14
exons per spliced transcript 2 3 3.69 14
bp per exon 1 154 240 3274
bp per intron 4 54 62 1239

len

%cov

%gc

%at
exonic 3069565 6.78 39.79 60.21
intronic 500130 1.10 28.23 71.73
intergenic 41692782 92.11 35.38 64.62

Feature mapping

Searched for mappings from 17467 query features to 4027 reference features.

Reference genes farther than 200bp from any query gene: 84
Containing 88 transcripts: 58 unspliced, 10 with non-canonical splices, 20 with canonical splices.
These may indicate missed genes.

113 query genes (0.6%) hit a reference on the opposite strand.
50 reference genes (1.2%) hit a query on the opposite strand.

Gene hits on opposite strands may be orientation problems in the reference set and are considered misses in the following comparisions. The table below shows how query genes (rows, roman type) map to ref­er­ence genes (col­umns, ita­lic type). We clus­ter genes into over­lap­ping groups on the same strand and re­cord eight types of rela­tion­ship. To the left of each arrow is the num­ber of query genes in each type of clus­ter; the num­ber of ref­er­ence genes is to the right.

none reference
one
many
none - 0 180 0 0
query one 13991 0 3069 3069 367 752
many 0 0 30 15 10 11
counts per cluster type
 
none reference
one
many
none - 0 4.5 0 0
query one 80.1 0 17.6 76.2 2.1 18.7
many 0 0 0.2 0.4 0.1 0.3
percentages per cluster type

one↔one clear, unam­big­uous map­pings from one query gene to one ref­er­ence gene.
none↔one ref­er­ence genes that do not map to any query gene (misses).
none↔many ref­er­ence genes that over­lap other ref­er­ence genes but do not map to query genes. None↔X map­pings may rep­re­sent under­pre­dic­tions.
one↔none query genes that do not map to any ref­er­ence gene.
many↔none query genes that over­lap other query genes but do not map to ref­er­ence genes. X↔none map­pings may repre­sent over­pre­dic­tions or de­fi­cien­cies in the ref­er­ence gene set.
one↔many single query genes over­lap­ped by mul­ti­ple ref­er­ence genes (merges). These may oc­cur if the ref­er­ence set con­tains par­tial genes.
many↔one single ref­er­ence genes span­ning mul­ti­ple query genes (splits). These are often an­no­ta­tion errors.
many↔many com­plex clus­ters where multiple query genes map to multiple ref­er­ence genes.

Feature comparisons

Pre­dict­ed genes that missed a ref­er­ence gene, or hit one on the wrong strand, are ig­nored in the fol­low­ing cal­cula­tions. We se­lect the sin­gle best tran­script↔tran­script match when com­par­ing genes with mul­ti­ple tran­scripts. For clus­ters in the one↔many, many↔one, and many↔many cate­gor­ies above, one query↔ref­er­ence gene com­par­i­son is chosen per clus­ter.

#comp TP TN FP FN
nucleotides 3456 2195167 373626 9943 47513
splice sites 3240 11184 - 932 1015

sn

sp

smc

acp

ac

cc
nucleotides 0.9788 0.9955 0.9781 0.9589 0.9178 0.9171
splice sites 0.9168 0.9231 0.8517 - - -

#comp number of feature:feature com­par­isons. The num­ber of com­par­isons on the splice level may be lower be­cause our splice com­par­ison algo­rithm ig­nores ref­er­ence genes with non-canon­ical splices.
TP true pos­itives: nuc­leo­tides pre­dict­ed as exonic in both query and ref­er­ence; or, splice junc­tions with exact agree­ment, in posi­tion and type (donor:donor, accep­tor:accep­tor), be­tween query and ref­er­ence.
TN true nega­tives: nuc­leo­tides pre­dict­ed as intronic in both query and ref­er­ence; not de­fin­ed for splice sites.
FP false posi­tives (over­pre­dic­tions): nuc­leo­tides marked as exonic in the query but intronic in the refer­ence; or, splice junc­tions ident­ified only in the query.*
FN false nega­tives (under­pre­dic­tions): nuc­leo­tides marked as intronic in the query but exonic in the ref­er­ence; or, splice junc­tions iden­ti­fied only in the ref­er­ence.*
sn sen­si­tiv­ity, TP ÷ (TP+FN).
sp spe­ci­fic­ity, TP ÷ (TP+FP).
smc sim­ple match­ing co­effi­cient, (TP+TN) ÷ (TP+FN+FP+TN).
acp av­er­age con­di­tion­al prob­ability.
ac ap­prox­imate cor­re­la­tion.
cc cor­re­la­tion co­ef­fi­cient. These last three terms are des­cribed in detail in the cit­a­tion below, and are not well de­fined for splices.
*In the very rare cases where the query pre­dicts a donor site exactly where the ref­er­ence has an ac­cep­tor, or vice versa, we count it as ½ FP and ½ FN. Ter­min­ology is from Bur­set M and Guigo R, "Eval­ua­tion of gene struc­ture pre­dic­tion pro­grams." Genomics, 1996 Jun 15;34(3):353-67.

The table below briefly summarizes the relative coverage of overlapping query and reference genes. The first column shows the amount of sequence (in bp and percent) in each gene set that is overlapped by sequence in the other gene set. The other columns show the overhanging sequence on both the 5′ and 3′ sides. As above, query genes that miss the reference, or reference genes missed by the query, are not considered in this table.

overlap overhang, 5′ overhang, 3′
query 2626249 bp / 50.2% 1410859 bp / 27.0% 1191513 bp / 22.8%
reference 2626249 bp / 86.8% 179120 bp / 5.9% 221753 bp / 7.3%

Splice analysis

11184 splice agreements, 1947 disagreements, 6067 ignored at ends.
perfect exon:exon/intron:intron matches: 3194/5309
1 query transcripts contained noncanonical splices.
216 reference transcripts contained noncanonical splices and were ignored in the following comparisions.

transcripts with no splice problems 2537 78.3%
  explainable by alternate splicing 370 11.4%
  clashes 333 10.3%

Transcripts that have a splice site dis­agree­ment with an over­lap­ping ref­er­ence gene are placed into two cate­gor­ies, de­pend­ing on the se­ver­ity of the clash. If all splice dis­agree­ments could be ex­plain­ed by well-known types of al­ter­nate splic­ing, we call the trans­cript a "pos­sible al­ter­nate splice." If the two trans­cripts cannot be rec­on­ciled in this way, we label the query a "clash." In par­ti­tio­ning splice dis­agree­ments into two cat­e­gor­ies, we are not as­sert­ing that 11.4% of this ge­nome shows al­ter­nate splic­ing. We do this as a form of triage: genes in the "clash" cat­e­go­ry are man­ually in­spect­ed before release.

cassette exons (cryptic/skipped): 20/139
retained introns (cryptic/skipped): 8/13
modified 5′ splice (query exon longer/shorter): 40/169
modified 3′ splice (query exon longer/shorter): 69/115

cassette exon an exon in one tran­script found within an al­ter­nate tran­script's intron.
retained intron an intron in one tran­script found within an al­ter­nate tran­script's exon.
cryptic pre­sent in the query but not in the ref­er­ence.
skipped pre­sent in the ref­er­ence but not in the query.
5′, 3′ splice sites are labeled by their po­si­tion rela­tive to in­trons: e.g., a mod­ified 3′ splice site is on the lead­ing edge of an exon.
Ter­mi­nol­ogy is from Mat­lin AJ, et. al. Under­stand­ing al­ter­na­tive splic­ing: to­wards a cel­lular code. Nat Rev Mol Cell Biol. 2005 May;6(5):386-98.

Possible problems

RO3_GENETRANSFER_19 781
RO3_FGENESH_2 4166
RO3_MANUAL_1 38
RO3_GENEID_1 12482
short proteins < 50aa 345 6 62 2 275 ← not tallied in problems
shorter proteins < 30aa 0 - - - -
very short proteins < 10aa 0 - - - -
exon-less transcripts 0 - - - -
initial exon ≤ 6bp 462 41 127 2 292
internal exon ≤ 6bp 53 1 52 0 0
terminal exon ≤ 6bp 254 6 69 0 179
intron ≥ 1000bp 1 0 1 0 0
coding length not mod 3 15 11 4 0 0
first codon not START 17 2 1 0 14
last codon not STOP 30 4 5 0 21
contains 1+ N in exon 0 - - - -
non-canonical splicing 1 0 0 1 0
overlapping 0 - - - -
spanning contigs 20 0 11 0 9
one or more problems 834 60 261 3 510

back to top