Gene Finding

Outline

Overview

This document explains how 8794 automated gene models were produced for the Batrachochytrium dendrobatidis genome.

The annotation was created in two steps:

  1. Final gene structures were predicted by combining predictions from GENEID, FGENESH, EST_GENES and manual annotation. This process is described in Gene Structure Prediction.
  2. Gene names were assigned to predicted gene structures based on homology to previously annotated genes. This process is described in Gene Naming.

Gene Structure Prediction

Gene structures were predicted using a combination of manual annotation, FGENESH, GENEID and EST-based genes called FindORFs. FGENESH is a commercial gene prediction program sold by Softberry, while GENEID, by Enrique Blanco and Roderic Guigo, is available under the GPL. We used two versions of GENEID predictions. Both GENEID and FGENESH were trained using 600 EST-based gene models. GENEID_1 predictions were obtained using the default parameter for the trained GENEID while GENEID_2 predictions were produced by using a modified parameter file with adjusted intron-lengths to reduce concatenation of adjacent gene models. Where multiple predictions overlap each other and EST evidence, we choose the one most in accord with the EST splice sites. If we had sufficient EST coverage to build complete ORFs, we replaced any overlapping predicted gene model with the ORF predicted purely from ESTs.

FindORFs gene models are those clustered EST gene models with a valid start and stop codon. They were built as follows. First, ESTs are aligned to the genome and grouped into loci consisting of overlapping ESTs. Then, each locus is examined for compatible splicing. If two ESTs in the same locus have identical splice sites where they overlap, they are considered fragments of a larger transcript. Putative transcripts are incrementally built out by adding additional ESTs to either end. Each putative transcript is built from one or more ESTs, but may not represent the full biological transcript if the EST coverage is incomplete. We search each putative transcript for ORFs beginning with ATG and ending with a stop codon, with no frame shifts. If a putative transcript contains an ORF longer than 180 bases that covers 1/3 or more of its spliced length, we considered it a valid gene prediction. Further, we select only a subset of putative full-length gene models from these EST-based findORF transcripts that fully overlap the best ab inito gene prediction. An independent blast-based analysis was also carried out to validate both the length and the correctness of the reading frame by comparing them to the best hits known proteins in the NR database.

In the final step, we ran our automated gene caller to select the best gene model for each locus. Targeted manual annotation was carried out in loci where gene predictions clashed with EST evidence or blast evidence. Automated gene models overlapping tRNA/RNA features were removed from the final gene set. The resulting final gene set contains 8794 genes and 8819 transcripts.

Gene Naming

Genes are assigned names very conservatively. As this is a purely automated gene prediction process, we do not want to propagate misinformation by transferring unverified functional names for genes in one species to predicted genes in another species.

We hope to improve the gene naming process in the future based on Gene Ontology categories.

There are currently 5 types of gene name that fall into 3 categories:

NAME, or hypothetical protein similar to NAME, or conserved hypothetical protein

NAME is assigned to gene predictions where there is excellent homology to a known NR protein. The criteria for this category are:

  • At least one BLASTP hit to a known NR protein (complexity filtering off, -F F, expect = 1e-10),
  • A minimum of 50% identity and 70% coverage of both the query and subject sequence.

The name will follow one of these three formats:

  • conserved hypothetical protein if the homologous protein NAME contains a word indicating the name has not been verified: {fragment, homolog, hypothetical, like, predicted, probable, putative, related, similar, synthetic, unknown, unnamed}, otherwise
  • NAME if the homologous protein is from the curated Swiss-Prot gene set, otherwise:

hypothetical protein similar to NAME

  • Where there is more than one suitable name for a BLAST hit, we prefer Swiss-Prot names to non-Swiss-Prot names. If there are multiple distinct BLAST hits we choose the one with the highest average identity to the amount of overlap to the target gene.
  • In all cases we take the NR protein name and filter out the species name, GIs, parenthetical comments, extra white space, etc.

Hypothetical protein. Assigned to gene predictions that show significant BLASTP homology to a protein in NCBI's protein set NR. The criteria for this category are:

  • BLASTP hit to NR (complexity filtering off, -F F, expect = 1e-10)

Predicted protein. Assigned to gene predictions that do not show significant BLASTP homology to any proteins in NCBI's non-redundant set of proteins (NR) at the time that the complete BLASTP analysis was performed on the gene set.

 

 

Gene Numbering

Every annotated gene is given a Locus Number of the form BDEG_##### that should be considered the only guaranteed way to identify a gene uniquely. Each locus number is guaranteed to identify a unique gene even over different assemblies. Loci are simply identifiers and are not guaranteed to have any particular order or internal structure. We feel that it is a bad idea to encode attributes of an object, such as position, in its identifier. Position is an attribute of a gene that can be retrieved by the locus.

Structure Prediction Validation

To evaluate the accuracy of our gene predictions, we created a set of reference gene models exclusively from EST data. We then compared the two gene sets using a variety of metrics. In the tables below, we refer to the final, published gene set as the query and the EST-based gene set as the reference.

The Feature comparisons and Splice analysis sections only report on the subset of query genes that overlap reference genes. Although a substantial number of predicted genes overlap EST alignments, the majority do not. Because we use EST data to improve our gene calls, we expect lower accuracy in regions that lack supporting EST evidence, on the order of 5?10%. Therefore, while they are a useful measure of gene prediction accuracy, the numbers reported in those two sections do not apply evenly to all predicted genes.

Overview of Query Genes

 

8794 genes: 8819 transcripts (7801 spliced, 1018 unspliced; 4438/4381 +/-)
38552 exons, 29733 introns

len%cov%gc%at
genic1605392568.6040.6159.35
intergenic734969331.4036.4163.68
exonic1270441054.2842.4357.57
intronic335426814.3333.7066.10
coding1244126153.1642.5757.43
5′ UTR791250.3440.7659.24
3′ UTR1976600.8434.5165.49
alt. spliced47530.0236.0863.92
genomic23403618100.0039.2960.71

min

median

mean

max
overall length (incl. UTR)901113144414343
coding length901077141214265
exons per transcript134.3744
exons per spliced transcript244.8144
bp per exon116933011725
bp per intron2790113997
5′ UTR bp14694801
3′ UTR bp1831142005

Overview of Reference genes

3678 genes: 4026 transcripts (2748 spliced, 1278 unspliced; 2017/2009 +/-)
11893 exons, 7867 introns

len%cov%gc%at
genic368678915.7540.4859.52
intergenic1971682984.2539.0760.93
exonic286956712.2642.3857.62
intronic8378383.5833.8266.18
coding00.00--
5′ UTR00.00--
3′ UTR00.00--
alt. spliced206160.0935.3964.61
genomic23403618100.0039.2960.71

min

median

mean

max
overall length (incl. UTR)997267722908
coding length----
exons per transcript122.9515
exons per spliced transcript233.8615
bp per exon41672612908
bp per intron2788112993
5′ UTR bp----
3′ UTR bp----

Feature mapping

Searched for mappings from 8794 query features to 3678 reference features. Reference genes farther than 200bp from any query gene: 49
Containing 53 transcripts: 43 unspliced, 10 with canonical splices.
These may indicate missed genes. 72 query genes (0.8%) hit 66 reference genes (1.8%) on the opposing strand. Hits to opposing strands are considered misses in the following comparisions. Strand mismatches such as these may be orientation problems in the reference set, or incorrect predictions in the query set. Unspliced reference transcripts are difficult to orient correctly. 46 reference gene(s) did not contain a single canonical splice site. These may easily have been placed on the wrong strand; queries hitting these genes are likely to be oriented correctly. 20 reference gene(s) had at least one canonical splice site; everything else being equal, their orientation is more likely to be correct than the queries hitting them.

The table below shows how query genes (rows, roman type) map to reference genes (columns, italic type). We cluster genes into overlapping groups on the same strand and record eight types of relationship. To the left of each arrow is the number of query genes in each type of cluster; the number of reference genes is to the right.

nonereference
one
many
none-014000
queryone5646025802580421869
many00117583031
counts per cluster type
 
nonereference
one
many
none-03.800
queryone64.2029.370.14.823.6
many001.31.60.30.8
percentages per cluster type

one↔one clear, unambiguous mappings from one query gene to one reference gene. none↔one reference genes that do not map to any query gene (misses). none↔many reference genes that overlap other reference genes but do not map to query genes. None↔X mappings may represent underpredictions. one↔none query genes that do not map to any reference gene. many↔none query genes that overlap other query genes but do not map to reference genes. X↔none mappings may represent overpredictions or deficiencies in the reference gene set. one↔many single query genes overlapped by multiple reference genes (merges). These may occur if the reference set contains partial genes. many↔one single reference genes spanning multiple query genes (splits). These are often annotation errors. many↔many complex clusters where multiple query genes map to multiple reference genes.

Feature Comparisons

Predicted genes that hit reference genes on the wrong strand, or hit reference genes containing only non?canonically-spliced transcripts, are ignored in the following calculations. When scoring genes with multiple transcripts, we select the single best transcript↔transcript comparison. For clusters in the one↔many and many↔many categories above, each query is compared to as many references as possible. The best non-overlapping comparisons are used to score these loci.

#compTPTNFPFN
nucleotides34486602615258778535840946458
splice sites1789716195-1065637

sn

sp

smc

acp

ac

cc
nucleotides0.98250.99680.98410.97810.95620.9560
splice sites0.96220.93830.9049---

#compTPTNFPFN
introns73466621-23515
exons103069254-4753

sn

sp

sn_sp

wrong

miss
introns0.93120.90130.91630.00210.0320
exons0.90420.89790.90110.00520.0046

#comp the number of subfeatures compared. TP true positives: nucleotides predicted as exonic in both query and reference; or, splice junctions with exact agreement, in position and type (donor:donor, acceptor:acceptor), between query and reference; or, introns or exons with both splice sites in exact agreement with the reference. TN true negatives: nucleotides predicted as intronic in both query and reference; not defined for splice sites, introns or exons. FP false positives (overpredictions): nucleotides marked as exonic in the query but intronic in the reference; or, splice junctions identified only in the query*; or query introns/exons that do not overlap a reference intron/exon. FN false negatives (underpredictions): nucleotides marked as intronic in the query but exonic in the reference; or, splice junctions identified only in the reference*; or reference introns/exons that do not overlap a query intron/exon. sn sensitivity, TP �� (TP+FN). sp specificity, TP �� (TP+FP). smc simple matching coefficient, (TP+TN) �� (TP+FN+FP+TN). acp average conditional probability. ac approximate correlation. cc correlation coefficient. These last three terms are described in detail in the citation below, and are not well defined for splices, introns or exons. *In the very rare cases where the query predicts a donor site exactly where the reference has an acceptor, or vice versa, we count it as both FP and FN. Terminology is from Burset M and Guigo R, "Evaluation of gene structure prediction programs." Genomics, 1996 Jun 15;34(3):353-67.

The table below briefly summarizes the relative coverage of overlapping query and reference genes. The first column shows the amount of sequence (in bp and percent) in each gene set that is overlapped by sequence in the other gene set. The next three columns show the overhanging sequence from transcripts that overlap but are offset. Overhangs are categorized as 5′ (one transcript begins upstream of the other), 3′ (one transcript extends downstream past the other), and mixed (indeterminate). Finally, the last column, "complete miss", counts the number of bp in queries that did not hit a reference at all, and vice versa.

A high degree of overlap can indicate a good match between the query and reference sets. A high "missed" score may simply indicate a small reference set.

overlapoverhang, 5′overhang, 3′overhang, mixedcomplete miss
query3448660 bp/21.5%2210073 bp/13.8%682301 bp/4.2%768965 bp/4.8%8958001 bp/55.8%
reference3448660 bp/91.9%63305 bp/1.7%108584 bp/2.9%21037 bp/0.6%111292 bp/3.0%

Splice Analysis

16195 splice agreements, 1702 disagreements.
4477 ignored: 170 due to EST misalignment, 4307 due to partial initial/terminal exon coverage.
perfect exon:exon/intron:intron matches: 6376/6621
5 query transcripts contained noncanonical splices.

transcripts with no splice problems247278.5%
   ... with complete reference coverage60919.3%
explainable by alternate splicing39412.5%
   ... with spliced reference3129.9%
clashes2829.0%
   ... with spliced reference2277.2%

Transcripts that have a splice site dis��agree��ment with an over��lap��ping ref��er��ence gene are placed into two cate��gor��ies, de��pend��ing on the se��ver��ity of the clash. If all splice dis��agree��ments could be ex��plain��ed by well-known types of al��ter��nate splic��ing, we call the trans��cript a "pos��sible al��ter��nate splice." If the two trans��cripts cannot be rec��on��ciled in this way, we label the query a "clash." In par��ti��tio��ning splice dis��agree��ments into two cat��e��gor��ies, we are not as��sert��ing that 12.5% of this ge��nome shows al��ter��nate splic��ing. We do this as a form of triage: genes in the "clash" cat��e��go��ry are man��ually in��spect��ed before release.

in ref.in query
cassette exons243
retained introns19015
early 3′ splices3971
late 5′ splices7070

cassette exon an exon that falls completely with an intron of a variant transcript. Such exons may represent alternative splice forms but are more likely instances of exonic over- and under-prediction. retained intron an intron that falls within the exon of a variant transcript. These introns may indicate alternative splicing but usually are over- and under-predicted introns. early 3′ splices two introns agree on their 5′ splice site but differ on the 3′ side, relative to the affected intron. In other words, differing 3′ splice sites lie on the leading edge of an exon. late 5′ splices two introns agree on their 3′ splice site but differ on the 5′ side, again relative to the affected intron. Most ter��mi��nol��ogy is from Mat��lin AJ, et. al. Under��stand��ing al��ter��na��tive splic��ing: to��wards a cel��lular code. Nat Rev Mol Cell Biol. 2005 May;6(5):386-98.

Possible Problems

{multiple sources}3849
BD_JEL423_FINDORFS_5266
BD_JEL423_FGENESH_61991
BD_JEL423_GENEID_1397
BD_JEL423_MANUAL_156
BD_JEL423_GENEID_22260
short proteins < 50aa24704319← not tallied in problems
shorter proteins < 30aa0------
very short proteins < 10aa0------
initial exon ≤ 6bp107290396033
internal exon ≤ 6bp171214000
terminal exon ≤ 6bp10237038054
≥ 15 exons2012808122862
intron ≥ 1000bp0------
intron ≤ 20bp0------
first codon not Met4200020← not tallied in problems
first codon not xTG3200010
first codon not known START3200010
last codon not STOP12603021
contains in-frame STOP1000010
coding length not modulo 31000010
non-canonical splicing3000030
has ≥1 good BLAST hit35261464153116927640424← not tallied in problems
≤1/3 as long as BLAST hit0------
≥3�� longer than BLAST hit2078106332229
contains ≥1 N in exon0------
contains low-quality sequence1958799384829520524← not tallied in problems
touches gap(s)4730286010
spans contigs4730286010
within 1kb of contig edge4301754103252121← not tallied in problems
any overlap (UTR or CDS)593027325818← in 59 clusters
CDS overlap only0------
CDS overlap > 50bp0------
CDS overlap > 100bp0------
CDS overlap > 200bp0------
has predicted UTR18868422645177950134← not tallied in problems
UTR ≥ 50% length6823256365← not tallied in problems
UTR is spliced17879461591613← not tallied in problems
one or more problems767207292447019198