Gene Finding Methods

Outline

Overview

This document provides a general description of our automated genome annotation for eukaryotic genomes.

Gene annotation at Broad is a multi-step process.


back to top

Evidence collection

  • Blast evidence:
  • Blast homology search against the Genbank's NR database produces a set of raw blast out put. Individual blast alignments are then clustered into single blast clusters by linking the blast alignments derived from the same blast hit. Several such overlapping blast clusters on the genomic axis represents what we call as blast loci on the genome assembly. Currently, all blast data with e-values greater than 1e-10 are used considered as usable blast evidence.

  • EST evidence:
  • For most of our fungal and other eukaryotic genomes, we use species-specific ESTs sequenced here are broad as well as publicly available EST sequences from genbank to produce a set of high confidence gene models.

  • Pfam domains:
  • We run Hmmer searches using pFAM library to find pFAM domains on six-frame translations of the genomic sequence.

  • EST alignments:
  • First, we align ESTs to the genome using BLAT and then collapse them into distinct clusters transcripts using a Broad-developed program called CallReferenceGenes (described below). EST alignments with 90% identity over 50% of the EST length with canonical splice junctions are considered valid EST alignments suitable for building gene models.


back to top

Building EST-based gene models

EST clusters produced in the previously described section are used as inputs for building high confidence gene models for the purposes of training and evaluating gene finding tools.

FindEstOrf is an internal tool used for constructing CDS from EST transcripts. This tool assigns start and stop codons to create a longest ORFs. We then compare these constructed ORFs to overlapping BLAST evidence and pick the ones with comparable reading frame and protein length to the best blast hits from closely related species. FL_EST gene models created by this process are fully covered by EST alignments and are within + 20% of the length of homologous proteins.

In addition, we also build EST-based gene models by a manual process. By combining blast, EST and ab initio predictions produced using a parameter file from a related genome, manual annotators carefully build gene models that are otherwise missed by our highly conservative automated findORF script.


back to top

Computational prediction of tRNAs and RNAs

tRNA scan is used for detecting tRNAs on the genome assembly. We use RNA finding programs such as RNAmmer and RFamSearch to detect the common RNA features.


back to top

Training Gene Prediction Programs

Commonly used gene finding programs such as Augustus, GENEID, GeneMark, FGENESH and SNAP are trained in house or by the developers of these programs using the high confidence EST gene sets.


back to top

Computational Gene Models

Gene structures are predicted using a combination of gene models from computational gene prediction programs such as FGENESH, GENEID, GeneMark and EST-based automated and manual gene models. FGENESH is a commercial gene prediction program sold by Softberry, while GENEID, by Enrique Blanco and Roderic Guigo, is available under the GPL. GeneMark is another gene finding tool developed by the Mark Borodovsky's group (Borodovsky & McIninch, Comp. Chem., 1993). Both GENEID and FGENESH are usually trained using a set of high confidence EST-based gene models generated by clustering blat-aligned species-specific ESTs. In addition, we also use other commonly used gene finding programs such as SNAP (Ian Korf) and Augustus (Mario Stanke) if and when genome-specific parameter files are available.

After training the gene prediction programs, we run each of them on the assembly and evaluate their performance by comparing the gene models with EST and blast evidence. Those that perform adequately are used in the automated gene calling pipeline. The modular architecture of our automated pipeline makes it easy to incorporate new gene prediction programs and customize the pipeline to suit the genomes annotated.


back to top

Transfer Annotation

We use well-curated annotations from other sources and reference genomes to improve our automated annotations. Broad's in-house synteny-based gene transfer process has two main steps:

First, we generate collinear block alignments between the two genomes by creating pair-wise alignments between the two genomes. Further, a global alignment is generated for the entire region the collinear block covers.

Second, an in-house gene mapping program then transfers genes from reference onto the target genome within the specific syntenic blocks. We use genewise to further refine a gene model at each locus.


back to top

Selection of Consensus Gene Models

Broad's automated gene calling process uses a rule-based selection process to evaluate the evidence and build consensus gene models.

Ab initio predictions, blast and EST alignments, reference gene models, and manual and automated EST-gene models are clustered into potential gene loci. We select the most likely non-conflicting gene models based on the best evidence available at each locus. Our method uses heuristics such as splice agreement with ESTs and relative overlap with the BLAST hits to choose the prediction most in accord with the evidence. It does not have an internal model of gene structure and thus runs on a wide variety of eukaryotic and prokaryotic organisms without training.

Our gene selection process offers several useful options to exclude calling genes at certain loci. For example, we can exclude genes in regions with tRNA, rRNA and known repeat elements using a conservative overlap criterion.

Gene models with problems are tagged appropriately with curation flags and notes in the gene report to indicate potential problems. Despite all the progress in the field of gene finding, accurate gene finding on draft genomes is still a challenge. We make an effort to track easily identifiable problematic gene models and tag them with appropriate curation flags to alert the users of the nature of the problems. These tags are also used by manual annotators to specifically target manual editing and fine-tuning of bad gene models.


back to top

Predicting Alternatively Spliced Transcript Models

We do not predict alternatively spliced transcript unless there is manual or full-length EST evidence in support of their existence. Transcript models that differ only with respect to un-spliced 3' and 5' ends with the reference gene models are not considered evidence for alternative splicing. We include only canonically spliced and uniquely aligned ESTs with alternate splice junctions as valid alternatively spliced transcripts.


back to top

Predicting UTR

When an EST alignment uniquely aligns with >= 95% identity and overlaps a gene prediction, and the region of overlap has absolute labeling agreement (e.g., every nucleotide in the region of overlap is exonic in both the prediction and the alignment, or is intronic in both), we consider the prediction and alignment to be compatible. UTR predictions are generated by walking out along these compatible EST alignments from the end(s) of each prediction. A chain of one or more overlapping, compatible EST alignments that begin at the 5' or 3' end forms each UTR extension. When an EST aligns to more than one location on the genome, or touches more than one gene prediction (on either strand) it is ignored.


back to top

Assigning Gene Product Names

We do not assign gene symbols. Instead, our gene naming protocol currently relies on high confidence blast homology and in some cases, community inputs to assign gene product names. We hope to improve the gene naming process in the future based on other functional annotation protocols and tools.


back to top

Reporting Annotation Accuracy

We run in-depth reports on each annotation we produce to get a measure of our annotation accuracy. A host of accuracy statistics is compiled for each prediction that touches an EST; following Guigo's Evaluation of gene structure prediction programs (Genomics, 1996 Jun 15; 34(3):353-67). We calculate specificity, sensitivity, correlation coefficient and simple matching coefficient on the levels of nucleotides, splice sites, introns and exons.


back to top

Gene Numbering

Every annotated gene is given a Locus Number of the form MGG_##### that should be considered the only guaranteed way to identify a gene uniquely. Each locus number is guaranteed to identify a unique gene even over different assemblies. Loci are simply identifiers and are not guaranteed to have any particular order or internal structure.

The mapping of gene accessions from the previous release to the current version is available here.


back to top

Overview of Reference genes

8358 genes (4531 on '+' strand, 3827 on '–')
9217 transcripts (3374 spliced, 5843 unspliced)
14855 exons, 5638 introns

len%cov%gc%at
genic532845312.7853.1546.85
intergenic3636700487.2251.3448.66
exonic481161711.5453.7746.23
intronic4707691.1347.0852.92
coding41299849.9154.7845.22
5′ UTR3183310.7650.2549.75
3′ UTR4805711.1546.4953.51
alt. spliced460670.1150.9049.10
genomic4161795799.8151.5748.43

min

median

mean

n50

max
total length (incl. UTR + introns)225656507343486
coding length224534885672421
exons per transcript111.61213
exons per spliced transcript222.67513
nt per exon43123615022885
nt per intron2387112108978
intergenic nt1211843539506183972
5′ UTR nt117671601212
3′ UTR nt1501022372216

back to top

Overview of Query Genes

11043 genes (5580 on '+' strand, 5463 on '–')
11054 transcripts (8800 spliced, 2254 unspliced)
30705 exons, 19651 introns

len%cov%gc%at
genic1936128246.4355.9844.02
intergenic2233417553.5747.7252.28
exonic1696875040.7057.2342.77
intronic23904215.7347.1252.88
coding1597740838.3257.9242.08
5′ UTR3995840.9647.8952.11
3′ UTR6028201.4545.0254.98
alt. spliced21110.0151.8348.17
genomic4161795799.8151.5748.43

min

median

mean

n50

max
total length (incl. UTR + introns)13215031754207319707
coding length9911941446179119500
exons per transcript122.78318
exons per spliced transcript233.23318
nt per exon3316553105612935
nt per intron4881221221029
intergenic nt1102320283838167583
5′ UTR nt21361692251077
3′ UTR nt21962403231465

back to top

Feature Mapping

Searched for mappings from 11043 query features to 8358 reference features.

Reference genes farther than 200 nt from any query gene: 1452
Containing 1507 transcripts: 1330 unspliced (713 with CDS), 0 with non-canonical splices (0 with CDS), 177 with canonical splices (71 with CDS).
These may indicate missed genes.

379 query genes (3.4%) hit 555 reference genes (6.6%) on the opposing strand. Hits to opposing strands are considered misses in the following comparisions.

Strand mismatches such as these may be orientation problems in the reference set, or incorrect predictions in the query set. Unspliced reference transcripts are difficult to orient correctly. 522 reference gene(s) were unspliced. These may easily have been placed on the wrong strand; queries hitting these genes are likely to be oriented correctly. 33 reference gene(s) had at least one canonically-spliced transcript; everything else being equal, their orientation is more likely to be correct than the queries hitting them.

The table below shows how query genes (rows, roman type) map to reference genes (columns, italic type). We cluster genes into overlapping groups on the same strand and record eight types of relationship. To the left of each arrow is the number of query genes in each type of cluster; the number of reference genes is to the right.

nonereference
one
many
none-0254600
queryone63490365936599952125
many0030151013
counts per cluster type
 
nonereference
one
many
none-030.500
queryone57.5033.143.89.025.4
many000.30.20.10.2
percentages per cluster type

one↔one clear, unambiguous mappings from one query gene to one reference gene. none↔one reference genes that do not map to any query gene (misses). none↔many reference genes that overlap other reference genes but do not map to query genes. None↔X mappings may represent underpredictions. one↔none query genes that do not map to any reference gene. many↔none query genes that overlap other query genes but do not map to reference genes. X↔none mappings may represent overpredictions or deficiencies in the reference gene set. one↔many single query genes overlapped by multiple reference genes (merges). These may occur if the reference set contains partial genes. many↔one single reference genes spanning multiple query genes (splits). These are often annotation errors. many↔many complex clusters where multiple query genes map to multiple reference genes.


back to top

Feature Comparisons

Predicted genes that hit reference genes on the wrong strand, or hit reference genes containing only non–canonically-spliced transcripts, are ignored in the following calculations. When scoring genes with multiple transcripts, we select the single best transcript↔transcript comparison. For clusters in the one↔many and many↔many categories above, each query is compared to as many references as possible. The best non-overlapping comparisons are used to score these loci.

#compTPTNFPFN
nucleotides40542843554853457456429337682
splice sites1431113335-551425

sn

sp

smc

acp

ac

cc
nucleotides0.98950.99880.98960.97570.95150.9510
splice sites0.96910.96030.9318---

#compTPTNFPFN
introns45834154-11313
exons87257906-441

sn

sp

sn_sp

wrong

miss

introns0.96110.90640.93380.00300.0247
exons0.84660.90610.87640.00440.0005

#comp the number of subfeatures compared. TP true positives: nucleotides predicted as exonic in both query and reference; or, splice junctions with exact agreement, in position and type (donor:donor, acceptor:acceptor), between query and reference; or, introns or exons with both splice sites in exact agreement with the reference. TN true negatives: nucleotides predicted as intronic in both query and reference; not defined for splice sites, introns or exons. FP false positives (overpredictions): nucleotides marked as exonic in the query but intronic in the reference; or, splice junctions identified only in the query*; or query introns/exons that do not overlap a reference intron/exon. FN false negatives (underpredictions): nucleotides marked as intronic in the query but exonic in the reference; or, splice junctions identified only in the reference*; or reference introns/exons that do not overlap a query intron/exon. sn sensitivity, TP ÷ (TP+FN). sp specificity, TP ÷ (TP+FP). smc simple matching coefficient, (TP+TN) ÷ (TP+FN+FP+TN). acp average conditional probability. ac approximate correlation. cc correlation coefficient. These last three terms are described in detail in the citation below, and are not well defined for splices, introns or exons. *In the very rare cases where the query predicts a donor site exactly where the reference has an acceptor, or vice versa, we count it as both FP and FN. Terminology is from Burset M and Guigo R, "Evaluation of gene structure prediction programs." Genomics, 1996 Jun 15;34(3):353-67.

The table below briefly summarizes the relative coverage of overlapping query and reference genes. The first column shows the amount of sequence (in nt and percent) in each gene set that is overlapped by sequence in the other gene set. The next three columns show the overhanging sequence from transcripts that overlap but are offset. Overhangs are categorized as 5′ (one transcript begins upstream of the other), 3′ (one transcript extends downstream past the other), and mixed (indeterminate). Finally, the last column, "complete miss", counts the number of nt in queries that did not hit a reference at all, and vice versa.

A high degree of overlap can indicate a good match between the query and reference sets. A high "missed" score may simply indicate a small reference set.

overlapoverhang, 5′overhang, 3′overhang, mixedcomplete miss
query4054284 nt/20.9%2364563 nt/12.2%1214752 nt/6.3%1066835 nt/5.5%10661781 nt/55.1%
reference4054284 nt/76.3%30403 nt/0.6%33953 nt/0.6%26294 nt/0.5%1166359 nt/22.0%

back to top

Splice Analysis

13335 splice agreements, 976 disagreements.
6624 ignored: 216 due to EST misalignment, 6408 due to partial initial/terminal exon coverage.
perfect exon:exon/intron:intron matches: 4709/4154
0 query transcripts contained noncanonical splices.

transcripts with no splice problems430791.8%
   ... with complete reference coverage90119.2%
explainable by alternate splicing1623.5%
   ... with spliced reference1132.4%
clashes2254.8%
   ... with spliced reference1022.2%

Transcripts that have a splice site dis­agree­ment with an over­lap­ping ref­er­ence gene are placed into two cate­gor­ies, de­pend­ing on the se­ver­ity of the clash. If all splice dis­agree­ments could be ex­plain­ed by well-known types of al­ter­nate splic­ing, we call the trans­cript a "pos­sible al­ter­nate splice." If the two trans­cripts cannot be rec­on­ciled in this way, we label the query a "clash." In par­ti­tio­ning splice dis­agree­ments into two cat­e­gor­ies, we are not as­sert­ing that 3.5% of this ge­nome shows al­ter­nate splic­ing. We do this as a form of triage: genes in the "clash" cat­e­go­ry are man­ually in­spect­ed before release.

in ref.in query
cassette exons40
retained introns8812
early 3′ splices1541
late 5′ splices1220

cassette exon an exon that falls completely with an intron of a variant transcript. Such exons may represent alternative splice forms but are more likely instances of exonic over- and under-prediction. retained intron an intron that falls within the exon of a variant transcript. These introns may indicate alternative splicing but usually are over- and under-predicted introns. early 3′ splices two introns agree on their 5′ splice site but differ on the 3′ side, relative to the affected intron. In other words, differing 3′ splice sites lie on the leading edge of an exon. late 5′ splices two introns agree on their 3′ splice site but differ on the 5′ side, again relative to the affected intron. Most ter­mi­nol­ogy is from Mat­lin AJ, et. al. Under­stand­ing al­ter­na­tive splic­ing: to­wards a cel­lular code. Nat Rev Mol Cell Biol. 2005 May;6(5):386-98.


back to top

Possible Problems

multiple sources5538
no evidence feature876
MG6_FGENESH_31067
MG6_FL_EST_GENES93
MG6_GENEID_2589
MG6_GENEMARK1389
MG6_GENEWISE_196
MG6_MANUAL_1953
MG6_SNAP_7453
short protein, < 50 aa15+0401054010← not tallied in problems
shorter protein, < 30 aa0+0---------
very short protein, < 10 aa0+0---------
initial exon ≤ 6 nt18+1444023011
internal exon ≤ 6 nt5+0121000010
terminal exon ≤ 6 nt12+0062020020
≥ 15 exons2+0010000010
intron ≥ 1000 nt1+0000000010
intron ≤ 20 nt16+10000000314
first codon not Met1+0000000100← not tallied in problems
first codon not xTG0+0---------
first codon not known START0+0---------
last codon not known STOP1+1000000020
contains in-frame STOP0+0---------
coding length not modulo 30+1000000010
non-canonical splicing0+0---------
has ≥1 tagged BLAST hit8551+11844558607565250574096764441← not tallied in problems
≤1/3 as long as BLAST tags72+5201877191113
≥3× longer than BLAST tags16+10300110120
contains ≥1 N in exon57+74727546400267
low-quality exonic sequence0+14220234245180255← not tallied in problems
touches gap(s)7+8205013040
spans contigs7+8205013040
within 1 kb of contig edge61+192821904102132← not tallied in problems
any overlap (UTR or CDS)121+067474172227← in 60 clusters
CDS overlap only7+0000100060← in 3 clusters
CDS overlap > 50 nt7+0000100060← in 3 clusters
CDS overlap > 100 nt7+0000100060← in 3 clusters
CDS overlap > 200 nt7+0000100060← in 3 clusters
has predicted UTR3509+442211184223911602333332494← not tallied in problems
UTR ≥ CDS length346+3245517251392294← not tallied in problems
UTR is spliced196+3120713713102207← not tallied in problems
one or more problems319+82100477815173337731

back to top