Gene Finding Methods
Outline
- Overview
- Gene Structure Prediction
- Untranslated Region (UTR) Prediction
- Gene Naming
- Name counts
- Gene Locus Numbers
- Correspondence of version 2 and version 3 gene calls
- Structure Prediction Validation
- Overview of Query Genes
- Overview of Reference genes
- Feature Mapping
- Feature Comparisons
- Splice Analysis
- Possible Problems
Overview
This document describes how we created gene models for the third annotated gene set for Neurospora crassa. Automated gene calls were created in a four-step procedure:
- Gene location and structures were predicted by combining automated gene predictions, BLAST hits, EST alignments and manual annotation. This process is described in section Gene Structure Prediction.
- Gene names were assigned to predicted gene structures based on homology to genes from related species. This process is described in section Gene Naming.
- Names, symbols, ontology terms and other tags were assigned from the ongoing community annotation project. Details of this project may be found here.
- The newly created genes were compared against those in the previous version. This process is described in section Gene Correspondence.
- Both FGENESH and GENEID were run on the entire genomic sequence to provide an initial set of predicted genes. This resulted in a set containing 10,266 FGENESH predictions and 12,549 GENEID predictions.
- Next, GENEWISE was run on each BLAST hit from NR (after removing hits from N. crassa) that aligned to the genome with ≥ 80% identity and ≥ 80% similarity. This created 6,867 GENEWISE predictions. Because BLAST hits often overlap, the GENEWISE predictions fell into just 513 loci.
- 1265 genes were manually annotated from EST alignments and BLAST hits.
- Next, short predictions were filtered out. Any gene prediction less than 30aa (90nt) long was dropped. In addition, any FGENESH or GENEWISE gene prediction less than 50aa (150nt) long was dropped, unless it was overlapped by BLAST evidence, a hmmer hit, or EST evidence. All GENEID predictions, regardless of length, were dropped unless they had overlapping evidence.
-
Wherever remaining predictions had overlapping exons, on identical or opposing strands, we clustered them into separate groups for further filtering. A set of non-overlapping transcripts was chosen from each cluster by ordering the genes from "best" to "worst", picking the "best" gene, then going down the list and adding any lower-ranked transcripts that did not overlap ones already selected. Transcripts were ranked according to the following criteria:
- Intron lengths in fungi are much shorter than those in other eukaryotes. Even after training, FGENESH and GENEID occasionally predict genes that have introns that are much too long for fungi. If a transcript has an intron longer than 1000bp, it will not be chosen if there is a competing gene model with more reasonable intron lengths.
- Where EST evidence is available, ESTs are linked together to form partial splice models. Each transcript is compared to these EST transcripts, and its splice junctions are scored. Each splice junction may be a true positive, a false positive (the prediction has a splice site where the EST evidence does not), or a false negative (the EST evidence is spliced where the prediction is not). Transcripts with fewer false predictions are chosen above those with more. Those with more true positives are chosen over those with less. Splice scores are adjusted to account for alignment errors in ESTs as well as potential instances of alternate splicing.
- A transcript was considered to have "good BLAST coverage" if it overlapped one or more BLAST hits, each with ≥ 50% average identity and ≥ 70% query coverage. BLAST hits from N. crassa were ignored.
- Predictions with good BLAST coverage were chosen above predictions that did not have good BLAST coverage.
- If two predictions both had good BLAST coverage, we computed the "average BLAST length" for each transcript; that is, the average length of the top 1–3 overlapping BLAST hits from different taxa. (When there were more than three, the average was computed from the three with the highest scores.) The prediction closer in length to its average BLAST length was chosen first.
- Otherwise, FGENESH was preferred to GENEWISE, and both were preferred to GENEID.
- After sorting and selecting the highest-ranked predictions, the resultant gene set contained 9,826 genes.
- NAME, or
hypothetical protein similar to NAME, or
conserved hypothetical proteinAssigned to gene predictions where there is strong homology to a known NR protein. The criteria for this category are:
- At least one BLASTX hit to a known NR protein (complexity filtering off, -F F, expect ≤ 1e-5), with
- ≥ 50% identity and ≥ 70% coverage of the protein to the underlying genomic sequence, and
- at least 25% overlap between the protein aligment and the final prediction.
The name will follow one of these three formats, depending on our confidence in the name itself:
- conserved hypothetical protein if the homologous protein NAME contains a word indicating the name has not been verified: {fragment, homolog, hypothetical, like, predicted, probable, putative, related, similar, synthetic, unknown, unnamed}, or
- NAME if the homologous protein is from the curated Swiss-Prot gene set, otherwise
- hypothetical protein similar to NAME
Where there is more than one suitable name for a BLAST hit, we prefer Swiss-Prot names to non-Swiss-Prot names. If there are multiple distinct BLAST hits we choose the one with the highest average identity × the amount of overlap to the target gene.
In all cases we filter out the species name, GIs, parenthetical comments, extra whitespace, etc. from the NR name.
- Hypothetical protein
Assigned to gene predictions that show significant BLASTP homology to a protein in NCBI's protein set NR. The criteria for this category are:
- BLASTP hit to NR (complexity filtering off, -F F, expect ≤ 1e-5)
- Predicted protein
Assigned to gene predictions that do not show significant BLASTP homology to any proteins in NCBI's non-redundant set of proteins (NR) at the time that the complete BLASTP analysis was performed on the gene set.
- 9,565 genes correspond one-to-one between version 2 and version 3. Of these,
- 6,218 were completely unchanged from version 2,
- 1,399 had identical CDS, but the annotation was extended to include UTR,
- 612 had the same start and stop, but at least one splice site changed, and
- 1,336 had a different start and/or stop codon due to a substantial change in the underlying gene model.
- 94 genes were added in loci where there was no previous annotation due to new BLAST and/or EST evidence.
- 966 genes were deleted. Over half of these were short, coding for proteins less than 30aa long. The remainder were removed due to lack of evidence or poor gene models.
- 10 genes in version 2 were merged into 5 larger genes in version 3.
- 79 genes in version 2 were split into 162 smaller genes in version 3.
Gene Structure Prediction
Gene structures were predicted using a combination of manual annotation, FGENESH, GENEID, and GENEWISE. FGENESH is a commercial gene prediction program sold by Softberry, while GENEID, by Enrique Blanco and Roderic Guigó, is available under the GPL. GENEWISE is part of the WISE2 package developed by Ewan Birney and is available from the Sanger Center. We used GENEID 1.2 and GENEWISE 2.2. FGENESH is not versioned.
Both FGENESH and GENEID are classified as ab initio predictors; they predict genes from genomic sequence using statistical models of gene structure. Their models must first be trained against a set of trusted genes to ensure accurate prediction. Softberry trained FGENESH using proprietary methods and generated a parameter file for us. (We used the identical FGENESH parameter file for the version 2 gene set.) GENEID was trained using a set of 515 manual annotations on a previous assembly. The parameter file is available here. (Note, we changed the intron probabilities slightly from the parameters given by this link).
GENEWISE is a homology-based predictor of gene structure. As we ran it, it splices and aligns an input protein sequence to genomic sequence. When the resultant spliced alignment forms a valid gene model, we treat it as such. While GENEWISE does accept some parameters that vary from species to species (most notably for intron nucleotide statistics and splice site consensus sequences) these can be set to generic default values which are non–species-specific. In this default case, GENEWISE essentially produces the best local alignment of a protein assuming that introns tend to start at GT and end at AG. Since we are interested in predicting complete gene structures, we post-process incomplete GENEWISE protein alignments by moving the first and last exon upstream or downstream to the closest possible start and stop codons respectively. If a stop codon was encountered upstream of a gene before a start could be found, the gene call was not used.
Briefly, the gene predictions were combined as follows. The method we used for this new gene set is completely different from that used for gene set version 2.
Untranslated Region (UTR) Prediction
1949 genes in this set have annotated UTR. Of these, 1242 were annotated by hand. The remaining 707 instances of UTR were predicted computationally, as follows. When an EST alignment uniquely aligns with > 95% identity and overlaps a gene prediction, and the region of overlap has 100% coding agreement (e.g., every nucleotide in the region of overlap is coding in both the prediction and the alignment, or is noncoding in both), we consider the prediction and alignment to be compatible. UTR predictions are generated by walking out along compatible EST alignments from the end(s) of the prediction. A chain of one or more overlapping, compatible EST alignments that begin at the 3′ or 5′ end forms each UTR extension. Our UTR predictions are conservative. When an EST aligns to more than one location on the genome, or touches more than one gene prediction (on either strand) it is ignored.
Gene Naming
Genes are assigned names very conservatively. As this is a purely automated gene prediction process, we seek to avoid transferring unverified functional names from genes in other species to our predicted genes. Hits to previous N. crassa gene annotations are not considered in the following assignments.
We hope to improve the gene naming process in the future based on Gene Ontology categories.
There are currently five types of gene name that fall into three categories:
Name counts
| "conserved hypothetical protein" | 2189 |
| "hypothetical protein" | 393 |
| "predicted protein" | 4812 |
| hypothetical protein similar to... | 1354 |
| other non-empty name | 1098 |
Gene Locus Numbers
Every annotated gene is assigned a Locus Number of the form NCU#####. Locus numbers are the only guaranteed way to uniquely identify a gene. When the same gene exists in different assemblies and annotations, it will have the same locus number. Loci are simply identifiers; they do not have any internal structure. In particular they are not ordered in any way. We believe it is unreliable to encode attributes of an object, such as position, in its stable identifier. Instead, position and other attributes of a gene can be retrieved once you know its locus.
With each new gene set, we map all genes from the previous set and thus preserve loci whenever possible. Any loci that cannot be mapped 1:1 (due to deletions, splits or merges) will be retired. New genes that do not have a 1:1 mapping to a previous gene receive new locus identifiers. In addition to its locus, each gene has a version number (so loci are in fact displayed as NCU#####.#). When genes are mapped from one assembly to another or when we publish a new set of gene calls, we will increment this version. To ensure consistency, all loci in a particular gene set have the same version number, even if some have not changed from the previous set.
Both annotations and assemblies are versioned. Loci versions refer to the version of the annotation, not the assembly release number. This is annotation version 3 for N. crassa. Like the previous annotation (version 2) it was generated from assembly release 7. Version 1 of the annotation was built on assembly release 3. (Assembly releases 4–6 were not annotated.)
A locus number without an explicitly specified version corresponds to the most recent version of the annotation. For example, NCU03456.1 refers to locus NCU03456, annotation 1 (as annotated on assembly release 3). NCU03456.3 and NCU03456 both refer to the same locus as annotated in version 3 (on assembly release 7).
Correspondence of version 2 and version 3 gene calls
To compare version 2 and version 3 gene calls, 10,620 gene calls from version 2 were mapped to the 9,826 version 3 annotations.
Annotations added by the Community Annotation Project were transferred automatically to gene predictions in version 3 that were unambiguously equivalent to the same prediction in version 2 (e.g. same locus with altered UTR, splicing, etc.). Version 3 predictions that are ambiguous (merges, splits, etc.) are being manually processed to ensure that the annotation is moved to the correct new gene.
Five genes which have been characterized for phenotype have been altered (deleted, merged, or split) in version 3. Transfer of the phenotype information will be reviewed manually to ensure that, where possible, the correct gene from version 3 will receive the appropriate phenotype characterizations.
Structure Prediction Validation
To evaluate the accuracy of our gene predictions, we created a set of reference gene models exclusively from EST data. We then compared the two gene sets using a variety of metrics. In the tables below, we refer to the final, published gene set as the query and the EST-based gene set as the reference.
The Feature comparisons and Splice analysis sections only report on the subset of query genes that overlap reference genes. Although a substantial number of predicted genes overlap EST alignments, the majority do not. Because we use EST data to improve our gene calls, we expect lower accuracy in regions that lack supporting EST evidence, on the order of 5–10%. Therefore, while they are a useful measure of gene prediction accuracy, the numbers reported in those two sections do not apply evenly to all predicted genes.
Overview of Query Genes
9826 genes:
9846 transcripts
(8055 spliced, 1791 unspliced;
4869/4977 +/-)
27208 exons, 17362 introns
| len | %cov | %gc | %at | |
|---|---|---|---|---|
| genic | 17382499 | 44.31 | 54.67 | 45.33 |
| intergenic | 21843336 | 55.69 | 45.00 | 54.99 |
| exonic | 15021799 | 38.30 | 55.72 | 44.28 |
| intronic | 2364017 | 6.03 | 47.98 | 52.01 |
| coding | 14355925 | 36.60 | 56.11 | 43.89 |
| 5′ UTR | 212845 | 0.54 | 49.92 | 50.08 |
| 3′ UTR | 456039 | 1.16 | 46.05 | 53.95 |
| alt. spliced | 3317 | 0.01 | 50.08 | 49.92 |
| genomic | 39225835 | 100.00 | 49.29 | 50.71 |
min | median | mean | max | |
| overall length (incl. UTR) | 90 | 1299 | 1528 | 32463 |
| coding length | 90 | 1214 | 1459 | 32463 |
| exons per transcript | 1 | 2 | 2.76 | 24 |
| exons per spliced transcript | 2 | 3 | 3.16 | 24 |
| bp per exon | 1 | 263 | 553 | 15030 |
| bp per intron | 32 | 84 | 136 | 1388 |
| 5′ UTR bp | 1 | 121 | 161 | 986 |
| 3′ UTR bp | 1 | 246 | 276 | 1423 |
Overview of Reference genes
3197 genes:
3353 transcripts
(1319 spliced, 2034 unspliced;
1717/1636 +/-)
5580 exons, 2227 introns
| len | %cov | %gc | %at | |
|---|---|---|---|---|
| genic | 2310683 | 5.89 | 53.25 | 46.75 |
| intergenic | 36915152 | 94.11 | 49.04 | 50.96 |
| exonic | 2085241 | 5.32 | 53.68 | 46.32 |
| intronic | 235246 | 0.60 | 49.41 | 50.57 |
| coding | 1784385 | 4.55 | 54.85 | 45.15 |
| 5′ UTR | 42299 | 0.11 | 48.71 | 51.29 |
| 3′ UTR | 268953 | 0.69 | 46.44 | 53.56 |
| alt. spliced | 9804 | 0.02 | 51.31 | 48.69 |
| genomic | 39225835 | 100.00 | 49.29 | 50.71 |
min | median | mean | max | |
| overall length (incl. UTR) | 45 | 560 | 657 | 5100 |
| coding length | 45 | 490 | 559 | 4488 |
| exons per transcript | 1 | 1 | 1.66 | 10 |
| exons per spliced transcript | 2 | 2 | 2.69 | 10 |
| bp per exon | 4 | 365 | 395 | 5100 |
| bp per intron | 21 | 76 | 114 | 1740 |
| 5′ UTR bp | 1 | 107 | 124 | 598 |
| 3′ UTR bp | 1 | 166 | 187 | 954 |
Feature Mapping
Searched for mappings from 9826 query features to 3197 reference features.
Reference genes farther than 200bp from any query gene: 230
Containing 230 transcripts:
205 unspliced,
25 with canonical splices.
These may indicate missed genes.
51 query genes (0.5%) hit 44 reference genes (1.4%) on the opposing strand. Hits to opposing strands are considered misses in the following comparisions.
Strand mismatches such as these may be orientation problems in the reference set, or incorrect predictions in the query set. Unspliced reference transcripts are difficult to orient correctly. 32 reference gene(s) did not contain a single canonical splice site. These may easily have been placed on the wrong strand; queries hitting these genes are likely to be oriented correctly. 12 reference gene(s) had at least one canonical splice site; everything else being equal, their orientation is more likely to be correct than the queries hitting them.
The table below shows how query genes (rows, roman type) map to reference genes (columns, italic type). We cluster genes into overlapping groups on the same strand and record eight types of relationship. To the left of each arrow is the number of query genes in each type of cluster; the number of reference genes is to the right.
| none | reference one | many | ||||||||
|---|---|---|---|---|---|---|---|---|---|---|
| none | - | 0 | ↔ | 450 | 0 | ↔ | 0 | |||
| query | one | 7763 | ↔ | 0 | 1402 | ↔ | 1402 | 622 | ↔ | 1315 |
| many | 2 | ↔ | 0 | 23 | ↔ | 11 | 14 | ↔ | 19 | |
| counts per cluster type | ||||||||||
| none | reference one | many | ||||||||
| none | - | 0 | ↔ | 14.1 | 0 | ↔ | 0 | |||
| query | one | 79.0 | ↔ | 0 | 14.3 | ↔ | 43.9 | 6.3 | ↔ | 41.1 |
| many | 0.0 | ↔ | 0 | 0.2 | ↔ | 0.3 | 0.1 | ↔ | 0.6 | |
| percentages per cluster type | ||||||||||
one↔one clear, unambiguous mappings from one query gene to one reference gene. none↔one reference genes that do not map to any query gene (misses). none↔many reference genes that overlap other reference genes but do not map to query genes. None↔X mappings may represent underpredictions. one↔none query genes that do not map to any reference gene. many↔none query genes that overlap other query genes but do not map to reference genes. X↔none mappings may represent overpredictions or deficiencies in the reference gene set. one↔many single query genes overlapped by multiple reference genes (merges). These may occur if the reference set contains partial genes. many↔one single reference genes spanning multiple query genes (splits). These are often annotation errors. many↔many complex clusters where multiple query genes map to multiple reference genes.
Feature Comparisons
Predicted genes that hit reference genes on the wrong strand, or hit reference genes containing only non–canonically-spliced transcripts, are ignored in the following calculations. When scoring genes with multiple transcripts, we select the single best transcript↔transcript comparison. For clusters in the one↔many and many↔many categories above, each query is compared to as many references as possible. The best non-overlapping comparisons are used to score these loci.
| #comp | TP | TN | FP | FN | ||
|---|---|---|---|---|---|---|
| nucleotides | 2015863 | 1766286 | 215385 | 2405 | 31787 | |
| splice sites | 6239 | 5595 | - | 401 | 243 | |
sn | sp | smc | acp | ac | cc | |
| nucleotides | 0.9823 | 0.9986 | 0.9830 | 0.9603 | 0.9207 | 0.9193 |
| splice sites | 0.9584 | 0.9331 | 0.8968 | - | - | - |
| #comp | TP | TN | FP | FN | ||
|---|---|---|---|---|---|---|
| introns | 2153 | 1870 | - | 111 | 11 | |
| exons | 3978 | 3417 | - | 3 | 24 | |
sn | sp | sn_sp | wrong | miss | ||
| introns | 0.9502 | 0.8686 | 0.9094 | 0.0056 | 0.0516 | |
| exons | 0.8290 | 0.8590 | 0.8440 | 0.0058 | 0.0008 |
#comp the number of subfeatures compared.
TP true positives: nucleotides predicted as exonic in both
query and reference; or, splice junctions with exact agreement, in
position and type (donor:donor, acceptor:acceptor), between query and
reference; or, introns or exons with both splice sites in exact
agreement with the reference.
TN true negatives: nucleotides predicted as intronic in
both query and reference; not defined for splice sites, introns or
exons.
FP false positives (overpredictions): nucleotides marked
as exonic in the query but intronic in the reference; or, splice
junctions identified only in the query*; or query introns/exons that
do not overlap a reference intron/exon.
FN false negatives (underpredictions): nucleotides marked as
intronic in the query but exonic in the reference; or, splice
junctions identified only in the reference*; or reference
introns/exons that do not overlap a query intron/exon.
sn sensitivity,
The table below briefly summarizes the relative coverage of
overlapping query and reference genes. The first column shows the
amount of sequence (in bp and percent) in each gene set that is
overlapped by sequence in the other gene set. The next three columns
show the overhanging sequence from transcripts that overlap but are
offset. Overhangs are categorized as 5′ (one transcript begins
upstream of the other), 3′ (one transcript extends downstream
past the other), and mixed (indeterminate). Finally, the last column,
"complete miss", counts the number of bp in queries that did not
hit a reference at all, and vice versa.
A high degree of overlap can indicate a good match between the query
and reference sets. A high "missed" score may simply indicate a small
reference set.
| overlap | overhang, 5′ | overhang, 3′ | overhang, mixed | complete miss | |||||||||||
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| query | 2015863 bp | / | 11.6% | 784908 bp | / | 4.5% | 528976 bp | / | 3.0% | 763908 bp | / | 4.4% | 13292184 bp | / | 76.5% |
| reference | 2015863 bp | / | 87.0% | 32040 bp | / | 1.4% | 29623 bp | / | 1.3% | 19653 bp | / | 0.8% | 219600 bp | / | 9.5% |
Splice Analysis
5595 splice agreements, 644 disagreements.
3621 ignored:
42 due to EST misalignment,
3579 due to partial initial/terminal exon coverage.
perfect exon:exon/intron:intron matches: 1714/1870
0 query transcripts contained noncanonical splices.
| transcripts with no splice problems | 1813 | 88.0% |
| ... with complete reference coverage | 255 | 12.4% |
| explainable by alternate splicing | 105 | 5.1% |
| clashes | 142 | 6.9% |
Transcripts that have a splice site disagreement with an overlapping reference gene are placed into two categories, depending on the severity of the clash. If all splice disagreements could be explained by well-known types of alternate splicing, we call the transcript a "possible alternate splice." If the two transcripts cannot be reconciled in this way, we label the query a "clash." In partitioning splice disagreements into two categories, we are not asserting that 5.1% of this genome shows alternate splicing. We do this as a form of triage: genes in the "clash" category are manually inspected before release.
| in ref. | in query | ||
|---|---|---|---|
| cassette exons | ![]() | 6 | 0 |
| retained introns | ![]() | 9 | 80 |
| early 3′ splices | ![]() | 6 | 17 |
| late 5′ splices | ![]() | 5 | 8 |
cassette exon an exon that falls completely with an intron of a variant transcript. Such exons may represent alternative splice forms but are more likely instances of exonic over- and under-prediction. retained intron an intron that falls within the exon of a variant transcript. These introns may indicate alternative splicing but usually are over- and under-predicted introns. early 3′ splices two introns agree on their 5′ splice site but differ on the 3′ side, relative to the affected intron. In other words, differing 3′ splice sites lie on the leading edge of an exon. late 5′ splices two introns agree on their 3′ splice site but differ on the 5′ side, again relative to the affected intron. Most terminology is from Matlin AJ, et. al. Understanding alternative splicing: towards a cellular code. Nat Rev Mol Cell Biol. 2005 May;6(5):386-98.
Possible Problems
| {multiple sources} | 1947 | |||||||
|---|---|---|---|---|---|---|---|---|
| NC7_GENEWISE_32 | 38 | |||||||
| NC7_FGENESH_1 | 5346 | |||||||
| NC7_GENEID_2 | 1243 | |||||||
| NC7_MANUAL_1 | 1272 | |||||||
| short proteins < 50aa | 19 | 1 | 0 | 0 | 18 | 0 | ← not tallied in problems | |
| shorter proteins < 30aa | 0 | - | - | - | - | - | ||
| very short proteins < 10aa | 0 | - | - | - | - | - | ||
| initial exon ≤ 6bp | 145 | 19 | 0 | 106 | 20 | 0 | ||
| internal exon ≤ 6bp | 47 | 0 | 0 | 46 | 0 | 1 | ||
| terminal exon ≤ 6bp | 57 | 2 | 0 | 34 | 21 | 0 | ||
| ≥ 15 exons | 5 | 0 | 0 | 5 | 0 | 0 | ||
| intron ≥ 1000bp | 8 | 0 | 0 | 7 | 1 | 0 | ||
| intron ≤ 20bp | 0 | - | - | - | - | - | ||
| first codon not Met | 27 | 11 | 11 | 2 | 2 | 1 | ← not tallied in problems | |
| first codon not xTG | 5 | 0 | 0 | 2 | 2 | 1 | ||
| first codon not known START | 5 | 0 | 0 | 2 | 2 | 1 | ||
| last codon not STOP | 4 | 1 | 0 | 1 | 2 | 0 | ||
| contains in-frame STOP | 1 | 0 | 0 | 0 | 0 | 1 | ||
| coding length not modulo 3 | 1 | 0 | 0 | 0 | 0 | 1 | ||
| non-canonical splicing | 2 | 0 | 0 | 0 | 0 | 2 | ||
| has ≥1 good BLAST hit | 4602 | 1637 | 38 | 970 | 1105 | 852 | ← not tallied in problems | |
| ≤1/3 as long as BLAST hit | 51 | 10 | 0 | 10 | 25 | 6 | ||
| ≥3× longer than BLAST hit | 69 | 22 | 2 | 12 | 28 | 5 | ||
| contains ≥1 N in exon | 4 | 2 | 0 | 1 | 0 | 1 | ||
| contains low-scoring sequence | 0 | - | - | - | - | - | ||
| touches gap(s) | 0 | - | - | - | - | - | ||
| spans contigs | 0 | - | - | - | - | - | ||
| within 1kb of contig edge | 10 | 2 | 0 | 4 | 3 | 1 | ← not tallied in problems | |
| any overlap (UTR or CDS) | 32 | 3 | 0 | 16 | 9 | 36 | ← in 32 clusters | |
| CDS overlap only | 0 | - | - | - | - | - | ||
| CDS overlap > 50bp | 0 | - | - | - | - | - | ||
| CDS overlap > 100bp | 0 | - | - | - | - | - | ||
| CDS overlap > 200bp | 0 | - | - | - | - | - | ||
| has predicted UTR | 1949 | 274 | 2 | 262 | 169 | 1242 | ← not tallied in problems | |
| UTR ≥ 50% length | 216 | 12 | 0 | 13 | 23 | 168 | ← not tallied in problems | |
| UTR is spliced | 157 | 18 | 0 | 9 | 12 | 118 | ← not tallied in problems | |
| one or more problems | 442 | 58 | 2 | 226 | 105 | 51 | ||




