Methanosarcina acetivorans Multigene Family Analysis
See Genes by Multigene Family for a listing of all multigene families.
Methods
We identified multigene families in the Methanosarcina acetivorans genome by running blastp on the entire proteome. To filter out hits that reflect only a shared protein domain, we classified as paralogs only those genes with alignments (E<=1e-5, score>10) that covered more than 60% of the longer gene.* Genes were grouped into multigene families by single linkage. To measure the similarity between genes within a family, we calculated the Average Percent Identity over all blastp alignments between genes within the family. Additionally, we computed the family Completeness, defined as the ratio (observed # hits)/(total possible # hits) between genes in the family.
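The filtering, single-linkage grouping, and per-family statistics described above can be sketched as follows. This is a minimal illustration, not the pipeline actually used: the hit fields (`query`, `subject`, `evalue`, `score`, `aln_len`, `pident`) and the toy data are hypothetical, coverage is assumed to be alignment length divided by the length of the longer gene, and "total possible # hits" is assumed to mean the number of unordered gene pairs in the family.

```python
from collections import defaultdict

def is_paralog_hit(hit, lengths, max_e=1e-5, min_score=10, min_cov=0.60):
    """Apply the thresholds from the text to one blastp hit (a dict)."""
    if hit["query"] == hit["subject"]:
        return False  # ignore self-hits
    longer = max(lengths[hit["query"]], lengths[hit["subject"]])
    coverage = hit["aln_len"] / longer  # assumed definition of coverage
    return hit["evalue"] <= max_e and hit["score"] > min_score and coverage > min_cov

def single_linkage(hits, lengths):
    """Group genes into families: any passing hit merges two genes' families."""
    parent = {}

    def find(gene):
        parent.setdefault(gene, gene)
        while parent[gene] != gene:
            parent[gene] = parent[parent[gene]]  # path halving
            gene = parent[gene]
        return gene

    for h in hits:
        if is_paralog_hit(h, lengths):
            parent[find(h["query"])] = find(h["subject"])

    families = defaultdict(set)
    for gene in list(parent):
        families[find(gene)].add(gene)
    return [sorted(members) for members in families.values()]

def family_stats(family, hits, lengths):
    """Return (average percent identity, completeness) for one family."""
    members = set(family)
    best = {}  # unordered gene pair -> highest percent identity seen
    for h in hits:
        if h["query"] in members and h["subject"] in members and is_paralog_hit(h, lengths):
            pair = frozenset((h["query"], h["subject"]))
            best[pair] = max(best.get(pair, 0.0), h["pident"])
    possible = len(family) * (len(family) - 1) // 2  # assumed: unordered pairs
    return sum(best.values()) / len(best), len(best) / possible

# Toy data (made up): four genes of 100 residues each.
lengths = {"a": 100, "b": 100, "c": 100, "d": 100}
hits = [
    {"query": "a", "subject": "b", "evalue": 1e-20, "score": 150, "aln_len": 80, "pident": 70.0},
    {"query": "b", "subject": "c", "evalue": 1e-10, "score": 90, "aln_len": 70, "pident": 50.0},
    # Covers only 30% of the longer gene, so it is treated as a domain-only hit.
    {"query": "a", "subject": "d", "evalue": 1e-8, "score": 40, "aln_len": 30, "pident": 90.0},
]
families = single_linkage(hits, lengths)
avg_identity, completeness = family_stats(families[0], hits, lengths)
```

In the toy data, genes a and c end up in the same family even though they never hit each other directly; this is the defining behavior of single linkage, and it is why the Completeness ratio can fall below 1.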
We performed the same analysis on 14 other archaeal and bacterial species. In order to compare families across genomes, we ran blastp to identify hits between each of these species and Methanosarcina acetivorans (E<=1e-5, score>10, hit length>=60%). We correlated inter-genome families based on the best bi-directional blastp alignment for any of the genes within the Methanosarcina acetivorans family. The 'corresponding' family contains all the orthologs to genes in the Methanosarcina acetivorans family, plus all paralogs to these orthologs.
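The inter-genome step can be sketched as a reciprocal-best-hit search followed by expansion to the 'corresponding' family. The sketch assumes that hits have already passed the thresholds above and are given as hypothetical `(query, subject, score)` tuples, and that "best" means highest score; gene names and scores are made up.

```python
def best_hits(hits):
    """Map each query gene to its highest-scoring subject gene."""
    best = {}
    for query, subject, score in hits:
        if query not in best or score > best[query][1]:
            best[query] = (subject, score)
    return {query: subject for query, (subject, _) in best.items()}

def reciprocal_best_hits(a_vs_b, b_vs_a):
    """Keep pairs (a, b) where b is a's best hit and a is b's best hit."""
    forward, reverse = best_hits(a_vs_b), best_hits(b_vs_a)
    return {a: b for a, b in forward.items() if reverse.get(b) == a}

def corresponding_family(ma_family, orthologs, other_families):
    """Orthologs of the family's genes, plus all paralogs of those orthologs."""
    found = {orthologs[g] for g in ma_family if g in orthologs}
    result = set(found)
    for family in other_families:
        if found & set(family):
            result |= set(family)
    return result

# Toy data (made up): two M. acetivorans genes against three genes of another genome.
ma_vs_other = [("ma1", "x1", 200), ("ma1", "x2", 120), ("ma2", "x2", 50)]
other_vs_ma = [("x1", "ma1", 198), ("x2", "ma1", 118), ("x3", "ma2", 40)]
orthologs = reciprocal_best_hits(ma_vs_other, other_vs_ma)
other_families = [["x1", "x3"], ["x2"]]  # paralog families in the other genome
corresponding = corresponding_family(["ma1", "ma2"], orthologs, other_families)
```

Here only ma1 and x1 are each other's best hit, so the corresponding family is x1 together with its paralog x3.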
Families were named manually by cursory inspection of the family members. Work remains to refine these labels.
The 14 complete genomes used for comparison were:
| Archaea (Euryarchaeota) | |
| Archaeoglobus fulgidus | 2.2Mb |
| Halobacterium sp | 2.0Mb |
| Methanobacterium thermoautotrophicum | 1.8Mb |
| Pyrococcus abyssi | 1.8Mb |
| Methanococcus jannaschii | 1.7Mb |
| Pyrococcus horikoshii | 1.7Mb |
| Archaea (Crenarchaeota) | |
| Sulfolobus solfataricus | 3.0Mb |
| Bacteria | |
| Mesorhizobium loti | 7.0Mb |
| Pseudomonas aeruginosa | 6.3Mb |
| Escherichia coli K12 | 4.6Mb |
| Bacillus subtilis | 4.2Mb |
| Vibrio cholerae | 4.0Mb |
| Synechocystis PCC6803 | 3.6Mb |
| Thermotoga maritima | 1.9Mb |
Results
48% of Methanosarcina acetivorans (MA) genes are paralogs, and they cluster into 539 families (ranging in size from 2 to 85 genes per family). MA has significantly more families, and larger clusters, than the other completely sequenced archaea.
Genome Size
Both the percentage of paralogous genes and the number of clusters show a strong linear correlation with genome size, over roughly the following range:
| Genome Size | % Paralogs | # Clusters |
|---|---|---|
| 1.7 Mb | 25% | 180 |
| 7.0 Mb | 50% | 650 |
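As a rough illustration of what this linear trend implies, the sketch below interpolates between the two rows of the table. It assumes those rows are the endpoints of a straight line, which is a simplification; the fit to the 15 genomes is not reproduced here, and the function name is hypothetical.

```python
def interpolate_trend(size_mb, low=(1.7, 25.0, 180.0), high=(7.0, 50.0, 650.0)):
    """Linearly interpolate (% paralogs, # clusters) for a genome size in Mb.

    `low` and `high` are (genome size, % paralogs, # clusters) taken from the
    two table rows; treating them as line endpoints is an assumption.
    """
    t = (size_mb - low[0]) / (high[0] - low[0])
    pct_paralogs = low[1] + t * (high[1] - low[1])
    n_clusters = low[2] + t * (high[2] - low[2])
    return pct_paralogs, n_clusters

# Midpoint of the size range, as an example.
mid_pct, mid_clusters = interpolate_trend(4.35)
```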
Clusters Across Genomes
| | All species | Archaea |
|---|---|---|
| Genes with ortholog | 44% | 36% |
| Clusters with correlated cluster | 72% | 65% |
| Clusters that correlate with > 1 cluster | 15% | 10% |
| Clusters that correlate to larger clusters | 31% | 21% |
| Non-Paralogous genes with correlated clusters | 43% | 37% |
Normalized Cluster Size Across Genome
Because of the linear relationship between genome size and cluster size, we have normalized the cluster sizes by genome size in order to highlight families with unusually large clusters. The graphs below illustrate the inter-genome comparison of cluster sizes after the cluster sizes have been normalized.
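One simple way to carry out such a normalization is to divide each cluster size by the size of the genome it comes from. The exact formula used for the graphs is not stated here, so the per-Mb scaling below is an assumption; the genome sizes come from the comparison table, and the raw cluster sizes are made up.

```python
def normalize_cluster_sizes(raw_sizes, genome_size_mb):
    """Express each species' cluster size as genes per Mb of genome.

    Dividing by genome size in Mb is an assumed normalization, chosen only
    to illustrate the idea.
    """
    return {species: raw_sizes[species] / genome_size_mb[species]
            for species in raw_sizes}

# Genome sizes from the comparison table; raw cluster sizes are made up.
genome_size_mb = {"Escherichia coli K12": 4.6, "Methanococcus jannaschii": 1.7}
raw_sizes = {"Escherichia coli K12": 23, "Methanococcus jannaschii": 17}
normalized = normalize_cluster_sizes(raw_sizes, genome_size_mb)
```

With these toy numbers the smaller genome has the smaller raw cluster but the larger normalized one, which is the kind of unusually large family the normalization is meant to expose.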
Data
Excel file paralogs.xls contains:
| SpreadSheet | Description |
|---|---|
| SummaryWks | Summary comparison of Methanosarcina versus other species |
| MAparalogsWks | List of all paralogous Methanosarcina clusters and the names of all clustered genes |
| ClusterCorrelationWks | Correlation of all Methanosarcina clusters with clusters in each of the other species. Includes both raw cluster sizes and cluster sizes normalized by genome size |

| Graph | Description |
|---|---|
| GenomeSize2ClusterSize | Graph of Genome Size vs Cluster Size |
| GenomeSize2#Clusters | Graph of Genome Size vs Number of Clusters |
| GenomeSize2%Paralogs | Graph of Genome Size vs % Paralogous Genes |
| GenomeSize2Orfs | Graph of Genome Size vs # Orfs |
| ClusterSizeCorr(>=10) | Graph of correlated clusters, for all clusters >= 10 genes |
| ArchaeaClusterSizeCorr(>=10) | Graph of archaeal correlated clusters, for all clusters >= 10 genes |
| NormalizedClusterSize(>=10) | Graph of correlated clusters with normalized cluster sizes, for all clusters >= 10 genes |
| A_NormalizedClusterSize(>=10) | Graph of archaeal correlated clusters with normalized cluster sizes, for all clusters >= 10 genes |
References
* Similar to methods used for Vibrio cholerae (BLASTX E<=1e-5, >60% length of query ORF) [1], Thermotoga maritima (BLASTX E<=1e-5, >60% length of query ORF) [3], Helicobacter pylori (FASTA, >60% length of smaller ORF) [4], Archaeoglobus fulgidus (FASTA, >60% length of smaller ORF) [2].
[1] Heidelberg, John F. et al. DNA sequence of both chromosomes of the cholera pathogen Vibrio cholerae. Nature 406, 477-483 (2000).
[2] Klenk, Hans-Peter et al. The complete genome sequence of the hyperthermophilic, sulphate-reducing archaeon Archaeoglobus fulgidus. Nature 390, 364-370 (1997).
[3] Nelson, Karen E. et al. Evidence for lateral gene transfer between Archaea and Bacteria from genome sequence of Thermotoga maritima. Nature 399, 323-329 (1999).
[4] Tomb, J.-F. et al. The complete genome sequence of the gastric pathogen Helicobacter pylori. Nature 388, 539-547 (1997).
