Methanosarcina acetivorans Multigene Family Analysis

See Genes by Multigene Family for a listing of all multigene families.

Multigene Family Analysis:

Methods

We identified multigene families in the Methanosarcina acetivorans genome by running blastp on the entire proteome against itself. To exclude hits that reflect only a shared protein domain, we classified as paralogs only those genes with alignments (E<=1e-5, score>10) that covered more than 60% of the longer gene. *
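The paralog criteria above can be sketched as a simple filter over blastp hits. This is a minimal illustration, not the pipeline actually used: the `Hit` record, the `is_paralog_hit` function, and the choice to measure coverage as aligned residues divided by the length of the longer protein are all assumptions, and the example lengths and hits are made up.

```python
from typing import NamedTuple


class Hit(NamedTuple):
    query: str
    subject: str
    evalue: float
    score: float
    align_len: int  # aligned residues (assumed definition of "covered")


def is_paralog_hit(hit, lengths, max_evalue=1e-5, min_score=10.0, min_cover=0.60):
    """Return True if a hit passes the E-value, score, and coverage thresholds."""
    if hit.query == hit.subject:
        return False  # ignore self-hits
    longer = max(lengths[hit.query], lengths[hit.subject])
    coverage = hit.align_len / longer
    return hit.evalue <= max_evalue and hit.score > min_score and coverage > min_cover


lengths = {"g1": 300, "g2": 200, "g3": 100}  # made-up protein lengths
hits = [
    Hit("g1", "g2", 1e-30, 250.0, 190),  # covers 63% of the longer gene
    Hit("g1", "g3", 1e-20, 80.0, 95),    # strong hit, but only 32% coverage
    Hit("g2", "g3", 1e-3, 40.0, 90),     # fails the E-value threshold
]
paralog_pairs = [(h.query, h.subject) for h in hits if is_paralog_hit(h, lengths)]
```

The second hit shows the purpose of the coverage rule: a short protein matching one domain of a longer one is rejected even though its E-value is significant.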

Genes were grouped into multigene families by single linkage. In order to measure the similarity between genes within a family, we calculated the Average Percent Identity for all blastp alignments between genes within the family. Additionally we computed the family Completeness ratio (observed # hits)/(total possible # hits) between genes in the family.
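As an illustration of the grouping step, single linkage can be implemented with a union-find structure: two genes land in the same family whenever a chain of paralog hits connects them. The sketch below also computes the two family statistics. The function names and input layout are hypothetical, and it assumes that "total possible # hits" means the number of unordered gene pairs, n(n-1)/2; if hits were counted in both directions the ratio would be computed over n(n-1) instead.

```python
from collections import defaultdict
from itertools import combinations


def single_linkage(pairs):
    """Group genes so that any two genes joined by a chain of hits share a family."""
    parent = {}

    def find(g):
        parent.setdefault(g, g)
        while parent[g] != g:
            parent[g] = parent[parent[g]]  # path halving
            g = parent[g]
        return g

    for a, b in pairs:
        ra, rb = find(a), find(b)
        if ra != rb:
            parent[ra] = rb

    families = defaultdict(set)
    for g in list(parent):
        families[find(g)].add(g)
    # Largest families first, members sorted for stable output.
    return sorted((sorted(f) for f in families.values()), key=lambda f: (-len(f), f))


def family_stats(family, identity):
    """Return (average percent identity, completeness) for one family."""
    possible = len(family) * (len(family) - 1) // 2
    observed = [identity[frozenset(p)] for p in combinations(family, 2)
                if frozenset(p) in identity]
    return sum(observed) / len(observed), len(observed) / possible


# Made-up percent identities for hits that passed the paralog filter.
identity = {
    frozenset(("a", "b")): 80.0,
    frozenset(("b", "c")): 40.0,
    frozenset(("d", "e")): 90.0,
}
families = single_linkage(tuple(sorted(pair)) for pair in identity)
avg_identity, completeness = family_stats(families[0], identity)
```

In this toy input, genes a and c have no direct hit but are still grouped through b, which is why the completeness of that family is below 1.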

We performed the same analysis on 14 other archaeal and bacterial species. In order to compare families across genomes, we ran blastp to identify hits between each of these species and Methanosarcina acetivorans (E<=1e-5, score>10, hit length>=60%). We correlated inter-genome families based on the best bi-directional blastp alignment for any of the genes within the Methanosarcina acetivorans family. The 'corresponding' family contains all the orthologs to genes in the Methanosarcina acetivorans family, plus all paralogs to these orthologs.
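The cross-genome step can be sketched as follows, assuming that hits have already been filtered by the thresholds above and that "best" means highest score. All function names are hypothetical, and the gene identifiers and scores in the example are invented.

```python
def best_hits(hits):
    """Map each query gene to its highest-scoring subject.

    hits is an iterable of (query, subject, score) tuples.
    """
    best = {}
    for query, subject, score in hits:
        if query not in best or score > best[query][1]:
            best[query] = (subject, score)
    return {q: s for q, (s, _) in best.items()}


def reciprocal_best_hits(a_vs_b, b_vs_a):
    """Keep pairs (a, b) where a's best hit is b and b's best hit is a."""
    fwd = best_hits(a_vs_b)
    rev = best_hits(b_vs_a)
    return {a: b for a, b in fwd.items() if rev.get(b) == a}


def corresponding_family(ma_family, orthologs, other_family_of):
    """Collect the orthologs of an MA family plus all paralogs of those orthologs."""
    members = set()
    for gene in ma_family:
        ortholog = orthologs.get(gene)
        if ortholog is not None:
            # A gene with no paralogs counts as a family of one.
            members |= other_family_of.get(ortholog, {ortholog})
    return members


ma_vs_other = [("ma1", "x1", 200), ("ma1", "x2", 150), ("ma2", "x2", 120)]
other_vs_ma = [("x1", "ma1", 200), ("x2", "ma1", 150), ("x2", "ma2", 120)]
orthologs = reciprocal_best_hits(ma_vs_other, other_vs_ma)

other_family_of = {"x1": {"x1", "x2"}, "x2": {"x1", "x2"}}
corresponding = corresponding_family(["ma1", "ma2"], orthologs, other_family_of)
```

Here only ma1 and x1 are bi-directional best hits, but because x2 is a paralog of x1 it is still pulled into the corresponding family.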

Families were named manually after a cursory inspection of the family members. Work remains to refine these labels.

The 14 comparison complete genomes were:

Archaea (Euryarchaeota)
    Archaeoglobus fulgidus                  2.2 Mb
    Halobacterium sp                        2.0 Mb
    Methanobacterium thermoautotrophicum    1.8 Mb
    Pyrococcus abyssi                       1.8 Mb
    Methanococcus jannaschii                1.7 Mb
    Pyrococcus horikoshii                   1.7 Mb
Archaea (Crenarchaeota)
    Sulfolobus solfataricus                 3.0 Mb
Bacteria
    Mesorhizobium loti                      7.0 Mb
    Pseudomonas aeruginosa                  6.3 Mb
    Escherichia coli K12                    4.6 Mb
    Bacillus subtilis                       4.2 Mb
    Vibrio cholerae                         4.0 Mb
    Synechocystis PCC6803                   3.6 Mb
    Thermotoga maritima                     1.9 Mb

Results

48% of Methanosarcina acetivorans (MA) genes are paralogs; they cluster into 539 families, ranging in size from 2 to 85 genes per family.

There are significantly more families, and larger clusters, in MA than in the other completely sequenced archaea.

Genome Size

    Both the % paralogous genes and the number of clusters display a strong linear correlation with genome size, within the rough range:

    Genome Size    % Paralogs    # Clusters
    1.7 Mb         25%           180
    7.0 Mb         50%           650
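As a rough worked example, the two endpoints in the table imply slopes of about 4.7 percentage points of paralogs and about 89 clusters per additional Mb of genome. The sketch below interpolates linearly between those endpoints; it is only an illustration of the stated range, not the regression fitted to the 15 genomes, and the function name is hypothetical.

```python
def interpolate(size_mb, lo=(1.7, 25.0, 180.0), hi=(7.0, 50.0, 650.0)):
    """Linearly interpolate (% paralogs, # clusters) between the two table endpoints.

    lo and hi are (genome size in Mb, % paralogs, # clusters).
    """
    t = (size_mb - lo[0]) / (hi[0] - lo[0])
    pct_paralogs = lo[1] + t * (hi[1] - lo[1])
    n_clusters = lo[2] + t * (hi[2] - lo[2])
    return pct_paralogs, n_clusters


# Slopes implied by the endpoints of the rough range.
pct_per_mb = (50.0 - 25.0) / (7.0 - 1.7)
clusters_per_mb = (650.0 - 180.0) / (7.0 - 1.7)

# Example: a 4.0 Mb genome, the size listed above for Vibrio cholerae.
pct, clusters = interpolate(4.0)
```

Values outside the 1.7 to 7.0 Mb range would be extrapolations and should not be read from this sketch.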

    [Figure: Genome Size vs # Clusters]

    [Figure: Genome Size vs Avg Cluster Size]

Clusters Across Genomes

                                                    All species    Archaea
    Genes with ortholog                             44%            36%
    Clusters with correlated cluster                72%            65%
    Clusters that correlate with > 1 cluster        15%            10%
    Clusters that correlate to larger clusters      31%            21%
    Non-paralogous genes with correlated clusters   43%            37%

    [Figure: Genome Size vs Avg Cluster Size]

    [Figure: Genome Size vs # Clusters]

Normalized Cluster Size Across Genome

    Because of the linear relationship between genome size and cluster size, we have normalized the cluster sizes by genome size in order to highlight families with unusually large clusters. The graphs below illustrate the inter-genome comparison of cluster sizes after the cluster sizes have been normalized.
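One simple way to express such a normalization is genes per Mb of genome, sketched below. The exact formula used for the graphs is not stated here, so dividing by genome size in Mb is an assumption; the function name is hypothetical and the cluster sizes in the example are invented. The genome sizes are taken from the species list in Methods.

```python
GENOME_SIZE_MB = {
    "Mesorhizobium loti": 7.0,
    "Thermotoga maritima": 1.9,
}


def normalized_cluster_size(n_genes, genome_size_mb):
    """Cluster size expressed as genes per Mb of genome (assumed normalization)."""
    if genome_size_mb <= 0:
        raise ValueError("genome size must be positive")
    return n_genes / genome_size_mb


# Invented raw cluster sizes for one correlated family in two genomes.
raw_sizes = {"Mesorhizobium loti": 14, "Thermotoga maritima": 19}
normalized = {
    species: normalized_cluster_size(n, GENOME_SIZE_MB[species])
    for species, n in raw_sizes.items()
}
```

With these made-up numbers the raw sizes are similar, but after normalization the family in the smaller genome stands out as unusually large, which is the effect the normalization is meant to expose.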

    [Figure: Genome Size vs Avg Cluster Size]

    [Figure: Genome Size vs # Clusters]

Data

Excel file paralogs.xls contains:

Spreadsheet                      Description
SummaryWks                       Summary comparison of Methanosarcina versus other species
MAparalogsWks                    List of all paralogous Methanosarcina clusters and the names of all clustered genes
ClusterCorrelationWks            Correlation of all Methanosarcina clusters with clusters in each of the other species. Includes both raw cluster sizes and cluster sizes normalized by genome size

Graph                            Description
GenomeSize2ClusterSize           Graph of Genome Size vs Cluster Size
GenomeSize2#Clusters             Graph of Genome Size vs Number of Clusters
GenomeSize2%Paralogs             Graph of Genome Size vs % Paralogous Genes
GenomeSize2Orfs                  Graph of Genome Size vs # Orfs
ClusterSizeCorr(>=10)            Graph of correlated clusters, for all clusters >= 10 genes
ArchaeaClusterSizeCorr(>=10)     Graph of archaeal correlated clusters, for all clusters >= 10 genes
NormalizedClusterSize(>=10)      Graph of correlated clusters, for all clusters >= 10 genes
A_NormalizedClusterSize(>=10)    Graph of correlated clusters, for all clusters >= 10 genes

References

* Similar to the methods used for Vibrio cholerae (BLASTX, E<=1e-5, >60% length of query ORF) [1], Thermotoga maritima (BLASTX, E<=1e-5, >60% length of query ORF) [3], Helicobacter pylori (FASTA, >60% length of smaller ORF) [4], and Archaeoglobus fulgidus (FASTA, >60% length of smaller ORF) [2].

[1] Heidelberg, John F. et al. DNA sequence of both chromosomes of the cholera pathogen Vibrio cholerae. Nature 406, 477-483 (2000).

[2] Klenk, Hans-Peter et al. The complete genome sequence of the hyperthermophilic, sulphate-reducing archaeon Archaeoglobus fulgidus. Nature 390, 364-370 (1997).

[3] Nelson, Karen E. et al. Evidence for lateral gene transfer between Archaea and Bacteria from genome sequence of Thermotoga maritima. Nature 399, 323-329 (1999).

[4] Tomb, J.-F. et al. The complete genome sequence of the gastric pathogen Helicobacter pylori. Nature 388, 539-547 (1997).