Sizing up "SNP chips"

Large-scale analyses of genetic variation tend to be a sizeable gamble with a laboratory’s limited resources. Although a handful of new research tools now make it feasible to survey an individual’s genome for hundreds of thousands of genetic differences, these tools still provide only a limited view of the genome’s total variability. But the limitations may be smaller than believed, and the bet less risky. Because these tools are at the center of many large-scale efforts to identify the genetic variants associated with human disease, it is necessary to determine how well they capture common genetic variation and whether their performance can be further enhanced. This critical evaluation has now been carried out by Massachusetts General Hospital and Broad Institute scientists and is described in the May 21 online edition of Nature Genetics. Their work reveals that the currently available products cover more than half of the common variation in the human genome, and that, in some cases, this can be extended to more than 75% by incorporating data from the International Haplotype Map ("HapMap") Project.

"For the first time, scientists are poised with practical, genome-scale tools for finding the genetic variants that are associated with human diseases," said Mark Daly, a senior author of the study, an assistant professor in the Center for Human Genetic Research at Massachusetts General Hospital and an associate member of the Broad. "Our work indicates that these tools provide substantial coverage of the most common genetic differences, and this coverage can be extended even further using an analytical framework built upon HapMap data."

Since the HapMap’s groundbreaking efforts to catalogue human genetic differences began, three commercially available "whole-genome" products have appeared. These make it possible to scrutinize genetic variability by simultaneously analyzing many single nucleotide polymorphisms ("SNPs") in an individual's DNA. The new "chips" include two from Affymetrix, one with roughly 100,000 SNPs and the other with around 500,000 SNPs, and a chip from Illumina, with more than 300,000 SNPs. The SNPs sampled by these products were chosen based on distinct criteria and may therefore differ in their precise coverage of the genome, but all of the tools provide only a partial snapshot of human genetic variation: there are, at last count, an estimated 11 million variable sites in the human genetic code. Nevertheless, the tools’ emergence is a historic moment in medical genetics, making it possible for the first time to conduct comprehensive studies that correlate genetic differences with common human diseases.

To evaluate how well these chips perform, the scientists took advantage of the fact that the SNPs represented in the Affymetrix and Illumina products have already been analyzed in 270 reference samples from four geographically distinct populations as part of the HapMap project. They directly compared this information to two current "gold-standard" collections of SNP data: the heavily sequenced genomic regions from the ENCODE project, which have near-complete coverage of common SNPs (those with a frequency of at least 5%), and the roughly 3.9 million SNPs analyzed at sites throughout the genome in Phase II of the HapMap.

The authors of the Nature Genetics paper, first author Itsik Pe'er and co-authors Paul de Bakker, Julian Maller, Roman Yelensky, David Altshuler and Mark Daly, found that the Affymetrix and Illumina products on their own cover a substantial portion of the genetic variation commonly found among humans. For instance, in one of the HapMap panels of European ancestry, the 500K Affymetrix chip captures more than 60% of the SNPs represented in both the ENCODE regions and the Phase II HapMap data. Similar figures emerge from analyses of the combined Han Chinese plus Japanese panel. As expected, the scientists found a lower level of coverage for the Yoruba panel, an African population that is genetically "older" and more diverse and therefore requires a higher density of SNPs to achieve a comparable level of coverage.

By taking advantage of existing HapMap data, and of the fact that some SNPs are inherited together in blocks due to a genetic phenomenon called “linkage disequilibrium,” the scientists discovered that the coverage provided by the Affymetrix and Illumina tools can be stretched even further and, in some cases, extended beyond 75% of the common SNPs. The benefit of this approach is that SNPs not actually included on the chips can be assessed "in silico," without expending additional scientific resources. In addition, the researchers describe methods, based on Bayes' theorem, to further increase the statistical power for detecting disease-causing genetic variants.
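
The idea of coverage through linkage disequilibrium can be illustrated with a minimal sketch. It is not the authors' method. It assumes that an untyped SNP counts as "captured" when its squared correlation (r²) with at least one chip SNP reaches a threshold, here 0.8, a commonly used convention rather than a figure taken from this article. It also approximates linkage disequilibrium by correlating unphased genotype counts. The function names and the toy genotypes are hypothetical.

```python
def r_squared(x, y):
    """Squared Pearson correlation between two genotype vectors.

    Each vector holds one allele count (0, 1 or 2) per individual.
    This is a genotype-level approximation of the r^2 measure of
    linkage disequilibrium, used here only for illustration.
    """
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        # A SNP that does not vary in the sample carries no LD information.
        return 0.0
    return (cov * cov) / (var_x * var_y)


def coverage(genotypes, chip_snps, threshold=0.8):
    """Fraction of reference SNPs captured by a chip.

    A SNP is captured if it is on the chip itself or if some chip SNP
    is correlated with it at r^2 >= threshold (threshold is an assumption).
    """
    captured = 0
    for snp, g in genotypes.items():
        on_chip = snp in chip_snps
        tagged = any(
            r_squared(g, genotypes[tag]) >= threshold for tag in chip_snps
        )
        if on_chip or tagged:
            captured += 1
    return captured / len(genotypes)


# Hypothetical reference panel: four SNPs typed in six individuals.
toy_genotypes = {
    "snp_a": [0, 1, 2, 0, 1, 2],
    "snp_b": [0, 1, 2, 0, 1, 2],  # inherited together with snp_a
    "snp_c": [2, 1, 0, 2, 1, 0],  # same pattern, opposite allele coding
    "snp_d": [0, 0, 1, 2, 1, 2],  # only weakly correlated with snp_a
}

# A hypothetical chip that types only snp_a.
toy_coverage = coverage(toy_genotypes, {"snp_a"})
```

In this toy panel a chip that types a single SNP still captures three of the four SNPs, because two untyped SNPs are perfectly correlated with the typed one. Only the weakly correlated SNP is missed.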

Though geneticists eagerly await the next-generation tools that will offer even more complete coverage of the SNPs in the human genome, the Nature Genetics study demonstrates that the current products provide a sufficiently extensive view of genetic variation and the first practical opportunity to unearth the genetic differences that underlie common human diseases.

Paper(s) cited

Pe'er I, de Bakker PI, Maller J, Yelensky R, Altshuler D & Daly MJ. (2006) Evaluating and improving power in whole-genome association studies using fixed marker sets. Nature Genetics; doi:10.1038/ng1816