Haploview currently accepts input data in five formats:
standard linkage format, completely or partially phased
haplotypes, HapMap Project data dumps, PHASE format, and PLINK outputs.
The program can also automatically fetch phased HapMap data from the HapMap website.
It also takes a separate file with marker position information, as well as several auxiliary input files, described below. The five formats
are explained in depth below.
Linkage Format
Linkage data should be in the Linkage Pedigree (pre MAKEPED)
format, with columns of family, individual, father, mother,
gender, affected status and genotypes. The file should not
have a header line (i.e. the first line should be for the
first individual, not the names of the columns). Please note
that Haploview can only interpret biallelic markers;
markers with more than two alleles (e.g. microsatellites)
will not work correctly. A sample line from such a file
might look something like:
3 12 8 9 1 2 1 2 3 3 0 0 4 2
a b c d e f -----------g------------
(a) pedigree name
A unique alphanumeric identifier for this individual's family. Unrelated individuals should not share a pedigree name.
(b) individual ID
An alphanumeric identifier for this individual. It should be unique within the individual's family (see above).
(c) father's ID
Identifier corresponding to father's individual ID or "0" if unknown father. Note that if a father ID is specified, the father must also appear in the file.
(d) mother's ID
Identifier corresponding to mother's individual ID or "0" if unknown mother. Note that if a mother ID is specified, the mother must also appear in the file.
(e) sex
Individual's gender (1=MALE, 2=FEMALE).
(f) affection status
Affection status to be used for association tests (0=UNKNOWN, 1=UNAFFECTED, 2=AFFECTED).
(g) marker genotypes
Each marker is represented by two columns (one for each allele, separated by a space) and coded either ACGT or 1-4, where 1=A, 2=C, 3=G, 4=T.
A 0 in any of the marker genotype positions (as in the genotypes for the third marker above) indicates missing data.
It is also worth noting that this format can be used with
non-family based data. Simply use a dummy value for the
pedigree name (1, 2, 3...) and fill in zeroes for father and
mother ID. It is important that the "dummy" value for the ped
name be unique for each individual. Affection status can be
used to designate cases vs. controls (2 and 1, respectively).
Files should also adhere to the following guidelines:
Families should be listed consecutively within the file
(i.e. all the lines with the same pedigree ID should be adjacent).
If an individual has a nonzero parent ID, the parent should
be included in the file on its own line.
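As an illustration of the column layout described above, the following Python sketch splits one pedfile line into its six pedigree fields and its marker genotypes. It is not part of Haploview: the names `PedRecord`, `parse_ped_line` and `ALLELE_CODES` are hypothetical, and reporting alleles as letters with `None` for missing data is an assumption made for the example.

```python
from typing import NamedTuple, List, Optional, Tuple

# Numeric allele codes as described for the linkage format.
ALLELE_CODES = {"1": "A", "2": "C", "3": "G", "4": "T"}


class PedRecord(NamedTuple):
    family: str
    individual: str
    father: str
    mother: str
    sex: int
    affection: int
    # One (allele, allele) pair per marker; None marks missing data.
    genotypes: List[Tuple[Optional[str], Optional[str]]]


def parse_ped_line(line: str) -> PedRecord:
    fields = line.split()
    if len(fields) < 6 or (len(fields) - 6) % 2 != 0:
        raise ValueError("expected 6 pedigree columns plus two columns per marker")
    family, individual, father, mother, sex, affection = fields[:6]
    genotypes = []
    # Genotype columns come in pairs: fields 6 and 7, 8 and 9, and so on.
    for first, second in zip(fields[6::2], fields[7::2]):
        pair = tuple(
            None if code == "0" else ALLELE_CODES.get(code, code.upper())
            for code in (first, second)
        )
        genotypes.append(pair)
    return PedRecord(family, individual, father, mother,
                     int(sex), int(affection), genotypes)


record = parse_ped_line("3 12 8 9 1 2 1 2 3 3 0 0 4 2")
print(record.genotypes)
```

Applied to the sample line shown earlier, the third marker comes back as a pair of `None` values, matching the "0 0" missing-data entry.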
Phased Haplotypes
Haplotype data for Haploview's input must be formatted in
columns of Family, Individual and Genotypes. There
should be two lines (chromosomes) for each individual. This is
the standard format of Genehunter's TDT output. See the sample
below:
The data format uses the numerals 1-4 to represent genotypes,
the number zero to represent missing data, and the letter "h" to
represent a heterozygous allele. That is, if an individual is
heterozygous at a locus, both alleles should be "h" if the
phasing (which allele falls on which chromosome) is uncertain.
HapMap Project Data Dumps
Data from the HapMap Project
can be dumped by region using the GBrowse interface.
The saved data file is in a marker-per-line format which
can be loaded in Haploview.
GBrowse dumps only one file, which has one marker per line
and which includes familial relationships among the HapMap
samples as well as marker position information. The file format
has several header lines (beginning with "#") which Haploview
parses. Open the file by selecting the "Browse HapMap Data" option
and selecting the downloaded file.
If you wish to load data from another source in
HapMap style format, you will need to specify pedigree
information in the header of the file you've
created. This can be done by creating lines of the
following format at the top of your file:
#@ FAM01 NA0001 0 0 1 1
This data is the same as the pedfile format
discussed above. That is, the fields are
family,individual,father,mother,gender,affected
status. You would then replace the NAXXXX identifiers in
the header row of the HapMap file with your identifiers,
subject to two important constraints: they must be
unique across the entire dataset, not just within a
family and they must begin with the characters
NA.
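The header lines and the two identifier constraints described above could be generated and checked with a short Python sketch like the one below. The function name `hapmap_pedigree_header` and the tuple layout of its input are hypothetical choices for this example, not something Haploview provides.

```python
def hapmap_pedigree_header(rows):
    """Build '#@' pedigree header lines for a HapMap-style file.

    rows: iterable of (family, individual, father, mother, sex, affection).
    Raises ValueError if an individual ID does not start with 'NA' or is
    repeated anywhere in the dataset.
    """
    seen = set()
    lines = []
    for family, individual, father, mother, sex, affection in rows:
        if not individual.startswith("NA"):
            raise ValueError(f"{individual!r} does not begin with 'NA'")
        if individual in seen:
            raise ValueError(f"{individual!r} is not unique across the dataset")
        seen.add(individual)
        lines.append(f"#@ {family} {individual} {father} {mother} {sex} {affection}")
    return lines


print(hapmap_pedigree_header([("FAM01", "NA0001", "0", "0", 1, 1)]))
```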
HapMap PHASE Format
Data in the HapMap PHASE format can be loaded into
Haploview using three separate files. The first is the data file containing
binary allele information. The second is a sample file containing a single
column of the individual IDs used in the dataset. The third is a legend
file containing four columns: marker, position, 0, and 1. Only the legend
file requires a header; it is used to decode the information in
the data file. These files can be loaded in as GZIP compressed files using the
"Files are GZIP compressed" checkbox on the initial loading screen. For more
information on the HapMap PHASE format, please see the
HapMap PHASE readme.
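To make the role of the legend file concrete, here is a minimal Python sketch of how binary allele codes might be turned back into alleles. It assumes that each row of the data file holds one chromosome as space-separated 0/1 codes, one per marker, in legend order; that layout, the function name `decode_haplotype`, and the marker names in the usage lines are assumptions for illustration only, so consult the HapMap PHASE readme for the authoritative description.

```python
def decode_haplotype(legend_lines, data_row):
    """Translate one row of 0/1 codes into alleles using the legend.

    legend_lines: header line followed by 'marker position allele0 allele1' rows.
    data_row: space-separated 0/1 codes, one per legend marker (assumed layout).
    """
    alleles = []
    for line in legend_lines[1:]:  # skip the header row
        _marker, _position, allele0, allele1 = line.split()
        alleles.append((allele0, allele1))
    codes = data_row.split()
    if len(codes) != len(alleles):
        raise ValueError("data row and legend have different marker counts")
    decoded = []
    for (allele0, allele1), code in zip(alleles, codes):
        if code not in ("0", "1"):
            raise ValueError(f"unexpected allele code {code!r}")
        decoded.append(allele1 if code == "1" else allele0)
    return decoded


# Made-up legend and data row, for illustration only.
legend = ["marker position 0 1", "rs1 100 A G", "rs2 200 C T"]
print(decode_haplotype(legend, "0 1"))
```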
HapMap Download
Data in the HapMap PHASE format can also be automatically
downloaded into Haploview using the "HapMap Download" tab in the load screen by
specifying the HapMap Release, chromosome, analysis panel, and start and end positions
(in kb). These options can also be automatically filled in by querying the GeneCruiser
database with a gene or SNP ID. More information about the GeneCruiser database can be
found at the GeneCruiser website.
Marker Information File
The marker info file is two columns, marker name and
position. The positions can be either absolute chromosomal
coordinates or relative positions. It might look something
like this:
marker01 190299
marker02 190950
marker03 191287
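Reading the two required columns is straightforward; the Python sketch below parses one info-file line into a name and a position. The function name `parse_info_line` is hypothetical, and treating positions as integers is an assumption based on the sample above. Anything after the second column is ignored here.

```python
def parse_info_line(line):
    """Return (marker_name, position) from one marker info line."""
    parts = line.split()
    if len(parts) < 2:
        raise ValueError("expected a marker name followed by a position")
    # Assumption: positions are whole numbers, as in the sample file.
    return parts[0], int(parts[1])


sample = ["marker01 190299", "marker02 190950", "marker03 191287"]
print([parse_info_line(line) for line in sample])
```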
An optional third column can be included in the info file to
make additional notes for specific SNPs. SNPs with additional
information are highlighted in green on the LD display. For instance,
you could make note that the first SNP is a coding variant as follows:
PLINK Format
Output files from PLINK can be loaded into Haploview using the PLINK tab
on the initial loading screen. PLINK files must contain a header row, and at least
one column must be titled "SNP" and contain the marker IDs for the results
in the file. PLINK loading also requires a standard PLINK map or binary map file corresponding
to the markers in the output file. The map file can be either three or four headerless columns
(the Morgan distance column is optional). The map file can also be embedded in the results file as the
first few columns of the file using the "Integrated Map Info" checkbox. You can load in non-SNP
based files as well by checking the "Non-SNP" box. These files do not require a map file. You can
choose to only load in one chromosome from your results file using the "Only load results from Chromosome"
checkbox and selecting a chromosome from the dropdown list. You can also select which columns to load
from your results file by checking the "Select Columns" checkbox. For much more information
on PLINK outputs, please see Shaun Purcell's PLINK website.
Batch Load File
The "-batch" flag on the command line allows you to run
Haploview automatically (in nogui mode) on several files. Batch input
files should have one genotype file per line, along with an info file
(if desired) separated by a space. Filenames must conform to the
following rules:
Pedfile names must end in ".ped"
Phased haplotype file names must end in ".haps"
HapMap file names must end in ".hmp"
Info file names must end in ".info"
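The extension rules above can be checked before starting a batch run. The following Python sketch classifies the filenames on one batch-file line; it is only an illustration, and `SUFFIX_TO_KIND`, `classify_batch_line` and the filenames in the usage line are hypothetical.

```python
# File kinds keyed by the required filename endings listed above.
SUFFIX_TO_KIND = {
    ".ped": "pedfile",
    ".haps": "phased haplotypes",
    ".hmp": "HapMap",
    ".info": "info",
}


def classify_batch_line(line):
    """Return [(filename, kind), ...] for one space-separated batch line."""
    result = []
    for name in line.split():
        for suffix, kind in SUFFIX_TO_KIND.items():
            if name.endswith(suffix):
                result.append((name, kind))
                break
        else:
            # No known ending matched this filename.
            raise ValueError(f"{name!r} does not end in a recognised extension")
    return result


print(classify_batch_line("study.ped study.info"))
```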
The following example shows 2 pedfiles (with info files) and a
hapmap file:
For any given tab the information in the display can be
saved. For the data check and association test tabs, a simple
tab-delimited text file is generated from the tables. For the LD
and Haplotype tabs, data can either be dumped to text files or
the image can be saved to a PNG.
LD Text Output File
LD text output is a tab delimited set of columns
containing the various measures of LD used by the
program. Details for each column are shown below:
L1 and L2 are the two loci in question,
referenced by their number or name (if marker info file is
provided)
D' is the value of D prime between the
two loci.
LOD is the log of the likelihood odds
ratio, a measure of confidence in the value
of D'
r2 is the squared correlation coefficient between
the two loci
CIlow is the 95% confidence lower bound on
D'
CIhi is the 95% confidence upper bound on
D'
Dist is the distance (in bases) between the
loci, and is only displayed if a marker info file has been
loaded
T-int is a statistic used by the HapMap
Project to measure the completeness of information represented
by a set of markers in a region
Details about additional options for this output type can be
found below in the Export Options
section.
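Because the LD output is tab-delimited, it can be post-processed with standard tools. The Python sketch below lists marker pairs whose r2 meets a threshold. It assumes the file begins with a header row using the column labels given above (`L1`, `L2`, `r2`, and so on); the function name `strong_pairs`, the 0.8 default, and the input values in the usage lines are made up for illustration.

```python
import csv
import io


def strong_pairs(ld_text, min_r2=0.8):
    """Return (L1, L2) pairs whose r2 is at least min_r2.

    ld_text: contents of an LD text export, assumed to start with a
    tab-delimited header row containing 'L1', 'L2' and 'r2'.
    """
    reader = csv.DictReader(io.StringIO(ld_text), delimiter="\t")
    return [
        (row["L1"], row["L2"])
        for row in reader
        if float(row["r2"]) >= min_r2
    ]


# Made-up rows, not real Haploview output.
example = "L1\tL2\tD'\tLOD\tr2\n1\t2\t1.0\t5.2\t0.9\n1\t3\t0.4\t0.3\t0.1\n"
print(strong_pairs(example))
```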
LD PNG Output
When saving the LD table to a PNG, Haploview saves an image
using the current display settings.
This includes color scheme, zoom and proportional
spacing. Thus, in order to save a less detailed image to a PNG,
first zoom out, then export the tab. Note that Haploview cannot
save large datasets at the higher zoom levels. For more
information see the Export Options section
below.
Haplotype Text Output File
Haplotype output shows a block, its markers, the haplotypes
and their population frequencies, the crossover percentages to
the next block and the multiallelic D prime. Crossover percentages
are shown as a matrix with this block's haplotypes as the rows
and the next block's haplotypes as the columns. An example might look like:
In this example, the first block has 4 markers with 3
haplotypes displayed and the second block has 3 markers and 3
haplotypes. The tag SNPs for each block are (3,4) and (10,11)
respectively. The crossover percentage matrix can be read as follows:
80% of all samples have the pattern 3312-441, 3.1% have the
pattern 1144-441 and so forth.
Haplotype PNG Output
Saving the haplotype tab to a PNG produces an image using the
current display settings (such as haplotype frequency cutoff).
Single Marker Association Text Output File
Single marker association results are saved in a
tab-delimited text file with the following columns:
# is the marker number.
Name is the marker ID specified if an info file is loaded.
Chi Square is the chi square value for the marker.
p value is the significance level for the above chi square.
Trio (TDT) data only:
Overtransmitted is the allele overtransmitted to affected offspring.
T:U is the ratio of transmissions to non transmissions of the overtransmitted allele (see above).
Case-Control data only:
Major Alleles are the major alleles in the case and control populations respectively.
Case Control Ratios are the ratios (shown as either counts or quotients, depending on selected options) for the case and control populations, respectively.
Haplotype Association Text Output
Haplotype association text output is a tab-delimited file,
broken into sections by block. The columns are:
Haplotype is the sequence of alleles for this haplotype in this block.
Frequency is the population frequency for this haplotype.
Chi Square is the chi square value for the haplotype.
p value is the significance level for the above chi square.
Trio (TDT) data only:
T:U is the ratio of transmissions to non transmissions of the haplotype to affected offspring.
Case-Control data only:
Case Control Ratios are the ratios (shown as either counts or quotients, depending on selected options) for the case and control populations, respectively.
Permutation Text Output File
The output from the permutations tab shows the number of permutations performed and then a tab-delimited table with one row per permuted test and the following columns:
Name is the test name, which is either a marker name or a comma separated list of marker names then a tab then a comma separated set of alleles for those markers.
Chi Square is the observed association chi square for that test.
Permutation p-value shows the significance of the test among the permutation tests.
Tagger Text Output File
The Tagger text output begins with several pieces of summary information. More details on this can be found in the Tagger section. The rest of the output is divided into two sections. The first lists each marker, with the following columns:
Marker is the marker name.
Best Test is the test with the highest r2 to this marker.
r^2 w/test is the r2 between this marker and its test.
The second part consists of a list of the tests and the alleles they capture best.
Tagger Tests Dump
This file is the same format used by Haploview for custom association tests and exported by Tagger. It is discussed below in the auxiliary files section.
Tagger Tags Dump
This file is the same format used by Haploview for custom association tests and exported by Tagger. It is discussed below in the auxiliary files section.
Marker Check Text Output File
The marker check data is a tab-delimited file with the
following columns:
# is the marker number.
Name is the marker ID specified (only if an info file is loaded).
Position is the marker position specified (only if an info file is loaded).
ObsHET is the marker's observed heterozygosity.
PredHET is the marker's predicted heterozygosity (i.e. 2*MAF*(1-MAF)).
HWpval is the Hardy-Weinberg equilibrium p value, which is the probability that its deviation from H-W equilibrium could be explained by chance.
%Geno is the percentage of non-missing genotypes for this marker.
FamTrio is the number of fully genotyped family trios for this marker (0 for datasets with unrelated individuals).
MendErr is the number of observed Mendelian inheritance errors (0 for datasets with unrelated individuals).
MAF is the minor allele frequency (using founders only) for this marker.
Alleles are the major and minor alleles for this marker.
Rating is "BAD" if the marker failed any of the above tests and blank otherwise.
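The PredHET formula given above, 2*MAF*(1-MAF), can be worked through in a few lines of Python. The function name `predicted_heterozygosity` is hypothetical, and the range check reflects the fact that a minor allele frequency cannot exceed 0.5.

```python
def predicted_heterozygosity(maf):
    """Predicted heterozygosity for a biallelic marker: 2 * MAF * (1 - MAF)."""
    if not 0.0 <= maf <= 0.5:
        raise ValueError("minor allele frequency must lie between 0 and 0.5")
    return 2.0 * maf * (1.0 - maf)


# A marker with MAF 0.25 has predicted heterozygosity 2 * 0.25 * 0.75 = 0.375.
print(predicted_heterozygosity(0.25))
```

The value is largest (0.5) when both alleles are equally common and falls to 0 for a monomorphic marker.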
PLINK Table Text Output File
The PLINK text output is a tab-delimited file of the current view
of the data in the PLINK tab. Please note that while the filtering state
is preserved in this output, the sorting state is not.
Export Options
The "Export Options" item in the File Menu allows adjustment of
several parameters and allows the user to save any tab without
having to switch to it. Specifically, the LD tab allows the markers
to be filtered so that only some of them are output:
All
The default setting (and only one available for most tabs) is to use all the markers.
Marker Range
Generates the LD text or PNG file for only a specific range of markers.
Adjacent Markers
Generates the LD text file for only adjacent markers. This can be useful to view the T-int stat, which measures LD information content in the gaps between markers.
There is also an option to generate a "compressed" LD PNG,
which is useful for very large datasets. The image is shrunk
to an arbitrary zoom level which allows Haploview to save the PNG
with minimal memory usage. Images can also be exported as high
quality SVG files for use in publication. Please note that SVG
images are quite large and may require a large amount of memory.
Auxiliary Input Files
Blocks File
You can specify a set of blocks by loading a blocks
file. Each line is a space separated list of markers with one
block per line. For example:
1 2 3 4
9 10 11 12 13 14 15
This would create one block from markers 1-4 and another from
9-15. The first marker in the file is number 1 (not 0).
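A blocks file in this layout can be read with a few lines of Python. The sketch below is illustrative only; the function name `parse_blocks` is hypothetical, and rejecting marker numbers below 1 follows from the 1-based numbering noted above.

```python
def parse_blocks(text):
    """Return a list of blocks, each a list of 1-based marker numbers."""
    blocks = []
    for line in text.splitlines():
        if not line.strip():
            continue  # ignore blank lines
        markers = [int(token) for token in line.split()]
        if any(marker < 1 for marker in markers):
            raise ValueError("marker numbers start at 1, not 0")
        blocks.append(markers)
    return blocks


print(parse_blocks("1 2 3 4\n9 10 11 12 13 14 15\n"))
```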
Analysis Track File
You can add an analysis track along the top of the LD
display by loading a file with two columns, <position>
<value>. Haploview will plot the values continuously with
respect to the positions of the markers, so the positions should
use the same coordinates as the marker info file. For
example:
1000 0.3
2000 1.7
3000 11.0
4000 2.3
5000 4.6
This would plot a line from position 1000 to 5000. The values
can be of any units or magnitude, as Haploview scales the
analysis track to the bounds of the values.
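One way to picture scaling to the bounds of the values is a simple min-max rescaling, sketched below in Python. This is an assumption about what the scaling amounts to, not Haploview's actual drawing code, and the function name `scale_track` is hypothetical.

```python
def scale_track(points):
    """Rescale track values to the range 0..1 using their own min and max.

    points: list of (position, value) pairs. Positions are left unchanged.
    """
    values = [value for _, value in points]
    lowest, highest = min(values), max(values)
    span = highest - lowest
    # If every value is identical there is nothing to scale; use 0.0.
    return [
        (position, (value - lowest) / span if span else 0.0)
        for position, value in points
    ]


track = [(1000, 0.3), (2000, 1.7), (3000, 11.0), (4000, 2.3), (5000, 4.6)]
print(scale_track(track))
```

With the example track above, the smallest value (0.3 at position 1000) maps to 0 and the largest (11.0 at position 3000) maps to 1, which is why the units of the input do not matter.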
Custom Association Tests File
You can specify a set of custom association tests for Haploview to perform. The format takes both single marker tests and multi-marker tests (which require you to specify alleles for those markers). The format is one test per line. Each line contains either a single marker name, or several comma separated marker names followed by a tab and then comma separated alleles for each marker. This format is exported by Haploview using the "Dump Tests" button in the Tagger Results panel and by Paul de Bakker's Tagger webpage.
For instance, the following example would create 5 tests: markers 1, 2 and 3 individually, all the alleles (haplotypes) of the block 4,5,6 and the CAA haplotype of the block 12,13,14:
N.B. Using a Custom Association Tests File requires a marker info file, since the tests file reads the marker names as specified in the info file.
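The line layout described above (marker names, a tab, then alleles) can be parsed as in the following Python sketch. The function name `parse_test_line` and the marker names in the usage lines are hypothetical, and returning `None` when a line has no allele column is a choice made for this example.

```python
def parse_test_line(line):
    """Split one tests-file line into (marker_names, alleles).

    alleles is None when the line carries no tab-separated allele column.
    """
    line = line.rstrip("\n")
    if "\t" in line:
        names_part, alleles_part = line.split("\t", 1)
        alleles = [allele.strip() for allele in alleles_part.split(",")]
    else:
        names_part, alleles = line, None
    names = [name.strip() for name in names_part.split(",")]
    if alleles is not None and len(alleles) != len(names):
        raise ValueError("each marker needs exactly one allele")
    return names, alleles


print(parse_test_line("marker1"))
print(parse_test_line("marker12,marker13,marker14\tC,A,A"))
```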
Tagger Marker Include/Exclude File
You can specify a list of markers for Tagger to include or exclude from those markers available for selection as tag SNPs. In either case the format is the same: one marker name per line. The following file could be used to either include or exclude markers 1, 7 and 9:
marker1
marker7
marker9
N.B. Using a Tagger Include/Exclude File requires a marker info file, since it reads the marker names as specified in the info file.