GeneCluster 2 Reference Guide

Table of contents

Introduction
Loading Data
Data Preprocessing
Data Analysis
Reports
Lymphoma Outcome Prediction Example


Introduction

This document describes the operation of GeneCluster 2.0, developed by the Whitehead Institute/MIT Center for Genome Research (WICGR). It introduces the basic features of GeneCluster 2 and provides both descriptions of those features and examples of their use.


Loading Data into GeneCluster

The first step in running GeneCluster is loading the raw expression data. Selecting the Open... item from the File pull down menu on GeneCluster's main panel produces the file Open window.

Two data file formats are currently supported by GeneCluster's file Open... task. One is the WICGR RES file format (*.res). The other is the GCT (Gene Cluster Text) file format (*.gct). The main difference between the two is that the RES file format contains labels for each gene's absent (A) versus present (P) calls as generated by Affymetrix's GeneChip software. GeneCluster also supports loading the CLS file format (*.cls) for reading in sample class labels. The details of these formats are described below.

RES File Format

This is a tab delimited file format that is organized as follows:

  • The first line contains a list of the sample identifier labels associated with each of the columns in the remainder of the file. Double tabs separate the sample identifier tags because each sample contains two data values (an expression value and a present/absent marker).
    • Line format: Description (tab) Accession (tab) (sample 1 name) (tab) (tab) (sample 2 name) (tab) (tab) ... (sample N name)
    • For example: Description Accession DLBC1_1 DLBC2_1 ... DLBC58_0
  • The second line contains a list of sample descriptions. Currently, GeneCluster ignores these descriptions.
    • Line format: (tab) (sample 1 description) (tab) (tab) (sample 2 description) (tab) (tab) ... (sample N description)
    • For example, our RES file creation tool places the sample data file name and scale factors in this row: MG2000062219AA MG2000062256AA/scale factor=1.2172 ... MG2000062211AA/scale factor=1.1214
  • The remainder of the data file contains data for each of the genes. There is one line for each gene and two columns for each of the samples. The first two fields in the line contain the description and name of the gene (names and descriptions can contain spaces since fields are separated by tabs). Each sample has two pieces of data associated with it: an expression value and an Absent/Marginal/Present (A/M/P) call. The A/M/P calls are generated by microarray scanning software (such as Affymetrix's GeneChip software) and indicate the confidence in the measured expression value. Currently, GeneCluster ignores the A/M/P call. The description is optional but the tab following it is not.
    • Line format: (gene description) (tab) (gene name) (tab) (sample 1 data) (tab) (sample 1 A/P call) (tab) (sample 2 data) (tab) (sample 2 A/P call) (tab) ... (sample N data) (tab) (sample N A/P call)
    • For example: AFFX-BioB-5_at (endogenous control) AFFX-BioB-5_at -104 A -152 A ... -44 A
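
The RES layout described above can be read with a few lines of code. The following Python sketch is only an illustration of the format as documented here, not GeneCluster's own loader; the function name parse_res and the returned structure are hypothetical.

```python
def parse_res(text):
    """Parse RES-formatted text as described above (illustrative sketch)."""
    lines = text.splitlines()
    header = lines[0].split("\t")
    # Sample names occupy every second column after Description and Accession,
    # because each sample spans two columns (value and A/M/P call).
    samples = header[2::2]
    genes = []
    # lines[1] holds the sample descriptions, which are skipped here.
    for line in lines[2:]:
        if not line.strip():
            continue
        fields = line.split("\t")
        description, name = fields[0], fields[1]
        values = [float(v) for v in fields[2::2]]  # expression values
        calls = fields[3::2]                       # A/M/P calls
        genes.append((name, description, values, calls))
    return samples, genes
```

Each returned gene entry is a tuple of (name, description, expression values, calls), so the calls can be kept or discarded by the caller.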

GCT File Format

This is a tab delimited file format that is generally organized as follows:

  • The first line contains the version string and is always the same for this file format
    • For example: #1.2
  • The second line contains numbers indicating the size of the data table that is contained in the remainder of the file. Note that the name and description columns are not included in the number of data columns.
    • Line format: (# of data rows) (tab) (# of data columns)
    • For example: 7129 58
  • The third line contains a list of the sample identifiers associated with each of the columns in the remainder of the file.
    • Line format: Name (tab) Description (tab) (sample 1 name) (tab) (sample 2 name) (tab) ... (sample N name)
    • For example: Name Description DLBC1_1 DLBC2_1 ... DLBC58_0
  • The remainder of the data file contains data for each of the genes. There is one line for each gene and one column for each of the samples. The first two fields in the line contain name and descriptions for the genes (names and descriptions can contain spaces since fields are separated by tabs). The number of lines should agree with the number of data rows specified on line 2.
    • Line format: (gene name) (tab) (gene description) (tab) (col 1 data) (tab) (col 2 data) (tab) ... (col N data)
    • For example: AFFX-BioB-5_at AFFX-BioB-5_at (endogenous control) -104 -152 -158 ... -44

Occasionally, GCT files are organized in a transposed structure where the columns represent genes and the rows represent samples. The user should take care to check the organization of the file to ensure that the correct preprocessing is performed on the file (a warning panel stating this will appear when the user loads a GCT file). See sample *.gct files that come with the distribution for complete examples of the format.
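
As an illustration of the GCT layout, the Python sketch below reads the version line, the size line, the sample header, and the data rows, and checks the row count against line 2. It is a minimal reading of the format as described in this section; parse_gct is a hypothetical name and is not part of GeneCluster.

```python
def parse_gct(text):
    """Parse GCT-formatted text as described above (illustrative sketch)."""
    lines = text.splitlines()
    if lines[0].strip() != "#1.2":
        raise ValueError("unexpected version line: %r" % lines[0])
    # Line 2: number of data rows and number of data columns.
    n_rows, n_cols = (int(x) for x in lines[1].split("\t")[:2])
    # Line 3: Name, Description, then one identifier per sample.
    samples = lines[2].split("\t")[2:2 + n_cols]
    rows = []
    for line in lines[3:]:
        if not line.strip():
            continue
        fields = line.split("\t")
        values = [float(v) for v in fields[2:2 + n_cols]]
        rows.append((fields[0], fields[1], values))
    if len(rows) != n_rows:
        raise ValueError("expected %d data rows, found %d" % (n_rows, len(rows)))
    return samples, rows
```

The sketch does not try to detect a transposed file; that check is left to the user, as noted above.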

CLS File Format

The CLS files are simple files created to load class information into GeneCluster. These files use spaces to separate the fields.

  • The first line of a CLS file contains numbers indicating the number of samples and number of classes. The number of samples should correspond to the number of samples in the associated RES or GCT data file.
    • Line format: (number of samples) (space) (number of classes) (space) 1
    • For example: 58 2 1
  • The second line in a CLS file contains names for the class numbers. The line should begin with a pound sign (#) followed by a space.
    • Line format: # (space) (class 0 name) (space) (class 1 name)
    • For example: # cured fatal/ref.
  • The third line contains numeric class labels for each of the samples. The number of class labels should be the same as the number of samples specified in the first line.
    • Line format: (sample 1 class) (space) (sample 2 class) (space) ... (sample N class)
    • For example: 0 0 0 ... 1
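
The three-line CLS layout can be sketched in Python as follows. The function name parse_cls is hypothetical, and the consistency checks simply mirror the constraints stated above (label count equals sample count, name count equals class count).

```python
def parse_cls(text):
    """Parse CLS-formatted text as described above (illustrative sketch)."""
    lines = [line for line in text.splitlines() if line.strip()]
    # Line 1: number of samples, number of classes, then a literal 1.
    n_samples, n_classes = (int(x) for x in lines[0].split()[:2])
    # Line 2: '#' followed by the class names.
    if not lines[1].startswith("#"):
        raise ValueError("second line must start with '#'")
    names = lines[1][1:].split()
    # Line 3: one numeric class label per sample.
    labels = [int(x) for x in lines[2].split()]
    if len(names) != n_classes:
        raise ValueError("class name count does not match line 1")
    if len(labels) != n_samples:
        raise ValueError("label count does not match line 1")
    return names, labels
```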

Creating Data Files for GeneCluster

Data files for use in GeneCluster can be created automatically by a special tool such as WICGR's Res File Creation Tool or manually by standard tools such as Microsoft Excel and text editors. Creating GCT or RES files manually is relatively easy since most spreadsheet and database programs allow you to export your data into a file in a tab-separated format. Once this has been done, you only need to load the file into a text editor or word processor, make the necessary format changes, and save the file as raw text.

This section gives an example of how to create a GCT file when the data has initially been stored in a Microsoft Excel spreadsheet. An example of a data set stored in an Excel spreadsheet is shown below:

Once the basic GCT file has been created as shown above, it is simple to save the data in a tab separated format. First select the Save As... item from the File pull-down menu. Then, from the dialog box, select Text (Tab delimited) (*.txt) from the file type popup menu, enter a name for the file, and click the Save button. (A message will warn you that this will only save the active sheet; just ignore it by clicking OK.) The saved file will have the correct format for GeneCluster, except that there will be extra tabs at the end of the first two lines in the above example. This can be fixed easily by loading the saved document into Microsoft Word and deleting all of the tabs to the right of the "#1.2" and the "7129 58" in the above example. Once the extra tabs have been deleted, save the document as a text file: choose Save As... from the File menu, specify Text Only (*.txt) as the file type by using the popup menu, enter a file name, and press the Save button. To load the file into GeneCluster, rename it to change the file extension from *.txt to *.gct. A similar procedure can be followed to create RES files using Microsoft Excel and Word.


Data Preprocessing

Once data has been loaded into GeneCluster, it is usually necessary to perform some preprocessing and filtering of the data. To access the data set preprocessing window, select the Preprocess... item from the Dataset pull down menu. This will produce the following processing dialog window:

Within this window, the user can specify a series of filters. When the Apply Filters button is clicked, each filter is applied in succession to the data set specified in the Dataset to Filter: box. Clicking the Apply Filters button creates a new data set that is accessible from all areas of the program; the original data set is left unchanged. The Create Name button automatically generates a name for the new data set so that it can be easily identified in other windows. The first time that filters are applied, a new data set name is created automatically; after that, the user must tell the program to create new names. The four buttons (D, U, X, and XX) above the Filters to Apply window allow the user to edit the displayed list of filters. D and U move the selected filter down and up, respectively, in the order of application specified by the list. The X button deletes the selected filter from the list. The XX button deletes all of the filters from the list. The filtering actions are described in the section below, and the following section gives an example of typical filtering.

Filter Descriptions

This section describes the basic filter operations that are possible within GeneCluster.

  • Panels (e.g., Columns (0-3,5,9)):  This filtering action allows the user to specify a subset of the columns in the data set to be included in the filtered data set. GeneCluster recognizes both comma separated lists of column numbers and hyphenated column range specifications.
  • Clip Values:  This preprocessing action allows the user to set minimum and maximum thresholds for the data. Any value in the data set that is less than the value in the Min box is set to the value in the Min box.  Similarly, any value that is greater than the value in the Max box is set to the value in the Max box.
  • Keep if...:  This preprocessing option allows the user to apply a variation filter to the data. This variation filter will remove rows from the data set whose values do not vary greatly. For a given row, minVal is the minimum value in that row and maxVal is the maximum value in that row.  If maxVal/minVal is greater than the specified ratio (3 in this case) and maxVal - minVal is greater than the specified difference (100 in the above dialog), then the row passes the filter. Any rows that do not pass the filter are excluded from the resulting data set.
    • Note there are additional modifiers that allow for outliers in the data set.  For example, if all the values in a row are around 100, then this row will have maxVal / minVal scores around 1, but if there is one column with value 1000, then the maxVal / minVal will be around 10, and the row could pass the variation filter even though this may not be a meaningful variation. One can avoid the problem of outlying data points by using either the Exclude or BMS options.
    • Exclude Option:  The Exclude Option of the variation filter allows the user to trim the ranked list of values before determining the minimum and maximum values. Instead of setting minVal and maxVal to the minimum and maximum values in the row, this option will exclude the top High values and the bottom Low values, and assign the minVal and maxVal from the resulting set. For example, if row_i = (1000, 100, 100, 100, 5), then
    • The base variation filter:  minVal = 5 and maxVal = 1000 and row_i will pass the Max / Min >= 3 criterion and also the Max - Min >= 100 criterion.
    • An Exclude High 1 and Exclude Low 0 filter:  minVal = 5 and maxVal = 100 and so it will pass the Max / Min >= 3 criterion but not the difference criterion, since maxVal - minVal = 95 < 100.
    • An Exclude High 1 and Exclude Low 1 filter:  maxVal = 100 and minVal = 100 and so neither the ratio nor the difference criteria are passed.
    • BMS Freq Filter:  The parameter is the number of columns that must be outside half the fold ratio (Max / Min / 2) of the median of the row values. (Count the number of columns where abs(x - median)/median > minFold/2 and pass row if count is greater than the value in the BMS Freq. Filter box.)
  • Normalize to Mean and Variance: This preprocessing option scales each row to have the specified Mean and Variance  (x_new = (newVar/oldVar) * (x_old - (newMean - oldMean))).
  • Transpose: This preprocessing option transposes the data set and makes rows into columns and vice versa. (RES files usually need to be transposed.)
  • Scale by minimum value: Scales everything to the minimum value.
  • Randomize by bootstrap sampling columns: This preprocessing option randomizes the data set by bootstrap sampling columns from the data set with replacement.
  • Shift global min to: This preprocessing option shifts the values in the data set by adding an offset so that the new minimum value in the data set is equal to the value specified in the box.
  • Shift row mins to: This preprocessing option shifts the values in the row by adding an offset so that the new minimum value in the row is equal to the value specified in the box.
  • x -> Log10(x): This preprocessing option replaces each value in the data set with the base 10 logarithm of the value.
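
To make the Clip Values and Keep if... descriptions concrete, the Python sketch below applies them to a single row and reproduces the Exclude example given above. It is an illustration under stated assumptions, not GeneCluster's implementation: the function names are hypothetical, the comparisons use >= as in the worked example, and a row whose (trimmed) minimum is not positive is treated as failing the ratio test.

```python
def clip(row, min_value, max_value):
    """Clip Values: force every entry into [min_value, max_value]."""
    return [min(max(x, min_value), max_value) for x in row]


def passes_variation_filter(row, min_ratio=3.0, min_diff=100.0,
                            exclude_high=0, exclude_low=0):
    """Keep if...: ratio and difference test, with optional Exclude trimming."""
    ranked = sorted(row)
    # Exclude option: drop the bottom exclude_low and top exclude_high values.
    trimmed = ranked[exclude_low:len(ranked) - exclude_high]
    if not trimmed:
        return False
    min_val, max_val = trimmed[0], trimmed[-1]
    if min_val <= 0:
        # Assumption: the ratio is undefined for non-positive minima,
        # so such rows fail (clipping beforehand avoids this case).
        return False
    return (max_val / min_val >= min_ratio) and (max_val - min_val >= min_diff)


row = [1000, 100, 100, 100, 5]
print(passes_variation_filter(row))                                 # base filter
print(passes_variation_filter(row, exclude_high=1))                 # Exclude High 1, Low 0
print(passes_variation_filter(row, exclude_high=1, exclude_low=1))  # Exclude High 1, Low 1
```

With the example row, only the base filter passes, matching the three cases listed under the Exclude Option.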


Data Analysis

Once data has been loaded, the user can select from several types of data analysis: unsupervised learning (e.g., SOM clustering), supervised learning (e.g., KNN and WV predictors), and marker analysis.

Marker Analysis

Marker analysis helps the user to determine which genes are most closely correlated with a class and how significant that correlation is for each gene. Marker analysis is started by selecting the Marker Selection / Neighborhood Analysis... item from the Data Analysis pull down menu.

Within the marker class window, the user can select a data set and a class template, and see how well the genes in their data set correlate with the different classes as specified by the template. Recall that the class template describes the class for each sample (corresponding to rows in the transposed data set). Thus the number of rows in the data set must match the number of class assignments in the class template. The genes are assumed to be the columns.

Compute Neighbors (Discrete Metrics): The user chooses the data set and class template, and also the number of genes to consider for correlation. The gene ranking method is chosen by selecting either Signal to Noise (S2N) or t-Test from the Distance Function pull down menu. The Signal-to-Noise feature selection method looks at the difference of the class means scaled by the sum of the standard deviations: $(\mu_1 - \mu_2) / (\sigma_1 + \sigma_2)$, where $\mu_1$ is the mean of class 1 and $\sigma_1$ is the standard deviation of class 1. The t-Test statistic is the same as the Signal-to-Noise statistic except that the denominator is $(\sigma_1^2 + \sigma_2^2)^{1/2}$. Note that $(\sigma_1 + \sigma_2) \ge (\sigma_1^2 + \sigma_2^2)^{1/2}$ always, and that the two statistics are identical when $\sigma_1 = 0$ or $\sigma_2 = 0$. Thus, the Signal-to-Noise statistic penalizes genes that have high variance in both classes more than genes that have a high variance in one class and a low variance in the other. This bias is perhaps useful for biological samples, e.g. in a case of tumor versus normal where in one class the gene is working normally and regulated relatively strictly, and in the other class the gene is broken and varying more widely. The Class Estimate pull down menu allows the user to choose either the Median or the Mean as the class estimate in the Distance Function. Clicking the Run button causes the Num. Neighbors genes most correlated with each class to be displayed in the Genes table on the right half of the panel. The table has 12 columns, which have the following meanings:

  • This column with heading # contains an index for the rows in the table. It is useful when you want to return to the original ordering for the marker analysis results (clicking on the # heading sorts the table by the index).
  • The Class column contains the class label for which the gene in each row is more highly expressed. The class label is either a 0 or 1 or the label specified in the CLS file on the line with the '# class0 class1' information.
  • The Score column contains the absolute value of the signal-to-noise ratio for the row's gene.
  • The Mean0 column specifies the mean of the gene in the class 0 samples.
  • The Std0 column specifies the standard deviation of the class 0 samples.
  • The Mean1 column specifies the mean of the gene in the class 1 samples.
  • The Std1 column specifies the standard deviation of the class 1 samples.
  • The Perm 1 % column contains for each gene the signal-to-noise ratio for the one percent level from the permutation of the class labels.
  • The Perm 5 % column contains for each gene the signal-to-noise ratio for the five percent level from the permutation of the class labels.
  • The Perm (user) column contains for each gene the signal-to-noise ratio for the user set p-value (default 0.5 or 50%) from the permutation of the class labels.
  • The Feature column gives the name for the row's gene where the name comes from the input data file.
  • The Desc column gives the description (if any) of the gene where the description comes from the input data file.

Within the table, the rows are sorted by the class that the markers are correlated with followed by the values of the signal-to-noise ratio. Values can be sorted by any other column by clicking on the column heading.
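
The two ranking statistics described in this section can be written out directly. The Python sketch below is an illustration only: the function names are hypothetical, classes are assumed to be labelled 0 and 1, the Mean class estimate is used, and the sample standard deviation (statistics.stdev) is an assumption, since this guide does not say which estimator GeneCluster uses.

```python
import math
import statistics


def class_mean_std(values, labels, cls):
    """Mean and sample standard deviation of one gene within one class."""
    xs = [v for v, c in zip(values, labels) if c == cls]
    return statistics.mean(xs), statistics.stdev(xs)


def signal_to_noise(values, labels):
    """(mean0 - mean1) / (std0 + std1) for a single gene."""
    m0, s0 = class_mean_std(values, labels, 0)
    m1, s1 = class_mean_std(values, labels, 1)
    return (m0 - m1) / (s0 + s1)


def t_test_score(values, labels):
    """Same numerator, with denominator sqrt(std0^2 + std1^2)."""
    m0, s0 = class_mean_std(values, labels, 0)
    m1, s1 = class_mean_std(values, labels, 1)
    return (m0 - m1) / math.sqrt(s0 ** 2 + s1 ** 2)


gene = [1, 2, 3, 7, 8, 9]
labels = [0, 0, 0, 1, 1, 1]
print(signal_to_noise(gene, labels), t_test_score(gene, labels))
```

Because the sum of the standard deviations is never smaller than the square root of the sum of their squares, the absolute Signal-to-Noise score is never larger than the absolute t-Test score for the same gene. Both functions divide by zero if a gene is constant within both classes; a real implementation would need to handle that case.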

Permutation Testing: Running the application in this portion of the panel performs a permutation test to assess the significance of the score for each gene. (For a more complete description of marker permutation testing, refer to: T.R. Golub, D.K. Slonim, P. Tamayo, C. Huard, M. Gaasenbeek, J.P. Mesirov, H. Coller, M. Loh, J.R. Downing, M.A. Caligiuri, C.D. Bloomfield, and E.S. Lander. Molecular Classification of Cancer: Class Discovery and Class Prediction by Gene Expression Monitoring. Science, Oct. 15, 1999: 531-537, and the supplemental information on the website http://www-genome.wi.mit.edu/cgi-bin/cancer/publications/pub_paper.cgi?mode=view&paper_id=43.) The permutation test can be used to determine whether the top marker genes with respect to a biologically meaningful phenotype (e.g. morphology) are statistically significant. This is done by comparing the signal-to-noise scores for the top marker genes with the corresponding scores obtained under random permutations of the class labels (phenotype). The test permutes the class assignments N times (where N is the value in the Num Perms box). For each permutation, the genes are ranked. A histogram of signal-to-noise scores is then built for each rank: one histogram for all N top markers (k=1), another for all N second-best markers (k=2), and so on. These histograms serve as reference distributions for the best match, the second-best match, etc.; for a given value of k, different genes may contribute to the histogram. Note that the correlation structure of the data is preserved by this procedure. For each value of k, the 1%, 5%, and user-set (from the P-Value box) significance levels are then determined. This test controls both for the number of genes ranked (the more genes ranked, the greater the chance that one will score highly against a random template) and for correlation between genes.

Underneath the P-Value and Num Perms text boxes is a text item displaying "Min number permuted classes available:". This item shows the number of different shufflings of the class labels that are available for use in the permutation test (in the multi-class case, which is treated as a series of one vs. all tests, it is the minimum number available across those tests). The minimum number of permutations is equal to (N1+N2)!/(2*N1!*N2!), where N1 and N2 are the numbers of samples in class 1 and class 2, respectively. A warning message box will pop up when a small number of samples produces fewer than 100 possible permutations, which may result in less accurate and unstable p-value estimates. In these cases, the user may want to use an analytical method to calculate the p-value based upon checking all possible permutations of the labels.
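
The rank-based permutation procedure can be sketched in Python as follows. This is a minimal illustration, not GeneCluster's code: the function names are hypothetical, scores are taken as absolute signal-to-noise values with the sample standard deviation, labels are assumed to be 0 and 1, and the significance level for each rank k is read off as an empirical quantile of the permuted scores at that rank.

```python
import math
import random
import statistics


def s2n(values, labels):
    """Signal-to-noise score for one gene (classes labelled 0 and 1)."""
    a = [v for v, c in zip(values, labels) if c == 0]
    b = [v for v, c in zip(values, labels) if c == 1]
    spread = statistics.stdev(a) + statistics.stdev(b)
    return (statistics.mean(a) - statistics.mean(b)) / spread if spread else 0.0


def min_permutations(n1, n2):
    """(N1+N2)!/(2*N1!*N2!), floored to an integer."""
    return math.factorial(n1 + n2) // (2 * math.factorial(n1) * math.factorial(n2))


def rank_thresholds(genes, labels, num_perms, p_value, seed=0):
    """For each rank k, the score reached by a fraction p_value of permutations."""
    rng = random.Random(seed)
    per_rank = [[] for _ in genes]
    shuffled = list(labels)
    for _ in range(num_perms):
        rng.shuffle(shuffled)  # permute the class assignments
        scores = sorted((abs(s2n(g, shuffled)) for g in genes), reverse=True)
        for k, score in enumerate(scores):
            per_rank[k].append(score)  # histogram data for rank k
    thresholds = []
    for scores in per_rank:
        scores.sort()
        idx = math.ceil((1.0 - p_value) * len(scores)) - 1
        thresholds.append(scores[max(0, min(idx, len(scores) - 1))])
    return thresholds
```

An observed score at rank k that exceeds the returned threshold for that rank would be called significant at the chosen level under these assumptions. For two classes of three samples each, min_permutations(3, 3) evaluates the formula above.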

Compute Neighbors (Continuous Metrics tab): The Continuous Metrics marker analysis performs a nearest neighbors analysis for a particular gene by finding other genes whose expression values follow similar trends across the samples. The user sets the number of nearest neighbors to find by entering a value in the Num. Neighbors box. Currently, there are four choices in the Distance Function pull down menu: Cosine, Euclidean, Manhattan, and Pearson. In the formulas below, $i$ is the sample number, $x_i$ is the named gene's expression value, $y_i$ is the expression value of the gene being tested, and sums run over all of the samples. The Euclidean distance is given by $d_E = (\sum_i (x_i - y_i)^2)^{1/2}$. The cosine distance is given by $d_C = \sum_i x_i y_i / (\sum_i x_i^2 \cdot \sum_i y_i^2)^{1/2}$. The Manhattan distance is given by $d_M = \sum_i |x_i - y_i|$. The Pearson distance is given by $d_P = 1 - |r|$, where $r$ is the Pearson correlation. Gene names or descriptions can be searched by entering a name or keyword in the Gene Name box and clicking the Search button (all genes will be listed if Search is pressed with the Gene Name box left empty). After pressing Search, the user selects one of the genes in the list and finds its neighbors by clicking the Run button. This creates a Genes table in the right half of the panel with three columns: 1) Feature, which contains the gene's identifier from the input file; 2) Desc, which contains the gene's description from the input file; and 3) Score, which contains the calculated score for the gene. The Genes table is ordered by score.
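
The four distance functions can be transcribed directly into Python. The sketch below follows the formulas as given in this section; the function names are hypothetical. Note that the cosine expression as written equals 1 for parallel vectors, so it behaves like a similarity rather than a distance; how GeneCluster orders genes by it is not stated here.

```python
import math


def euclidean(x, y):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))


def manhattan(x, y):
    return sum(abs(a - b) for a, b in zip(x, y))


def cosine(x, y):
    # As written in the text: sum(x*y) / sqrt(sum(x^2) * sum(y^2)).
    dot = sum(a * b for a, b in zip(x, y))
    return dot / math.sqrt(sum(a * a for a in x) * sum(b * b for b in y))


def pearson_distance(x, y):
    # 1 - |r|, where r is the Pearson correlation of x and y.
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return 1.0 - abs(cov / math.sqrt(var_x * var_y))
```

The cosine and Pearson functions divide by zero for an all-zero or constant expression profile; a real implementation would need to guard against that.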

View Lists: Once the calculation has been run, the user can view either the absolute or relative pink-o-gram (the Show Absolute Color Plot... and Show Relative Color Plot... buttons, respectively) for the listed genes, or save a text file containing the Genes table (with the Save... button). Clicking either the Show Absolute Color Plot... or the Show Relative Color Plot... button opens a Color Gram Browser panel (this panel is identical in form to the panel created when the Display Data button on the main panel is selected). The functionality of the Color Gram Browser panel is described in detail in the Reports section.

 

Unsupervised Learning (Clustering)

Unsupervised learning allows the user to discover classes in the data set. GeneCluster provides SOM clustering as the unsupervised learning method. Clustering can be performed on either the samples or the genes by use of the transpose function. Unsupervised learning is started by selecting the Find Classes... item from the Data Analysis pull down menu.

The Class Finder panel allows the user to run SOM clustering in a batch mode.  The SOM clustering algorithm is described in detail in: Tamayo, P., Slonim, D., Mesirov, J., Zhu, Q., Dmitrovsky, E., Lander, E.S., Golub, T.R. (1999) Interpreting gene expression with self-organizing maps: Methods and application to hematopoietic differentiation. Proc. Natl. Acad. Sci. USA 96:2907-2912 (http://www.pnas.org/cgi/content/abstract/96/6/2907). Selecting the SOM tab causes the Class Finder panel to use Self Organizing Map clustering. The current SOM implementation requires the user to specify, before running, the number of clusters into which the data will be organized. This number is specified in the Num Clusters: box. A range of values for Num Clusters can be entered, and the program will automatically run the algorithm for each number of clusters in the range.  Since the outcome of the clustering algorithm can vary depending on the initial starting point, the user can enter the number of random starting points to try for each cluster size in the Num Seeds: box.  Initial Seed, which is the seed for the random number generator, is exposed to allow the user to recreate a given session at a later time (as opposed to allowing the program to generate a random initial seed, which could result in different outcomes even if all the other parameters are identical).  Lastly, the user can specify Iterations, which tells the algorithm how many times it should try to refine the clusters.  Initially, this value can be set low for faster exploration, but it should be set high (e.g., 50,000) for good convergence.

Basic Parameters

  • Num Clusters:  specifies the range of cluster counts to try.
  • Iterations:  specifies how many times to optimize the grouping of data objects.
  • Initial Seed:  the seed for the random number generator.
  • Num Seeds:  the number of random starting points to try for each cluster in the Num Clusters range.


Advanced Parameters

Unless the user is an expert user of SOMs, it is recommended that the advanced parameters be left at their default settings.

  • SOM Rows, SOM Cols:  For the SOM, a given NumClusters (say 12) can represent multiple SOM geometries, e.g. (1x12), (2x6), (3x4).  By default, all of these geometries are tried by the batch algorithm.  The SOM Rows and SOM Cols settings allow the user to override the batch setting and generate a single SOM with the given geometry.
  • Initialization:  The SOM algorithm starts from a set of random centroids.  These centroids can be initialized by:
    • Random Vectors: new vectors are randomly generated.
    • Random Datapoints:  actual datapoints are randomly selected to use as the initial centroids.
  • Neighborhood: The neighborhood function determines how centroids near to the target centroid are updated.
    • Gaussian Neighborhood: All centroids get updated and they are weighted by a Gaussian centered on the target centroid, with a standard deviation of sigma.
    • Bubble Neighborhood: Centroids within sigma get a full update and centroids outside of sigma get no update.
  • Alpha_i, Alpha_f: The initial and final learning weights. Centroid updates are weighted by the learning rate.
  • Sigma_i, Sigma_f: The initial and final sigmas that determine the size of the update neighborhood around the target centroid.
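
The advanced parameters can be illustrated with a small Python sketch of the pieces they control: the grid geometries for a given number of clusters, the two neighborhood functions, and a single centroid update. This is not GeneCluster's SOM implementation. All names are hypothetical, the geometry list assumes only rows <= cols shapes are distinct, the linear schedule between the initial and final alpha and sigma is an assumption, and the Gaussian form exp(-d^2 / (2 sigma^2)) is the usual textbook choice.

```python
import math


def som_geometries(num_clusters):
    """All (rows, cols) grids with rows <= cols and rows * cols == num_clusters."""
    return [(r, num_clusters // r)
            for r in range(1, math.isqrt(num_clusters) + 1)
            if num_clusters % r == 0]


def neighborhood_weight(dist, sigma, kind="gaussian"):
    """Update weight for a centroid at grid distance dist from the target."""
    if kind == "gaussian":
        return math.exp(-dist ** 2 / (2.0 * sigma ** 2))
    if kind == "bubble":
        return 1.0 if dist <= sigma else 0.0
    raise ValueError("unknown neighborhood: %r" % kind)


def schedule(initial, final, step, total_steps):
    """Assumed linear interpolation from the initial to the final value."""
    return initial + (final - initial) * step / max(total_steps - 1, 1)


def update_centroids(centroids, grid, x, winner, alpha, sigma, kind="gaussian"):
    """Move each centroid toward data point x, weighted by its neighborhood."""
    wr, wc = grid[winner]
    for i, (r, c) in enumerate(grid):
        w = alpha * neighborhood_weight(math.hypot(r - wr, c - wc), sigma, kind)
        centroids[i] = [ci + w * (xi - ci) for ci, xi in zip(centroids[i], x)]


print(som_geometries(12))
```

For 12 clusters the sketch lists the same three geometries mentioned above. With the bubble neighborhood, centroids farther than sigma from the target are left unchanged, while the Gaussian neighborhood moves every centroid by a smoothly decreasing amount.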

Supervised Learning (Building a Predictor)

Supervised learning allows the user to build and test predictors using training and testing data sets. GeneCluster currently provides k nearest-neighbors (KNN) and weighted voting (WV) predictors as supervised learning methods. Selecting the Build Predictor... item from the Data Analysis pull down menu starts supervised learning.

The Build Predictor panel enables the user to use information about the samples in their data set (e.g. tumor type), to guide feature selection (e.g. gene selection) and to build models that will be able to predict the sample labels based on the values of the features. Examples of information associated with a sample are: tumor type, treatment outcome, p53 mutation status, etc. These labels are frequently referred to as dependent variables in machine learning literature. Here, the features are typically genes, but they don't have to be and they are often referred to as independent variables. The objective of the applications accessed through this panel is learning a model that can predict the dependent variables based on the values of the independent variables. In the remainder of this section, features are assumed to be genes.

Inputs Section:

The Inputs section of the Build Predictor panel is used to select loaded data sets and sample labels.  The data sets and class label files must have been previously loaded with the Open... item from the File pull down menu on GeneCluster's main panel. Any desired or needed preprocessing must also have been performed previously. The default item displayed for the Training Set is the most recent data set loaded or preprocessed. The default item displayed for the Class Vector pull down list is the most recently loaded or used CLS file. If there is a mismatch between the number of rows in the selected data set and the number of labels in the CLS file, a warning will be shown and the Run button will be disabled. There is also a Preprocess pull down menu in the Inputs section of the Build Predictor panel that permits the user to do some additional preprocessing. The current choices in the Preprocess pull down menu are Rank, which replaces the gene expression values with their rank, and Ratio, which selects the top N features and forms all possible ratios of expression values among the top N genes. The Cross-Validation Results Base Name box gives the user a place to enter the base part of the name for the cross-validation results object. The full cross-validation object name is a combination of the contents of the Cross-Validation Results Base Name box, the number of features, and the model type.

Feature Selection Section:

The Feature Selection section of the Build Predictor panel is used to specify the number of features and the type of feature selection the model will use to try to predict the tumor type. Typically, adding genes potentially adds information so that the model will be better able to learn the mapping from genes to labels, but each additional feature also adds some associated noise. At some point, the amount of noise introduced by adding a feature outweighs the information gain produced by that feature, and the accuracy of the models will decrease. In the Features box, the user can enter either a single number of features or a range of features. When a range is entered, a model will be built for each specific number of features in the range. E.g. if the user entered 5-10, then a 5 feature model would be built, as well as 6, 7, 8, 9, and 10 feature models. A wild card * can also be entered in the Features box, which will build models of $2^n$ features with n=0,1,2,... up to and including the maximum number of features. A '+' entered into the Features box causes the program to run the following set of features: 1-10,15,20,30,40,50,100,250. A '++' entered into the Features box causes the program to run the following set of features: 1-20,15,20,30,40,50,60,70,80,90,100. Entering a number into the Permute Class Vector box causes that number of random permutations of the class vector to be run and summary statistics to be stored in the Histogram Views object on the main GeneCluster panel.
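
The Features box conventions can be illustrated with a short Python sketch that expands an entry into the list of model sizes to build. This is not GeneCluster's parser: the function name is hypothetical, only the single number, range, '*' and '+' forms are handled, and capping the '*' and '+' lists at the maximum number of features is an assumption.

```python
def feature_counts(spec, max_features):
    """Expand a Features box entry into a list of feature counts (sketch)."""
    spec = spec.strip()
    if spec == "*":
        # Powers of two: 1, 2, 4, ... not exceeding max_features.
        counts, n = [], 1
        while n <= max_features:
            counts.append(n)
            n *= 2
        return counts
    if spec == "+":
        preset = list(range(1, 11)) + [15, 20, 30, 40, 50, 100, 250]
        return [n for n in preset if n <= max_features]
    if "-" in spec:
        low, high = (int(part) for part in spec.split("-", 1))
        return list(range(low, high + 1))
    return [int(spec)]


print(feature_counts("5-10", 100))
print(feature_counts("*", 20))
```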

The method of feature selection is controlled in the Feature Selection section of the Build Predictor panel (currently, S2N (signal-to-noise) is the only option for feature selection). The signal-to-noise feature selection method looks at the difference of the means (or medians) in each of the classes, scaled by the sum of the standard deviations: $(\mu_1 - \mu_2) / (\sigma_1 + \sigma_2)$, where $\mu_i$ is either the mean or the median of class $i$, depending on the selection in the Class Estimate pull down menu, and $\sigma_i$ is the standard deviation of class $i$. If the denominator were $(\sigma_1^2 + \sigma_2^2)^{1/2}$ and $\mu_i$ were the mean, then this would be the statistic used in the T-test. Note that $(\sigma_1 + \sigma_2) \ge (\sigma_1^2 + \sigma_2^2)^{1/2}$ always, and that the two statistics are identical when $\sigma_1 = 0$ or $\sigma_2 = 0$. Thus, our statistic penalizes genes that have high variance in both classes more than genes that have a high variance in one class and a low variance in the other. This bias is perhaps useful for biological samples, e.g. in a case of tumor vs. normal where in one class the gene is working normally and regulated relatively strictly, and in the other class the gene is broken and varying more widely. Checking the Two Sided check box in the S2N tab panel causes the method to choose an equal number of features that are highly expressed in each class; otherwise, features are ranked according to the absolute value of the signal-to-noise ratio. The Binary Distinction pull down menu on the S2N panel is important only when there are more than two classes in the data set. When trying to predict more than two classes, the signal-to-noise feature selection method requires that the prediction be broken up into a set of pairwise predictors using either a 1 against All distinction (for each class x, build a model of x versus not x) or an All Pairs distinction (for classes x, y, and z, build models of x vs. y, x vs. z, and y vs. z).
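
The Two Sided option and the two Binary Distinction choices can be sketched in Python as follows. The function names and mode strings are hypothetical, and the handling of an odd feature count in the two-sided case (the extra feature goes to the positive side) is an assumption.

```python
from itertools import combinations


def select_features(scores, n, two_sided):
    """Pick n gene indices from signed signal-to-noise scores (sketch)."""
    if two_sided:
        # Equal numbers from each direction of the score.
        ascending = sorted(range(len(scores)), key=lambda i: scores[i])
        n_negative = n // 2
        n_positive = n - n_negative
        return ascending[::-1][:n_positive] + ascending[:n_negative]
    # One-sided: rank by absolute score only.
    by_abs = sorted(range(len(scores)), key=lambda i: abs(scores[i]), reverse=True)
    return by_abs[:n]


def binary_distinctions(classes, mode):
    """Break a multi-class problem into two-class problems (sketch)."""
    if mode == "one_vs_all":
        return [(c, [other for other in classes if other != c]) for c in classes]
    if mode == "all_pairs":
        return list(combinations(classes, 2))
    raise ValueError("unknown mode: %r" % mode)


scores = [3.0, -2.5, 0.1, 2.0, -0.2, 1.5]
print(select_features(scores, 4, two_sided=False))
print(select_features(scores, 4, two_sided=True))
print(binary_distinctions(["x", "y", "z"], "all_pairs"))
```

In the example, the one-sided selection takes three positively scored genes and one negatively scored gene, while the two-sided selection takes two of each.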

Generation Section

Within the Generation section of the Build Predictor panel, you can set parameters to either Build or Validate a model. Under the Validate tab, the user sets parameters for a cross-validation test of the data. Choosing the Cross Validate radio button allows the user to select one of three standard cross-validation test methods: Leave 1 Out, 10 Fold, or 4 Fold. Leave-one-out cross-validation cycles through each of the N samples in the data set, removing it from the data set, using the remaining N-1 samples to train a model, and testing the model on the held-out sample. The overall performance of the model is specified by the total number of errors on the held-out samples. Ten-fold and four-fold cross-validation are similar to leave-one-out cross-validation, except that the samples are split into ten or four subsets and the algorithm cycles through holding out each of the ten or four subsets of data for testing. Choosing the Dataset Splits radio button allows the user to set parameters to iteratively split the data into test and training data sets. The NumSplits parameter specifies how many times the data gets split, while the Percent parameter sets the percentage of samples that get placed into the training data set. By selecting the Build tab of the Generation section, the user can set the algorithm to create a model that can be tested on a separate data set. This model can be saved or used in the Apply Predictor panel to test the model on a separate data set.
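
The two validation schemes can be sketched in Python as index generators. This is only an illustration: the function names are hypothetical, and the way samples are assigned to folds (interleaved here) and shuffled for splits (seeded random here) are assumptions, since this guide does not describe how GeneCluster does it.

```python
import random


def cross_validation_folds(n_samples, n_folds):
    """Yield (train, test) index lists; n_folds == n_samples is leave-one-out."""
    indices = list(range(n_samples))
    for k in range(n_folds):
        test = indices[k::n_folds]  # assumption: interleaved fold assignment
        train = [i for i in indices if i not in test]
        yield train, test


def dataset_splits(n_samples, num_splits, percent, seed=0):
    """Yield num_splits random (train, test) splits with percent% in training."""
    rng = random.Random(seed)
    n_train = round(n_samples * percent / 100)
    for _ in range(num_splits):
        shuffled = rng.sample(range(n_samples), n_samples)
        yield sorted(shuffled[:n_train]), sorted(shuffled[n_train:])
```

Calling cross_validation_folds(10, 10) reproduces leave-one-out on ten samples, while cross_validation_folds(10, 4) holds out subsets of two or three samples at a time.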

We avoid using training set error as a measure of performance because, if we use the same data to evaluate the model as was used to build it, the error estimate will be too low because of over-fitting. That is, if we apply the model to new data, we would not expect it to do as well as the training set error suggests. What we are really interested in is the model's test set error, that is, the model's performance on unseen data. Many bioinformatics applications have thousands of genes and only a handful of samples. With this type of data set, it is possible to get a very low training set error, e.g. < 10% error rate, and yet still do no better than chance when using the model to predict classes for new samples.

The cross-validation error rate can be used as an approximation to test set error when there are not enough samples to create separate training and test sets. This cross-validation estimate works by generating a series of train-test sets. The first step is to decide how many samples to hold out for each iteration's test set; we will assume leave-one-out cross-validation for this example. (Leave-one-out cross-validation allows you to use the maximum amount of data for training the models, while other types of cross-validation will more closely approximate the train-test results for smaller training data sets.) For example, suppose we have a data set with 10 samples. The program will leave out sample 1, select features and build a model on the remaining samples (2-10), and then try to predict the class of sample 1. It then considers the training set with samples (1, 3-10) and the hold-out set of (2), etc. The errors on each hold-out set are summed to produce the leave-one-out cross-validated error rate.

Within cross-validation, feature selection is performed separately for each sample held out. Since supervised feature selection (the class labels are used as part of the process for choosing relevant genes) is used, using all the samples to select features, and then performing cross-validation, would create an "information leak". That is, knowledge about a held-out sample would have been used to help select features, and even though we didn't use the sample to fit the model parameters, a possibly large bias is still introduced into the error rate. For our typical data sets, we could again get < 10% cross-validated error when using all the samples to choose features, and still have random performance on the test set. There is a separate model being built for each sample that is left out, each with a possibly different set of genes. So the cross-validated error rate is an average across different models. An alternate approach would be to select a fixed set of features using some other knowledge, and then have all the models use this fixed feature set. The features would have to be chosen without knowledge of the class labels.

Finally, note that with cross-validation the successive models are not independent. For example, two samples may be similar enough that if one is in the training set, then the other will always be predicted correctly in the test set. Thus, using a binomial model as a null hypothesis for the cross-validated error rate will tend to overestimate the statistical significance of the cross-validated error. This is because the distribution of the cross-validated error has higher variance (more weight in the tails) than the binomial distribution.
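
The point about performing feature selection inside each cross-validation iteration can be sketched in Python. Everything named here is hypothetical, and the nearest-class-mean classifier is only a simple stand-in for KNN or Weighted Voting; the part that matters is that select_features sees only the training samples of the current iteration.

```python
from statistics import mean, stdev


def s2n(a, b):
    denom = stdev(a) + stdev(b)
    return (mean(a) - mean(b)) / denom if denom else 0.0


def select_features(samples, labels, n_features):
    """Rank genes by |S2N| using only the samples passed in."""
    scores = []
    for g in range(len(samples[0])):
        a = [s[g] for s, y in zip(samples, labels) if y == 0]
        b = [s[g] for s, y in zip(samples, labels) if y == 1]
        scores.append((abs(s2n(a, b)), g))
    return [g for _, g in sorted(scores, reverse=True)[:n_features]]


def nearest_mean_predict(train, train_labels, genes, sample):
    """Stand-in classifier: choose the class whose mean profile is closer."""
    dist = {}
    for cls in (0, 1):
        rows = [s for s, y in zip(train, train_labels) if y == cls]
        dist[cls] = sum((sample[g] - mean(r[g] for r in rows)) ** 2 for g in genes)
    return min(dist, key=dist.get)


def loocv_errors(samples, labels, n_features):
    """Count leave-one-out errors, reselecting features for every held-out sample."""
    errors = 0
    for i in range(len(samples)):
        train = samples[:i] + samples[i + 1:]
        train_labels = labels[:i] + labels[i + 1:]
        # Features are chosen without looking at sample i (no information leak).
        genes = select_features(train, train_labels, n_features)
        if nearest_mean_predict(train, train_labels, genes, samples[i]) != labels[i]:
            errors += 1
    return errors
```

Moving the select_features call above the loop and passing it all samples would reproduce the information leak described above.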

Algorithm Section

Within the Algorithm section of the Build Predictor panel, the user can choose the prediction algorithm by selecting one of the tabbed panels. The algorithms generally assume the genes are independent, and their accuracy depends on how well the individual gene expression levels correlate with each class. Currently there are two algorithm choices: KNN and Weighted Voting. Selecting the KNN tab sets the algorithm to the K-Nearest Neighbor algorithm. The basic KNN algorithm essentially memorizes the training set, and then when a new point is presented, it looks at the K closest points from the training set and classifies the point as belonging to whichever class has the majority among these K points. E.g., for a K = 5 classifier trained on two classes (A, B), if the 5 nearest neighbors contained 3 points from A and 2 from B, then the new point would be put in class A. There is a huge literature on this type of classifier, and it has the interesting property of being what Vapnik calls a transduction algorithm: no explicit model for the probability density of the classes is formed; each point is classified locally from the surrounding points. The user of the KNN algorithm has several algorithmic choices. The Number of Neighbors box allows the user to set the value for K. The Distance pull down menu selects the distance metric that determines which points are closest (currently the only choice is the Cosine distance measure). The Weighting Method pull down box allows the user to choose a method of giving weight to the class of each of the K neighbors based upon the neighbor's characteristics. Weighting method choices are none (gives all K neighbors equal weights), By 1/k (weighs neighbors by the reciprocal of the rank of the neighbor's distance, e.g., the closest neighbor is given weight 1/1, the next closest neighbor is given weight 1/2, etc.), and By distance (weighs neighbors by the reciprocal of the distance).
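
The KNN options above can be sketched in Python as follows. The names are hypothetical and this is not GeneCluster's implementation; in particular, taking the cosine distance to be one minus the cosine similarity and guarding against a zero distance are both assumptions.

```python
import math
from collections import defaultdict


def cosine_distance(u, v):
    """Assumed definition: 1 - cosine similarity of the two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / norm


def knn_predict(train, train_labels, sample, k, weighting="none"):
    """Classify sample from its k nearest training points."""
    ranked = sorted(
        (cosine_distance(row, sample), label)
        for row, label in zip(train, train_labels)
    )[:k]
    votes = defaultdict(float)
    for rank, (dist, label) in enumerate(ranked, start=1):
        if weighting == "none":
            weight = 1.0               # all K neighbors count equally
        elif weighting == "1/k":
            weight = 1.0 / rank        # closest gets 1/1, next 1/2, ...
        else:                          # "distance"
            weight = 1.0 / max(dist, 1e-12)  # assumption: avoid dividing by zero
        votes[label] += weight
    return max(votes, key=votes.get)
```

With five training points of which the three nearest to a new sample belong to class A, a call with k=5 and no weighting returns A, matching the 3-versus-2 example above.
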
Selecting the Weighted Voting tab chooses the weighted voting method of classification. This algorithm is very similar to a naïve Bayes algorithm, i.e. each input (gene) is assumed independent and each contributes a weight or vote for a class; the class receiving the greatest number of votes is the predicted class. The user can select from the Preprocessing pull down menu None, Log10 (logarithm base 10 of the expression values), or Log10NormFeatures (logarithm base 10 of the expression values followed by subtracting the mean log expression value and dividing by the standard deviation of the log expression value). The user should consider log transforming the data before using this algorithm. The weighted voting algorithm finds the decision boundaries halfway between the class means: b_x = (m_class0 + m_class1)/2 for each gene x in the feature set. To predict the class of a test sample y, each gene x in the feature set casts a vote V_x = S_x (g_x - b_x), where g_x is the expression value of gene x in the test sample and S_x is the signal-to-noise ratio for gene x; the final vote for class 0 or 1 is sign(Σ_x V_x). The strength or confidence in the prediction of the winning class is (V_win - V_lose)/(V_win + V_lose), where V_win and V_lose are the vote totals for the winning and losing classes (i.e., the relative margin of victory for the vote). (Refer to: T.R. Golub, D.K. Slonim, P. Tamayo, C. Huard, M. Gaasenbeek, J.P. Mesirov, H. Coller, M. Loh, J.R. Downing, M.A. Caligiuri, C.D. Bloomfield, and E.S. Lander. Molecular Classification of Cancer: Class Discovery and Class Prediction by Gene Expression Monitoring. Science, Oct. 15, 1999: 531-537, and the supplemental information on the website http://www-genome.wi.mit.edu/cgi-bin/cancer/publications/pub_paper.cgi?mode=view&paper_id=43 for a more complete description of the weighted voting method.)
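
The weighted voting rule can be sketched in Python. The function names are hypothetical, and two choices are assumptions: the signal-to-noise ratio is computed as (m_class0 - m_class1) / (s_class0 + s_class1) with sample standard deviations, so that a positive vote favors class 0, and ties go to class 0.

```python
from statistics import mean, stdev


def train_weighted_voting(class0_rows, class1_rows):
    """Return one (S_x, b_x) pair per gene from the training samples."""
    params = []
    for g in range(len(class0_rows[0])):
        v0 = [row[g] for row in class0_rows]
        v1 = [row[g] for row in class1_rows]
        m0, m1 = mean(v0), mean(v1)
        s_x = (m0 - m1) / (stdev(v0) + stdev(v1))  # signal-to-noise ratio
        b_x = (m0 + m1) / 2                        # boundary halfway between means
        params.append((s_x, b_x))
    return params


def predict_weighted_voting(params, sample):
    """Return (predicted class, confidence) for one test sample."""
    votes = [s_x * (g_x - b_x) for (s_x, b_x), g_x in zip(params, sample)]
    v0 = sum(v for v in votes if v > 0)    # total vote for class 0
    v1 = sum(-v for v in votes if v < 0)   # total vote for class 1
    predicted = 0 if v0 >= v1 else 1       # same as the sign of the summed votes
    v_win, v_lose = max(v0, v1), min(v0, v1)
    total = v_win + v_lose
    confidence = (v_win - v_lose) / total if total else 0.0
    return predicted, confidence
```

A sample on which all genes agree receives a confidence of 1, while a sample whose genes split their votes receives a confidence closer to 0.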

Applying a Predictor

If a predictor model has been saved, one can apply it to a test set. This is done by selecting the Apply Predictor... item from the Data Analysis pull down menu.

Within the Apply Predictor window, the user needs to select the test data set, the predictor to apply to the test data set, and a class assignments set for the test data set. The predictor is run on the test data set by clicking on the Run button. Results from applying the predictor to the test data set can be viewed in the same way as the cross-validation testing results (see Prediction Results section below).

Evaluating a Predictor

Once a prediction has been made, it can be evaluated with the Fisher Test using the Evaluate Predictor panel. Selecting the Evaluate Predictor... item from the Data Analysis pull down menu will open the following panel for performing the Fisher Test.

This panel performs a Fisher test on the contingency table (confusion matrix) of a set of prediction results. This is one method of evaluating the significance of a set of prediction results.
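
For a two-class problem, a Fisher exact test on the 2 x 2 confusion matrix can be sketched in Python as follows. The function name is hypothetical, and the two-sided form used here is an assumption; this guide does not say which variant of the test GeneCluster reports.

```python
from math import comb


def fisher_exact_2x2(a, b, c, d):
    """Two-sided p-value for the table [[a, b], [c, d]] with fixed margins."""
    row1, row2, col1, n = a + b, c + d, a + c, a + b + c + d

    def prob(x):
        # Hypergeometric probability of x in the top-left cell.
        return comb(row1, x) * comb(row2, col1 - x) / comb(n, col1)

    observed = prob(a)
    low, high = max(0, col1 - row2), min(row1, col1)
    # Sum the probabilities of all tables at least as unlikely as the observed one.
    return sum(
        prob(x) for x in range(low, high + 1)
        if prob(x) <= observed * (1 + 1e-9)
    )
```

Here a and d would hold the counts of correct predictions for the two classes and b and c the counts of errors; a small p-value indicates that the agreement between predicted and true classes is unlikely to arise by chance.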


Reporting

This section describes several different panels for visualizing and saving data and results.

Display Data

A number of data objects internal to GeneCluster can be viewed and/or saved from the main GeneCluster panel. For example, clicking on a data set produces the panel shown below. From this panel, the user can Export Dataset..., Display Data (color scaled by row)... in a color gram browser, Display Data (color scaled globally)... in a color gram browser, or display the Correlation... between samples. Both of the Display Data windows bring up Color Gram Browser windows like those described below in the Color Gram Browser Section.

Other objects allow other types of data to be displayed and/or saved, such as Predictor models, sets of Prediction Results, and Histogram Views of permutation tests.

Color Gram Browser

Data sets and results from the marker selection / neighborhood analysis can be viewed using the color gram browser panel. To start the color gram browser click on either the Display Data (color scaled by row) ... or the Display Data (color scaled globally) ... button for a selected data set in the main GeneCluster panel as shown below.

Or alternatively, the color gram browser window can be opened from the Marker Selection / Neighborhood Analysis panel by clicking on the Show Absolute Color Plot ... or the Show Relative Color Plot... button as shown below.

Clicking on the Show Relative Color Plot... button as shown above produces the following Color Gram Browser panel.

There is a choice of six tab panels on the bottom of the Color Gram Browser panel: a Display Options panel, an Export Data panel, a Sort Rows and Columns panel, a Grids and Grid Sizes panel, a Legend panel, and a Cell Information panel. Shown above is the color gram panel with the Display Options tab panel selected. Within the color gram display, cells and labels can be selected, copied, and pasted into other documents. Individual cells, column names, or row names can be selected by clicking on them with the left mouse button, and ranges of cells, column names, or row names can be selected by left clicking on a cell while holding down the shift key. The selected cells can be copied to the clipboard by pressing <CTRL>+C and subsequently pasted into Excel worksheets or other documents. Copying cells from the color gram and pasting them produces a table of the expression values for the selected cells rather than the colored squares. Within the Display Options tab panel, the user has a number of options regarding the appearance of the displayed color gram. On the Show Labels subpanel, the user can remove row and/or column labels from the color gram display by unchecking the corresponding check box. On the Color Scheme subpanel, the user can choose between radio buttons for Relative and Absolute color schemes: the Relative choice causes the colors for expression values to be scaled as a function of the number of standard deviations relative to the mean, and the Absolute choice uses an absolute color scale. When choosing the Absolute color scheme, a Logarithmic color response usually produces better color grams. Within the Data Orientation subpanel, the user can alter the orientation of the color gram by clicking on the Transposed radio button. Within the Enable Fly-over Text subpanel, the user can turn off the floating fly-over text that appears as the user mouses over the color gram by unchecking the In Color Grid check box.
This fly-over text displays column and row names and descriptions and expression values for the cell that the cursor is currently pointing to.

The Color Gram Browser panel with the Export Data tab panel selected is shown below.

On the right side of the Export Data tab panel is an Export Image subpanel. The Export Image subpanel is used to save the color gram image to a file. The Use format pull down menu lets the user select the type of image file to save, with the current choices being PNG, png, pnm, jpeg, tiff, and bmp. The PNG and jpeg formats both save fairly compact files that can be displayed by a large number of programs (including the Microsoft Office programs and most web browsers). Clicking on the Save Image... button causes a file browser panel to pop up where the user can select the directory in which to save the file and enter the name for the saved image file. Saving an image also causes an image of the legend to be saved in the same directory, using the same format, in a file named after the color gram file with "_image" appended. On the left side of the Export Data tab panel is a Create Datasets from selected features subpanel. The Create Datasets from selected features subpanel allows the user to create new data sets for use within GeneCluster (or for exporting) that contain only a subset of the features and/or samples of the original data set. This capability is especially useful when used in conjunction with the sorting feature described below. To use this feature, the user first selects the desired rows and columns either by clicking on them in the color gram image or by checking the box in the Use all labels box. Then the user types a name in the Name: box and clicks on the Create New Dataset button. This data set will then be available for use by GeneCluster's other tools and for export outside GeneCluster.

The Color Gram Browser panel with the Sort Rows and Columns tab panel selected is shown below.

The Sort Rows and Columns tab panel allows the user to sort columns and rows (samples and genes) based upon string matches in the name and description fields. This tab panel also allows the user to manually assign categories to rows or columns that can be used for sorting. Clicking on the Sort by rows... button causes the following window to pop up.

From the sorting window, the user can either sort the feature names or descriptions directly or, as a more interesting option, click on the Categorize column... button to create a set of categories based upon the feature names or descriptions. Clicking on the Categorize column... button opens the following Categorize window.

The user can then enter in keywords to search for in a particular column or columns. These keywords should be assigned category numbers that can subsequently be sorted. For example, we can search all columns for the keyword combinations "chloride channel" (class 1), "zinc finger" (class 2), "transcription factor" (class 3), and "protein" (class 4). (Items in the column that do not get a category assignment by the keyword search default to a category 0 during the sort). These keywords and categories are entered in the panel below.

Multiple word keywords are searched using an "and" function with all words in the keyword list required in the specified column for a row to constitute a match. The user can make the search case sensitive by unchecking the Case-insensitive search check box on the Categorize panel. Colors were assigned to categories in the example (for display with the color gram) by clicking in a cell in the Color column of the Categorize panel. Selecting a cell in the Color column brings up the Pick a Color color selection window shown below.

Colors are selected by clicking on a color swatch from either the main palette or the recent colors palette. Alternate methods of color selection are available in the HSB and RGB tab panels. Once a color has been selected, the user should click on OK to accept the color selection.
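
The keyword search described above for the Categorize panel can be sketched in Python. The function name is hypothetical, and two behaviors are assumptions: words are matched as substrings, and when several keyword entries match, the first one in the list wins (this guide does not say how such conflicts are resolved).

```python
def categorize(text, keyword_categories, case_insensitive=True):
    """Return the category of the first entry whose words all occur in text."""
    haystack = text.lower() if case_insensitive else text
    for keyword, category in keyword_categories:
        words = keyword.lower().split() if case_insensitive else keyword.split()
        # "and" semantics: every word of a multi-word keyword must be present.
        if all(word in haystack for word in words):
            return category
    return 0  # rows without a match default to category 0


# The keyword list from the example above.
rules = [
    ("chloride channel", 1),
    ("zinc finger", 2),
    ("transcription factor", 3),
    ("protein", 4),
]
```

With these rules, a description containing both "zinc" and "finger" is assigned category 2 even if it also contains "protein", because the zinc finger entry appears earlier in the list.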

After a name has been supplied in the New header label text box on the Categorize panel, the user can click on the Apply button to perform the keyword search on the selected column. Clicking on the Apply button for the above example produces the following updated version of the sort panel.

This adds a new column to the table displaying the results of the keyword search, with the name that was supplied in the New header label text box. We can now sort the data using this category label column by clicking on the Sort Table… button. Clicking the Sort Table… button produces the following panel.

The Sort panel is organized and operates similarly to the sort panel within Microsoft Excel. On the Sort panel, the user can choose the columns to use when sorting the rows. One can sequentially sort by several columns each of which can contain different sets of categories. Columns to be sorted by are selected by the pull down boxes which list all valid columns for sorting. The rows can be sorted in either Ascending or Descending order by choosing the appropriate radio button. Depending upon the data type the user can sort Alphabetically, Numerically, chronologically (Date / Time), or by Index. In the above panel, we have set the parameters to sort the data first in descending order using the keyword categories and second by the accession numbers in ascending order. This sorting produces the following reordered feature table.
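
The two-level sort used in this example (descending by keyword category, then ascending by accession number) can be sketched in Python. The row layout and the function name are hypothetical; the sketch relies on Python's sort being stable, so sorting by the secondary key first and the primary key second gives the combined ordering.

```python
def sort_rows(rows):
    """Sort by category (descending), breaking ties by accession (ascending)."""
    by_accession = sorted(rows, key=lambda r: r["accession"])
    # The stable sort keeps the accession order within each category.
    return sorted(by_accession, key=lambda r: r["category"], reverse=True)
```

Rows in category 0 (those that matched no keyword) end up at the bottom of the table under this ordering.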

Clicking on the Apply button in the above panel causes GeneCluster to reorder the color gram with a color bar displaying the category assignments as shown in the figure below.

Once the rows have been sorted, the user can easily create a new data set with just the sorted rows or save the color gram image by going to the Export Data tab. The color legend for the sorted categories can be viewed by clicking on the Legends tab. Column sorting (from clicking on the Sort by columns… button) behaves similarly to the row sorting operation. For example, clicking on the Sort by columns… button pops up the following panel with a list of sample names.

Here we can click on the Categorize column… button, which again pops up a Categorize window similar to the Categorize window for rows. This time, however, we will use the Manual categorize tab operations. Under the Manual categorize operations, the user can type in categories for each object or, as we did in this case, import a cls file by clicking on the Import CLS… button. In this case, the outcome cls file was imported, but data from any cls style file with any number of classes could be imported to display categories for samples. Colors can also be assigned to categories here. We assigned the color green to samples from cured patients and the color red to samples from fatal / refractory patients. As an alternative to importing a cls file or typing in all of the category labels, the user can copy a range of cells from an Excel worksheet and paste them into the category column using the <CTRL>+V shortcut key.

Once the categories have been manually assigned, the Add Column button can be clicked to add the column to the sort data set panel as shown below.

Since the samples are already sorted according to the cured versus fatal / refractory distinction, there is no need to re-sort based upon the sample_outcome column. We can just apply the categories to the color gram by clicking on the Apply button. This produces the following color gram display.

The Color Gram Browser panel with the Grids and Grid Sizes tab panel selected is shown below.

The Grids and Grid Sizes tab panel allows the user to make changes to the appearance of the color gram image. Within the Grids and Grid Sizes tab panel, the user can change the size of the color gram with the Adjust the size of the cells slider, change the outline color for selected cells in the Selection Color subpanel, and change whether grids are displayed and the color of the grid in the Show Grids subpanel.

The Legends tab panel of the Color Gram Browser will display a color legend for the color gram and is exported with the color gram when it is saved. The Cell Information tab panel of the Color Gram Browser will display the name and description of the row and column and the expression value for the cell that the cursor is currently pointing to. 

View Clusters

Once a data set has been clustered with the Find Classes algorithms, the results can be viewed with the View Clusters panel. This is done by selecting the View Clusters... item from the Data Views pull down menu.

To display a set of clustering results, one selects a set of clustering results from the Find Classes panel using the Clustering pull down menu and then clicking the Compute View button. This view will have a View Type format of Data Means, SOM Centers, or Both where the type is chosen from the View Type pull down menu. Once a view has been computed, it can be selected from the Cluster View pull down menu at the top of the panel. Details about the clusters are contained in the middle of the panel. A particular cluster can be selected for viewing by clicking on one of the plots on the left side of the panel (which will then be highlighted by a yellow box around it). The right side of the panel will then display the samples contained in the cluster and their distance from the cluster centroid.

Cluster Summary

Once a data set has been clustered with the Find Classes algorithms, summaries of the clustering results can be viewed with the Cluster Summary panel. This is done by selecting the Cluster Summary... item from the Data Views pull down menu.

This panel shows a summary of the clustering results for each of GeneCluster's SOM Clusters objects, which are produced by running the Find Classes algorithm. The first two columns show the geometry of nodes for the clusters in the feature space. The Variance Explained column indicates how much of the sample variance can be explained by organizing the samples into the discovered clusters. The next three columns give the average, minimum, and maximum separation between clusters.

Prediction Results

The Prediction Results panel is used to view results from testing a predictor where the results can come from either a separate test set (using the Apply Predictor panel) or a cross-validation prediction (from the Build Predictor panel). The Prediction Results panel is started by selecting the Prediction Results... item from the Data Views pull down menu.

The prediction results and the model used to obtain them are summarized in the top portion of the panel. Details of the prediction results for each of the samples are shown in the table at the bottom of this panel. The first column in the table gives the sample name (Datapoint). The second column gives the class number for the Predicted Class. The third column gives the Confidence associated with the predicted class (see the algorithms subsection in the Build Predictor section for a description of its calculation). The fourth column gives the True Class from the class template file, and the fifth column displays an asterisk (*) wherever there is an Error in the prediction and NC wherever no call is made by the predictor. The confidence threshold for making a call (and, potentially, the resulting errors) can be adjusted by entering a number into the No Calls Threshold box and pressing the Update Calls button. A set of prediction results is selected for display by choosing an item from the pull down menu above the Datapoint column or by stepping through the prediction results objects list with the left (<-) and right (->) arrows. The results from the displayed prediction can be saved by clicking the Save... button, which brings up a save dialog box. The confusion matrix for the prediction, a table that summarizes the predicted classes against the true classes, can be displayed by clicking the Confusion Matrix... button. Clicking the Save All... button causes all of the prediction objects to be saved together in a file.
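
The effect of the No Calls Threshold and the construction of the confusion matrix can be sketched in Python. The function names and the tuple layout are hypothetical, and two behaviors are assumptions: a prediction becomes a no call when its confidence is strictly below the threshold, and no calls are left out of the confusion matrix.

```python
from collections import Counter


def apply_no_call_threshold(results, threshold):
    """results: (sample, predicted, confidence, true_class) tuples.

    Returns the same rows with a fifth column: "NC", "*" (error), or "".
    """
    table = []
    for sample, predicted, confidence, true_class in results:
        if confidence < threshold:
            flag = "NC"   # assumption: strictly below the threshold is a no call
        elif predicted != true_class:
            flag = "*"
        else:
            flag = ""
        table.append((sample, predicted, confidence, true_class, flag))
    return table


def confusion_matrix(table):
    """Count (true class, predicted class) pairs, ignoring no calls."""
    return Counter(
        (true_class, predicted)
        for _, predicted, _, true_class, flag in table
        if flag != "NC"
    )
```

Raising the threshold can only turn errors and correct calls into no calls, which is why adjusting it changes the number of reported errors.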

Feature Summary

The Feature Summary panel is used to view lists of features used to build predictors using cross validation. The Feature Summary panel is started by selecting the Feature Summary... item from the Data Views pull down menu.

The first column shows a list of features using the names from the loaded data set. Each of the remaining columns corresponds to one of the samples in the input data set. Below each sample, a non-zero number in a particular row means that row's feature was used in the cross-validation model used for testing the column's sample. The number itself represents the calculated signal-to-noise ratio for that feature in the cross-validation training set. The pull down box on the upper right of the panel allows the user to select the set of feature results to display (corresponding to each predictor). When performing a type of cross-validation testing other than leave-one-out (i.e., 4-fold or 10-fold), the signal-to-noise values and features will be displayed only for the first sample of the cross validation test set (the other samples in the cross validation test set will have all zeros displayed in their columns).


Lymphoma Outcome Prediction Example

This section gives an example that uses GeneCluster to predict treatment outcome for a set of diffuse large B-cell lymphoma samples. This example uses publicly available data associated with the paper: Shipp et al., "Diffuse large B-cell lymphoma outcome prediction by gene-expression profiling and supervised machine learning," Nature Medicine, vol. 8, no. 1, January 2002, pp. 68-74. The data can be obtained from the supplemental information web site http://www-genome.wi.mit.edu/cgi-bin/cancer/publications/pub_paper.cgi?mode=view&paper_id=66 and the paper can be obtained from http://www.nature.com/cgi-taf/dynapage.taf?file=/nm/journal/v8/n1/index.html. Begin by downloading the RES and CLS files from the supplemental web site (http://www-genome.wi.mit.edu/mpr/publications/projects/Lymphoma/lymphoma_8_lbc_outcome_rn.res and http://www-genome.wi.mit.edu/mpr/publications/projects/Lymphoma/lymphoma_8_lbc_outcome.cls respectively) and saving them to a directory accessible by your local machine. Once the GeneCluster application has been started, you need to load the raw expression data contained in the RES file and the class labels in the CLS file. Both of these files can be opened through the file open dialog that is obtained by selecting the Open... item from the File pull down menu on GeneCluster's main panel. Selecting the Open... item from the File pull down menu produces the following file Open window:

Use the Look in pull down menu in the Open window to navigate to the directory where you saved the lymphoma_8_lbc_outcome_rn.res and lymphoma_8_lbc_outcome.cls files, select the files by clicking on their names, and open them by pressing the Open button. After the files have been opened, you need to preprocess the data. For this example, we first added thresholds of a minimum expression value of 20 and a maximum of 16,000 by entering 20 into the Clip Values to Min box and 16000 in the AND Max box and clicking the Add button next to it. Next, we added variation filters that filtered out rows with less than 3-fold variation and 100 absolute difference by entering 3 into the Keep if Max/Min >= box and 100 into the Max-Min >= box and clicking the Add button next to it. Finally, we added a transpose operation to the preprocessing so that genes are in columns and samples are in rows by clicking on the Add button next to Transpose Dataset (reset panels). The selected preprocessing operations are carried out by pressing the Apply Filters button. The resultant data set has 58 rows by 6149 columns and can be referenced by the name lymphoma__th_exc_tr_0 in other panels.
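
The three preprocessing steps used in this example (thresholding, variation filtering, and transposing) can be sketched in Python. The function name and parameter names are hypothetical, and requiring a row to pass both the fold and the absolute difference tests in order to be kept is an assumption about how the two filter boxes combine.

```python
def preprocess(rows, min_value=20, max_value=16000, min_fold=3, min_diff=100):
    """rows: one list of expression values per gene (genes in rows).

    Returns the filtered data transposed so that samples are in rows.
    """
    # Clip every value into [min_value, max_value].
    clipped = [[min(max(v, min_value), max_value) for v in row] for row in rows]
    # Keep a gene only if Max/Min >= min_fold and Max-Min >= min_diff
    # (assumption: both conditions must hold).
    kept = [
        row for row in clipped
        if max(row) / min(row) >= min_fold and max(row) - min(row) >= min_diff
    ]
    # Transpose: genes move to columns, samples to rows.
    return [list(column) for column in zip(*kept)]
```

Because the values are clipped to a positive minimum before filtering, the Max/Min ratio is always defined.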

The next step is to use supervised learning available from GeneCluster's Build Predictor panel. Supervised learning is started by selecting the Build Predictor... item from the Data Analysis pull down menu.

This example will demonstrate how to use the Build Predictor tool in GeneCluster to build a thirteen gene weighted voting based predictor that uses a mean-based signal-to-noise and tests the model using leave-one-out cross-validation. When you open the Build Predictor panel, the last preprocessed data set (or opened data file) and last opened CLS file will be displayed in the Inputs section of the panel. If a different Dataset or Class Vector is displayed, use the pull down menus to select the preprocessed lymphoma data set and the lymphoma CLS file. In the Feature Selection section of the panel, set the number of features to 13 and the Class Estimate pull down menu to Mean. In the Generation section of the panel, choose the Validate tab and the Cross Validate radio button. In the Cross Validate pull down menu, choose Leave 1 Out for leave-one-out cross-validation. In the Algorithm section of the panel, choose the Weighted Voting tab. To build the model, click on the Run button on the bottom of the panel. The blue bar on the bottom of the panel will display the percent completion for the run. When the build is complete, the results can be displayed by opening the Prediction Results panel. The Prediction Results panel is started by selecting the Prediction Results... item from the Data Views pull down menu.

The prediction results panel will show the cross-validation prediction results. Make sure you are showing the correct set of results by selecting the results object from the pull down menu that displays xval_13_WeightedVoting_0 in the example above. The name for the result set comes from the Cross Validation Base Results Name box in the Build Predictor window, followed by the number of features, followed by the model type, and finally followed by a unique index number. This predictor in the lymphoma data set made a total of 14 errors, and the samples with incorrect predictions can be seen in the table at the bottom of the panel. The user can also see the features used to build the predictor by opening the Feature Summary panel. Selecting the Feature Summary... item from the Data Views pull down menu starts the Feature Summary panel.

The features used to build the cross validation model are shown in this table. The features are ordered by the number of cross-validation models that use the particular feature (with greatest number first). The first seven genes for the lymphoma distinction are used in all 58 cross-validation models. This table shows the features used for each of the cross-validation models and the associated signal-to-noise ratios for each of the genes. (A summary of the frequencies that each gene gets used is contained in a file called feat.txt that gets written into GeneCluster's home directory every time a predictor is built (so it will be overwritten anytime a new model is built).)

Glossary
range: a list of numbers separated by commas and dashes as follows:

  • 3-5: the list contains (3, 4, 5)
  • 2,5,7: the list contains (2, 5, 7)
  • 2-5,8,10-13: the list contains (2, 3, 4, 5, 8, 10, 11, 12, 13)

Further Information

GeneCluster applications are described in Cancer Program publications. The current Java version was engineered by Keith Ohm and Michael Angelo.
Last modified: Tue Dec 27 15:16:47 EST 2005  