clValid               package:clValid               R Documentation

_V_a_l_i_d_a_t_e _C_l_u_s_t_e_r _R_e_s_u_l_t_s

_D_e_s_c_r_i_p_t_i_o_n:

     'clValid' reports validation measures for clustering results.  The
     function returns an object of class '"clValid"', which contains
     the clustering results in addition to the validation measures. 
     The validation measures fall into three general categories:
     "internal", "stability", and "biological".

_U_s_a_g_e:

     clValid(obj, nClust, clMethods = "hierarchical", validation = "stability",
             maxitems = 600, metric = "euclidean", method = "average",
             neighbSize = 10, annotation = "entrezgene", GOcategory = "all",
             goTermFreq = 0.05, ...)

_A_r_g_u_m_e_n_t_s:

     obj: Either a numeric matrix, a data frame, or an 'ExpressionSet'
          object.  Data frames must contain all numeric columns.  In
          all cases, the rows are the items to be clustered (e.g.,
          genes), and the columns are the samples.

  nClust: A numeric vector giving the numbers of clusters to be
          evaluated.  For example, 4:6 evaluates solutions with 4, 5,
          and 6 clusters.

clMethods: A character vector giving the clustering methods. Available
          options are "hierarchical", "kmeans", "diana", "fanny",
          "som", "model", "sota", "pam", "clara", and "agnes", with
          multiple choices allowed.

validation: A character vector giving the type of validation measures
          to use.  Available options are "internal", "stability", and
          "biological", with multiple choices allowed.  

maxitems: The maximum number of items (rows in matrix) which can be
          clustered.

  metric: The metric used to determine the distance matrix.  Possible
          choices are "euclidean", "correlation", and "manhattan".

  method: For hierarchical clustering ('hclust' and 'agnes'), the
          agglomeration method used.  Available choices are "ward",
          "single", "complete", and "average".

neighbSize: For internal validation, an integer giving the neighborhood
          size used for the connectivity measure.

annotation: For biological validation, either a character string naming
          the Bioconductor annotation package for mapping genes to GO
          categories, or a list with the names of the functional
          classes and the observations belonging to each class.

GOcategory: For biological validation, gives which GO categories to use
          for biological validation.  Can be one of "BP", "MF", "CC",
          or "all".

goTermFreq: For the BSI, what threshold frequency of GO terms to use
          for functional annotation.

     ...: Additional arguments to pass to the clustering functions.
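
     As a rough illustration of the 'metric' argument, the sketch below
     shows one plausible way the three distance choices could be
     computed between the rows of a matrix.  The helper 'dist_sketch'
     is hypothetical and is not part of 'clValid'; treating the
     "correlation" metric as one minus the Pearson correlation between
     rows is an assumption made here for illustration only.

```r
# Hypothetical helper (not part of clValid): distance between the ROWS of x.
dist_sketch <- function(x, metric = c("euclidean", "correlation", "manhattan")) {
  metric <- match.arg(metric)
  if (metric == "correlation") {
    # Assumption: correlation distance taken as 1 - Pearson correlation of rows
    as.dist(1 - cor(t(x)))
  } else {
    # "euclidean" and "manhattan" are handled directly by stats::dist
    dist(x, method = metric)
  }
}

# Toy data: rows are items, columns are samples
x_toy <- rbind(a = c(1, 2, 3), b = c(2, 4, 6), c = c(3, 2, 1))

d_euc <- as.matrix(dist_sketch(x_toy, "euclidean"))
d_cor <- as.matrix(dist_sketch(x_toy, "correlation"))
d_man <- as.matrix(dist_sketch(x_toy, "manhattan"))
```

     Rows 'a' and 'b' are proportional, so their correlation distance
     is zero even though their Euclidean distance is not; this is the
     usual reason to prefer a correlation-based metric when only the
     shape of an expression profile matters.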

_D_e_t_a_i_l_s:

     This function calculates validation measures for a given set of
     clustering algorithms and number of clusters.  A variety of
     clustering algorithms are available, including hierarchical,
     self-organizing maps (SOM), K-means, self-organizing tree
     algorithm (SOTA), and model-based. The available validation
     measures fall into the three general categories of "internal",
     "stability", and "biological".  A brief description of each
     measure is given below; for further details, refer to the package
     vignette and the references.

     *_I_n_t_e_r_n_a_l _m_e_a_s_u_r_e_s:* The internal measures include the
          connectivity, and Silhouette Width, and Dunn Index.  The
          connectivity indicates the degree of connectedness of the
          clusters, as determined by the k-nearest neighbors.  The
          'neighbSize' argument specifies the number of neighbors to
          use. The connectivity has a value between 0 and infinity and
          should be minimized. Both the Silhouette Width and the Dunn
          Index combine measures of compactness and separation of the
          clusters.  The Silhouette Width is the average of each
          observation's Silhouette value.  The Silhouette value
          measures the degree of confidence in a particular clustering
          assignment and lies in the interval [-1,1], with
          well-clustered observations having values near 1 and poorly
          clustered observations having values near -1.  See the
          'silhouette' function in package 'cluster' for more details. 
          The Dunn Index is the ratio of the smallest distance between
          observations not in the same cluster to the largest
          intra-cluster distance.  It has a value between 0 and
          infinity and should be maximized.
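
          The definitions of the Dunn Index and the connectivity given
          above can be sketched in base R as follows.  The functions
          'dunn_sketch' and 'connectivity_sketch' are hypothetical
          illustrations, not the package's own 'dunn' and
          'connectivity' functions, and the 1/j penalty for the j-th
          nearest neighbor follows the formulation in Handl et al.
          (2005) as an assumption.

```r
# Toy data (illustrative assumption): two groups of three points on a line
x_pts <- matrix(c(0, 1, 2, 10, 11, 12), ncol = 1)
cl <- c(1, 1, 1, 2, 2, 2)
d <- as.matrix(dist(x_pts))

# Dunn Index: smallest between-cluster distance / largest within-cluster distance.
# Assumes at least one cluster holds two or more observations.
dunn_sketch <- function(d, cl) {
  same <- outer(cl, cl, "==")          # TRUE where both items share a cluster
  off_diag <- row(d) != col(d)         # ignore each item's distance to itself
  min(d[!same]) / max(d[same & off_diag])
}

# Connectivity: for each item, add 1/j whenever its j-th nearest neighbor
# lies in a different cluster (j = 1, ..., neighbSize).
connectivity_sketch <- function(d, cl, neighbSize = 10) {
  n <- nrow(d)
  k <- min(neighbSize, n - 1)
  total <- 0
  for (i in seq_len(n)) {
    nn <- order(d[i, ])
    nn <- nn[nn != i][seq_len(k)]      # k nearest neighbors, excluding i
    penalty <- 1 / seq_len(k)
    total <- total + sum(penalty[cl[nn] != cl[i]])
  }
  total
}

dunn_value <- dunn_sketch(d, cl)
conn_2 <- connectivity_sketch(d, cl, neighbSize = 2)
conn_3 <- connectivity_sketch(d, cl, neighbSize = 3)
```

          With two neighbors every neighbor lies in the item's own
          cluster, so nothing is penalized; a third neighbor must come
          from the other group, which is why the connectivity grows
          with 'neighbSize' on this toy data.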

     *_S_t_a_b_i_l_i_t_y _m_e_a_s_u_r_e_s:* The stability measures are a special version
          of internal measures which evaluate the stability of a
          clustering result by comparing it with the clusters obtained
          by removing one column at a time. These measures include the
          average proportion of non-overlap (APN), the average distance
          (AD), the average distance between means (ADM), and the
          figure of merit (FOM).  The APN, AD, and ADM are all based on
          the cross-classification table of the original clustering
          with the clustering based on the removal of one column.  The
          APN measures the average proportion of observations not
          placed in the same cluster under both cases, while the AD
          measures the average distance between observations placed in
          the same cluster under both cases and the ADM measures the
          average distance between cluster centers for observations
          placed in the same cluster under both cases.  The FOM
          measures the average intra-cluster variance of the deleted
          column, where the clustering is based on the remaining
          (undeleted) columns.  In all cases the average is taken over
          all the deleted columns, and all measures should be
          minimized.  
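
          The leave-one-column-out idea behind the APN can be sketched
          as below.  The function 'apn_sketch' is a hypothetical
          illustration that uses 'hclust' and 'cutree' from package
          'stats'; it is not the package's 'stability' function, and
          the toy data are an assumption chosen so that the two groups
          stay separated whichever column is removed.

```r
# Hypothetical sketch of the average proportion of non-overlap (APN)
apn_sketch <- function(x, k, method = "average") {
  cluster <- function(m) cutree(hclust(dist(m), method = method), k)
  full <- cluster(x)                       # clustering on all columns
  n <- nrow(x)
  M <- ncol(x)
  non_overlap <- 0
  for (l in seq_len(M)) {
    del <- cluster(x[, -l, drop = FALSE])  # clustering with column l removed
    for (i in seq_len(n)) {
      in_full <- full == full[i]           # cluster mates of i, full data
      in_del <- del == del[i]              # cluster mates of i, column removed
      non_overlap <- non_overlap + 1 - sum(in_full & in_del) / sum(in_full)
    }
  }
  non_overlap / (n * M)                    # average over items and columns
}

# Toy data: 10 items (rows) in two well-separated groups, 4 samples (columns)
x_stab <- rbind(outer(1:5 * 0.1, rep(1, 4)),
                outer(10 + 1:5 * 0.1, rep(1, 4)))
apn_value <- apn_sketch(x_stab, k = 2)
```

          Because every item keeps the same cluster mates after each
          column is removed, the APN for this data is at its minimum;
          less stable clusterings give values closer to 1.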

     *_B_i_o_l_o_g_i_c_a_l _m_e_a_s_u_r_e_s:* There are two biological validation
          measures, the biological homogeneity index (BHI) and
          biological stability index (BSI).  The observations are
          typically taken to represent a `gene' (e.g., ORF, SAGE tag,
          affy ID).  The BHI measures the average proportion of gene
          pairs that are clustered together which have matching
          biological functional classes. The BSI is similar to the
          other stability measures, but inspects the consistency of
          clustering for genes with similar biological functionality.
          Each sample is removed one at a time, and the cluster
          membership for genes with similar functional annotation is
          compared with the cluster membership using all available
          samples.

          For biological validation, the user has two options. The
          first option is to explicitly specify the functional
          clustering of the genes via a named list. Each item in the
          list corresponds to a functional class, and contains a list
          of genes which are associated with that function. The second
          option is to specify the appropriate annotation package from
          Bioconductor (<URL: http://www.bioconductor.org>) and GO
          terms to determine the functional classes of the genes.  The
          second option requires the 'Biobase', 'annotate', and 'GO'
          packages from Bioconductor, in addition to the annotation
          package for the particular data type (these are not loaded
          automatically when 'clValid' is loaded).

          The 'GOcategory' options are "MF", "BP", "CC", or "all",
          corresponding to molecular function, biological process,
          cellular component, and all of the ontologies.
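
          To make the named-list form of 'annotation' and the BHI
          concrete, the sketch below computes the proportion of
          same-cluster gene pairs that share a functional class.  The
          function 'bhi_sketch', the gene names, and the class names
          are all hypothetical; ignoring unannotated genes and scoring
          clusters with fewer than two annotated genes as 0 are
          assumptions of this sketch, so results may differ from the
          package's 'BHI' function.

```r
# Hypothetical sketch of the biological homogeneity index (BHI).
# cl: named vector of cluster labels; annotation: named list of gene vectors.
bhi_sketch <- function(cl, annotation) {
  genes <- names(cl)
  # membership[g, f] is 1 when gene g belongs to functional class f
  membership <- vapply(annotation, function(f) genes %in% f,
                       logical(length(genes))) * 1
  rownames(membership) <- genes
  annotated <- rowSums(membership) > 0
  share <- tcrossprod(membership) > 0      # TRUE if two genes share a class
  per_cluster <- sapply(unique(cl), function(k) {
    idx <- which(cl == k & annotated)
    nk <- length(idx)
    if (nk < 2) return(0)                  # assumption: no pairs, score 0
    # ordered pairs i != j sharing a class, out of nk * (nk - 1) pairs
    (sum(share[idx, idx]) - nk) / (nk * (nk - 1))
  })
  mean(per_cluster)
}

# Toy clustering of six made-up genes into two clusters
cl_genes <- c(g1 = 1, g2 = 1, g3 = 1, g4 = 2, g5 = 2, g6 = 2)
# Named list: one element per functional class (g6 is left unannotated)
fc_toy <- list(classA = c("g1", "g2"), classB = c("g3", "g4", "g5"))
bhi_value <- bhi_sketch(cl_genes, fc_toy)
```

          In the first cluster only one of the three annotated pairs
          shares a class, while in the second cluster the single
          annotated pair does, so the index averages a low and a high
          per-cluster score.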

_V_a_l_u_e:

     'clValid' returns an object of class '"clValid"'.  See the help
     file for the class description.

_N_o_t_e:

     Unless the list of genes corresponding to functional classes is
     prespecified, biological clustering validation requires the
     'Biobase', 'annotate', and 'GO' packages from Bioconductor, in
     addition to the annotation package for your particular data
     type.  Please see <URL: http://www.bioconductor.org> for
     installation instructions.

     Further details of the validation measures and instructions for
     use can be found in the package vignette.

_A_u_t_h_o_r(_s):

     Guy Brock, Vasyl Pihur, Susmita Datta, Somnath Datta

_R_e_f_e_r_e_n_c_e_s:

     Datta, S. and Datta, S. (2003). Comparisons and validation of
     statistical clustering techniques for microarray gene expression
     data. Bioinformatics 19(4): 459-466.

     Datta, S. and Datta, S. (2006). Methods for evaluating clustering
     algorithms for gene expression data using a reference set of
     functional classes. BMC Bioinformatics 7:397.

     Handl, J., Knowles, J., and Kell, D. (2005). Computational cluster
     validation in post-genomic data analysis. Bioinformatics 21(15):
     3201-3212.

_S_e_e _A_l_s_o:

     For a description of the class 'clValid' and all available methods
     see 'clValidObj' or 'clValid-class'.

     For help on the clustering methods see 'hclust' and 'kmeans' in
     package 'stats'; 'agnes', 'clara', 'diana', 'fanny', and 'pam' in
     package 'cluster'; 'som' in package 'kohonen'; 'Mclust' in
     package 'mclust'; and 'sota' (in this package).

     For additional help on the validation measures see 'connectivity',
     'dunn', 'stability', 'BHI', and 'BSI'.

_E_x_a_m_p_l_e_s:

     data(mouse)

     ## internal validation
     express <- mouse[1:25,c("M1","M2","M3","NC1","NC2","NC3")]
     rownames(express) <- mouse$ID[1:25]
     intern <- clValid(express, 2:6, clMethods=c("hierarchical","fanny","model"),
                       validation="internal")

     ## view results
     summary(intern)
     optimalScores(intern)
     plot(intern)

     ## stability measures
     stab <- clValid(express, 2:6, clMethods=c("hierarchical","fanny","model"),
                     validation="stability")
     optimalScores(stab)
     plot(stab)

     ## biological measures
     ## first way - functional classes predetermined
     fc <- tapply(rownames(express),mouse$FC[1:25], c)
     fc <- fc[-match( c("EST","Unknown"), names(fc))]
     bio <- clValid(express, 2:6, clMethods=c("hierarchical","fanny","model"),
                    validation="biological", annotation=fc)
     optimalScores(bio)
     plot(bio)

     ## second way - using Bioconductor
     if(require("Biobase") && require("annotate") && require("GO") && require("moe430a")) {
       bio2 <- clValid(express, 2:6, clMethods=c("hierarchical","fanny","model"),
                       validation="biological",annotation="moe430a",GOcategory="all")
       optimalScores(bio2)
       plot(bio2)
     }

