
Benchmarking Cluster Analysis Methods
Here, 27 clustering algorithms are evaluated on nine data sets of the Fundamental Clustering Problems Suite (FCPS) and on additional high-dimensional data sets. The project is finished, and the research is published in [Thrun/Ultsch, 2020].
The clustering algorithms used are cited in the reference list at the end of this page. For each algorithm and data set, 100 trials are computed. All data sets have unambiguously defined class labels, assigned either by domain experts or by construction. The error rate is defined as 1 - accuracy, where the accuracy (Eq. 3.1 on p. 29 in [Thrun, 2018]) is the proportion of data points that the clustering algorithm labels as true positives. Because the labels produced by a clustering algorithm are arbitrary, in every trial the permutation of cluster labels that maximizes the accuracy is chosen, as in the sketch below.
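A minimal R sketch of this best-permutation error rate follows; the function names BestAccuracy and ErrorRate are illustrative and not taken from any of the cited packages, and the sketch assumes that the number of clusters equals the number of classes:

# Accuracy of a clustering against the true classes, maximized over all
# permutations of the cluster labels (feasible for a small number of clusters).
BestAccuracy <- function(TrueCls, Cls) {
  ConfMat <- table(TrueCls, Cls)                # k x k contingency table
  Perms <- function(v) {                        # all permutations of a vector
    if (length(v) <= 1) return(list(v))
    Out <- list()
    for (i in seq_along(v)) {
      for (p in Perms(v[-i])) Out[[length(Out) + 1]] <- c(v[i], p)
    }
    Out
  }
  Best <- 0
  for (p in Perms(seq_len(ncol(ConfMat)))) {    # try every relabeling of the clusters
    Acc <- sum(diag(ConfMat[, p])) / sum(ConfMat)
    Best <- max(Best, Acc)
  }
  Best
}
ErrorRate <- function(TrueCls, Cls) 1 - BestAccuracy(TrueCls, Cls)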
Hepta
The Hepta data set [Ultsch, 2003a] is used to illustrate the general problems with quality measures (QMs) and projections from the perspective of structure preservation. The three-dimensional Hepta data set consists of seven clusters that are clearly separated by distance, one of which has a much higher density. The data set consists of 212 points, comprising seven clusters of thirty points each plus two additional points in the center cluster. The centroids of the clusters span the coordinate axes of R3. The density of the central cluster is almost twice as high as the density of the other six clusters. The structure of the data set is clearly defined by distances and is compact.
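As a usage illustration of the evaluation sketched above, and assuming the data sets are available in the form provided by the FCPS package on CRAN (one list per data set with the components Data and Cls), a single trial on Hepta could look as follows:

library(FCPS)                         # assumed data source: provides Hepta$Data and Hepta$Cls
data(Hepta)
set.seed(1)                           # one of the 100 trials
Trial <- kmeans(Hepta$Data, centers = 7)
ErrorRate(Hepta$Cls, Trial$cluster)   # error rate of this trial, using the sketch above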
Chainlink
The Chainlink data set [Ultsch, 1995; Ultsch et al., 1994] consists of two clusters in R3. Together, the two clusters form intertwined links of a chain and, therefore, cannot be separated by linear decision boundaries [Herrmann, 2011, pp. 99-100]. The rings are cohesive in R3; however, many projections are not. This data set serves as an excellent demonstration of several challenges facing projection methods: the data lie on two well-separated manifolds such that the global proximities contradict the local ones, in the sense that the center of each ring is closer to some elements of the other cluster than to elements of its own cluster. The two rings are intertwined in R3 and have the same average distances and densities. Every cluster contains 500 points.
Atom
The Atom data set [Ultsch, 2005c] consists of two clusters in R3. The first cluster is completely enclosed by the second one and, therefore, cannot be separated by linear decision boundaries. Additionally, both clusters have different densities and variances. The Atom data set consists of a dense core of 400 points surrounded by a well-separated but sparse hull of 400 points. The two clusters are not linearly separable, and many algorithms cannot construct a cohesive projection. The core is located in the center of the hull, which makes the data set hard to handle for some methods based on averaging. The density of the core is much higher than the density of the hull. For data in the hull, some of the inner-cluster distances are larger than the distances to the other cluster.
EngyTime
The EngyTime data set [Baggenstoss, 2002] contains 4,096 points belonging to two clusters in R2; the data set is typical for sonar applications, with the variables “Engy” and “Time” forming a two-dimensional mixture of Gaussians. The clusters overlap, and cluster borders can be defined only by using density information. There is no empty space between the clusters.
Lsun3D
The Lsun3D data set consists of three well-separated clusters and four outliers in R3; it is based on the two-dimensional Lsun data set of Moutarde and Ultsch [Moutarde/Ultsch, 2005]. Two of the clusters contain 100 points each, and the third contains 200 points. “The inter-cluster minimum distances, however, are in the same range as or even smaller than the inner-cluster mean distances” [Moutarde/Ultsch, 2005, p. 28]. The data set consists of 404 data points and was not preprocessed.
Target
The Target data set [Ultsch, 2005c] consists of two main clusters and four groups of three outliers each. The first main cluster is a sphere of 363 points, and the second cluster is a ring around the sphere consisting of 395 points. The data set as a whole consists of 770 points in R2. The main challenge of this data set is the four groups of outliers in the four corners.
Tetra
The Tetra data set, which is part of the FCPS, consists of 400 data points in four clusters in R3 that have large intracluster distances [Ultsch, 2005c]. The clusters are nearly touching each other, resulting in low intercluster distances.
TwoDiamonds
“The data consists of two clusters of two-dimensional points. Inside each “diamond” the values for each data point were drawn independently from uniform distributions” [Ultsch, 2003c, p. 8]. The clusters contain 300 points each. “[In] [e]ach cluster[, the] points are uniformly distributed within a square, and at one point the two squares almost touch. This data set is critical for clustering algorithms using only distances” [Moutarde/Ultsch, 2005, p. 28].
WingNut
“The Wing Nut dataset […] consists [of] two symmetric data subsets of 500 points each. Each of these subsets is an overlay of equal[ly] spaced points with a lattice distance of 0.2 and random points with a growing density in one corner. The data sets are mirrored and shifted such that the gap between the subsets is larger than 0.3. Although there is a bigger distance in between the subsets than within the data of a subset, clustering algorithms like Kmeans parameterized with the right number of clusters (k=2) produce classification errors” [Moutarde/Ultsch, 2005, pp. 27-28].
Breast cancer
The Breast Cancer data set was taken from the CRAN package “mlbench”. It consists of 9 nominally scaled features with 699 cases having either benign or malignant tumor cells. The samples arrived periodically as Dr. Wolberg reported his clinical cases [41]. Each variable except for the first was converted into 11 primitive numerical attributes with values ranging from 0 through 10. There are 16 missing attribute values, which were kNN imputed with k=7. Only cases with unique Ids were chosen, and a small amount of uniformly distributed noise was added to prevent distances from being zero. The robust normalization of Milligan and Cooper [Milligan/Cooper, 1988] was applied. The resulting data set had 645 cases, of which 413 are benign and 232 are malignant.
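A sketch of these preprocessing steps in R; the choice of the VIM package for the kNN imputation, the noise magnitude, and keeping the first case per duplicated Id are assumptions, since the source does not specify them:

library(mlbench)
library(VIM)                                         # assumed package for the kNN imputation

data(BreastCancer)
DF <- BreastCancer[!duplicated(BreastCancer$Id), ]   # keep only cases with unique Ids
X  <- data.matrix(DF[, 2:10])                        # the nine features as numeric codes
X  <- data.matrix(kNN(as.data.frame(X), k = 7, imp_var = FALSE))  # impute missing values
set.seed(1)
X  <- X + runif(length(X), min = 0, max = 1e-6)      # small uniform noise against zero distances
X  <- apply(X, 2, function(v) (v - min(v)) / (max(v) - min(v)))   # plain range normalization;
                                                     # a robust variant is sketched below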
Iris
“Anderson’s [Anderson, 1935] Iris data set was made famous by Fisher [Fisher, 1936], who used it to exemplify his linear discriminant analysis. It has since served to demonstrate the performance of many clustering algorithms” [G. Ritter, 2014, p. 220]. The Iris data set consists of data points in R4 with a prior classification and describes the geographic variation of Iris flowers. The data set consists of 50 samples from each of three species of Iris flowers, namely, Iris setosa, Iris virginica and Iris versicolor. Four features were measured for each sample: the lengths and widths of the sepals and petals (see [Herrmann, 2011, pp. 99-100]). The observations have “only two digits of precision preventing general position of the data” [G. Ritter, 2014, p. 220], and “observations 102 and 142 are even equal” [G. Ritter, 2014, p. 220]. The I. setosa cluster is well separated, whereas the I. virginica and I. versicolor clusters slightly overlap (see [Herrmann, 2011, pp. 99-100]). This presents “a challenge for any sensitive classifier” [G. Ritter, 2014, p. 220].
Leukemia
The anonymized leukemia data set consists of 12,692 gene expressions from 554 subjects and is available from a previous publication [Haferlach et al., 2010]. Each gene expression is a logarithmic luminance intensity (presence call), which was measured using Affymetrix technology. The presence calls are related to the number of specific RNAs in a cell, which signals how active a specific gene is. Of the subjects, 109 were healthy, 15 were diagnosed with acute promyelocytic leukemia (APL), 266 had chronic lymphocytic leukemia (CLL), and 164 had acute myeloid leukemia (AML). “The study design adhered to the tenets of the Declaration of Helsinki and was approved by the ethics committees of the participating institutions before its initiation” [Haferlach et al., 2010, p. 2530]. The leukemia data set was preprocessed, resulting in a high-dimensional data set with 7,747 variables and 554 data points separated into natural clusters, as determined by the illness status and defined by discontinuities (see chapter 2 of [Thrun, 2018]). Additionally, patient consent was obtained for the data set, in accordance with the Declaration of Helsinki, and the Marburg local ethics board approved the study (No. 138/16).
The algorithms model-based clustering (MoG) and ORCLUS were unable to process this data set. For QT clustering, no valid radius could be estimated. Due to the high dimensionality, many clustering algorithms compute their results very slowly; thus, only ten trials per algorithm are computed.
Swiss banknotes
“The idea is to produce bills at a cost substantially lower than the imprinted number. This calls for a compromise and forgeries are not perfect” [G. Ritter, 2014, pp. 223-224]. “If a bank note is suspect but refined, then it is sent to a money-printing company, where it is carefully examined with regard to printing process, type of paper, water mark, colors, composition of inks, and more. Flury and Riedwyl [Flury/Riedwyl, 1988] had the idea to replace the features obtained from the sophisticated equipment needed for the analysis with simple linear dimensions” [G. Ritter, 2014, p. 224]. The Swiss Banknotes data set consists of six variables measured on 100 genuine and 100 counterfeit old Swiss 1000-franc bank notes. The variables are the length of the bank note, the height of the bank note (measured on the left side), the height of the bank note (measured on the right side), the distance from the inner frame to the lower border, the distance from the inner frame to the upper border and the length of the diagonal. The robust normalization of Milligan and Cooper [Milligan/Cooper, 1988] is applied to prevent a few features from dominating the obtained distances.
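Milligan and Cooper found normalization by the feature range to work well for clustering; a robust variant replaces the minimum and maximum by low and high percentiles so that single outliers cannot dominate. A minimal sketch follows; the percentile value is an assumption, and Thrun's DataVisualizations package on CRAN provides a ready-made RobustNormalization function:

# Robust range normalization: map each feature approximately to [0, 1],
# using percentiles instead of the minimum and maximum.
RobustRangeNorm <- function(x, p = 0.001) {
  lo <- quantile(x, p)
  hi <- quantile(x, 1 - p)
  (x - lo) / (hi - lo)
}
# Example: 'Banknotes' is assumed to be the 200 x 6 matrix described above.
# BanknotesNorm <- apply(Banknotes, 2, RobustRangeNorm)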
Pan Cancer
The Pan-Cancer data set consists of 801 subjects with 20,531 randomly extracted gene expressions, and it is a part of the RNA-Seq (HiSeq) PANCAN data set, which is maintained by The Cancer Genome Atlas Pan-Cancer analysis project [39]. The data set was taken from the UCI Machine Learning Repository [40]. An Illumina HiSeq platform measured the RNA-Seq gene expression levels. The subjects have different types of tumor: BRCA (300 subjects), KIRC (146 subjects), COAD (78 subjects), LUAD (141 subjects) and PRAD (136 subjects). Gene expressions which were zero for all subjects were disregarded. The data set was decorrelated and robustly z-transformed. After preprocessing, the high-dimensional data set had 18,617 dimensions for 801 cases.
The algorithms model-based clustering (MoG) and ORCLUS were unable to process this data set. For QT clustering, no valid radius could be estimated. Due to the high dimensionality, many clustering algorithms compute their results very slowly; thus, only ten trials per algorithm are computed.
Wine
The Wine data set [Aeberhard et al., 1992] is a 13-dimensional, real-valued data set. It consists of chemical measurements of wines grown in the same region in Italy but derived from three different cultivars. The robust normalization of Milligan and Cooper [Milligan/Cooper, 1988] is applied to prevent a few features from dominating the obtained distances.
References
[Ultsch et al., 1994] Ultsch, A., Guimaraes, G., Korus, D., & Li, H.: Knowledge extraction from artificial neural networks and applications, Parallele Datenverarbeitung mit dem Transputer, pp. 148-162, Springer, 1994.
[Ultsch, 1995] Ultsch, A.: Self organizing neural networks perform different from statistical k-means clustering, Proc. Society for Information and Classification (GFKL), Vol. 1995, Basel 8th-10th March, 1995.
[Ultsch, 2003a] Ultsch, A.: Maps for the visualization of high-dimensional data spaces, Proc. Workshop on Self organizing Maps (WSOM), pp. 225-230, Kyushu, Japan, 2003.
[Ultsch, 2005c] Ultsch, A.: Clustering with SOM: U*C, Proc. 5th Workshop on Self-Organizing Maps (WSOM), Vol. 2, pp. 75-82, 2005.
[Thrun, 2018] Thrun, M. C.: Projection-Based Clustering through Self-Organization and Swarm Intelligence, Springer, Heidelberg, ISBN: 978-3658205393, 2018.
The following references correspond to the clustering algorithms used:
[Defays, 1977] Defays, D.: An efficient algorithm for a complete link method, The Computer Journal, Vol. 20(4), pp. 364-366, 1977.
[Dimitriadou, 2002] Dimitriadou, E.: cclust - Convex clustering methods and clustering indexes, R package, Version 0.6-21, 2002.
[Ester et al., 1996] Ester, M., Kriegel, H.-P., Sander, J., & Xu, X.: A density-based algorithm for discovering clusters in large spatial databases with noise, Proc. KDD, Vol. 96, pp. 226-231, 1996.
[Everitt et al., 2011] Everitt, B. S., Landau, S., Leese, M., & Stahl, D.: Hierarchical clustering, Cluster Analysis, 5th Edition, pp. 71-110, 2011.
[Florek et al., 1951] Florek, K., Łukaszewicz, J., Perkal, J., Steinhaus, H., & Zubrzycki, S.: Sur la liaison et la division des points d'un ensemble fini, Proc. Colloquium Mathematicae, Vol. 2, pp. 282-285, Institute of Mathematics Polish Academy of Sciences, 1951.
[Fraley/Raftery, 2002] Fraley, C., & Raftery, A. E.: Model-based clustering, discriminant analysis, and density estimation, Journal of the American Statistical Association, Vol. 97(458), pp. 611-631, 2002.
[Fraley/Raftery, 2006] Fraley, C., & Raftery, A. E.: MCLUST version 3: an R package for normal mixture modeling and model-based clustering, DTIC Document, 2006.
[Frey/Dueck, 2007] Frey, B. J., & Dueck, D.: Clustering by passing messages between data points, Science, Vol. 315(5814), pp. 972-976, 2007.
[Lance/Williams, 1966] Lance, G., & Williams, W.: A generalized sorting strategy for computer classifications, Nature, Vol. 212(5058), p. 218, 1966.
[Lance/Williams, 1967] Lance, G. N., & Williams, W. T.: A general theory of classificatory sorting strategies: 1. Hierarchical systems, The Computer Journal, Vol. 9(4), pp. 373-380, 1967.
[Linde et al., 1980] Linde, Y., Buzo, A., & Gray, R.: An algorithm for vector quantizer design, IEEE Transactions on Communications, Vol. 28(1), pp. 84-95, 1980.
[Martinetz et al., 1993] Martinetz, T. M., Berkovich, S. G., & Schulten, K. J.: 'Neural-gas' network for vector quantization and its application to time-series prediction, IEEE Transactions on Neural Networks, Vol. 4(4), pp. 558-569, 1993.
[McQuitty, 1966] McQuitty, L. L.: Similarity analysis by reciprocal pairs for discrete and continuous data, Educational and Psychological Measurement, Vol. 26(4), pp. 825-831, 1966.
[Murtagh/Legendre, 2014] Murtagh, F., & Legendre, P.: Ward’s hierarchical agglomerative clustering method: which algorithms implement Ward’s criterion?, Journal of Classification, Vol. 31(3), pp. 274-295, 2014.
[Ng et al., 2002] Ng, A. Y., Jordan, M. I., & Weiss, Y.: On spectral clustering: Analysis and an algorithm, Advances in Neural Information Processing Systems, Vol. 2, pp. 849-856, 2002.
[Rodriguez/Laio, 2014] Rodriguez, A., & Laio, A.: Clustering by fast search and find of density peaks, Science, Vol. 344(6191), pp. 1492-1496, 2014.
[Rousseeuw/Kaufman, 1990] Rousseeuw, P. J., & Kaufman, L.: Finding Groups in Data, Belgium, John Wiley & Sons Inc., ISBN: 0471735787, 1990.
[Sokal/Sneath, 1963] Sokal, R. R., & Sneath, P.: Principles of Numerical Taxonomy, London, Freeman, 1963.
[Sokal/Michener, 1958] Sokal, R. R., & Michener, C. D.: A statistical method for evaluating systematic relationships, University of Kansas Science Bulletin, Vol. 38, pp. 1409-1438, 1958.
[Van Dongen, 2000] Van Dongen, S. M.: Graph Clustering by Flow Simulation, PhD thesis, University of Utrecht, 2000.
[Ward Jr, 1963] Ward Jr, J. H.: Hierarchical grouping to optimize an objective function, Journal of the American Statistical Association, Vol. 58(301), pp. 236-244, 1963.
[Wehrens/Buydens, 2007] Wehrens, R., & Buydens, L. M.: Self- and super-organizing maps in R: the kohonen package, Journal of Statistical Software, Vol. 21(5), pp. 1-19, 2007.
[Aggarwal et al., 1999] Aggarwal, C. C., Wolf, J. L., Yu, P. S., Procopiuc, C., & Park, J. S.: Fast algorithms for projected clustering, Proc. ACM SIGMOD Record, Vol. 28, pp. 61-72, ACM, 1999.
[Aggarwal/Yu, 2000] Aggarwal, C. C., & Yu, P. S.: Finding generalized projected clusters in high dimensional spaces, Vol. 29, ACM, ISBN: 1581132174, 2000.
[Heyer et al., 1999] Heyer, L. J., Kruglyak, S., & Yooseph, S.: Exploring expression data: identification and analysis of coexpressed genes, Genome Research, Vol. 9(11), pp. 1106-1115, 1999.
[Scharl/Leisch, 2006] Scharl, T., & Leisch, F.: The stochastic QT-clust algorithm: evaluation of stability and variance on time-course microarray data, in Rizzi, A. & Vichi, M. (eds.), Proc. Computational Statistics (COMPSTAT), pp. 1015-1022, Physica Verlag, Heidelberg, Germany, 2006.
[Herrero et al., 2001] Herrero, J., Valencia, A., & Dopazo, J.: A hierarchical unsupervised growing neural network for clustering gene expression patterns, Bioinformatics, Vol. 17(2), pp. 126-136, 2001.
This research is an extension of [Thrun, 2018]; please cite it accordingly. The benchmarking itself is published in [Thrun/Ultsch, 2020].