Özet
Spectral clustering has been popular thanks to its ability to extract clusters of varying characteristics without using a parametric model in expense of high computational cost required for eigendecomposition of pairwise similarities. In order to utilize its advantages in large datasets where it is infeasible due to its computational burden, approximate spectral clustering (ASC) methods apply spectral clustering on a reduced set of points (data representatives) selected by sampling or quantization. This two-step approach (i.e. finding the representatives and then clustering them) brings new opportunities for precise similarity definition such as manifold based topological relations, data distribution within the Voronoi polyhedra of the representatives, and their geodesic distance information, which are often ignored in similarity definition for ASC. In this study, we propose geodesic based hybrid similarity criteria which enable the use of different types of information for accurate similarity representation in ASC. Despite the fact that geodesic concept has been widely used in clustering, our contribution is the unique way of representing data topology to form geodesic relations and jointly harnessing various information types including topology, distance and density. The proposed criteria are tested using both sampling (selective sampling) and quantization (neural gas and k-means++) approaches. Experiments on artificial datasets, well-known small/medium-size real datasets, and four large datasets (four remote-sensing images), with different types of clusters, show that the proposed geodesic based hybrid similarity criteria outperform traditional similarity criteria in terms of clustering accuracies and several cluster validity indices.
Orijinal dil | İngilizce |
---|---|
Sayfa (başlangıç-bitiş) | 1465-1477 |
Sayfa sayısı | 13 |
Dergi | Pattern Recognition |
Hacim | 48 |
Basın numarası | 4 |
DOI'lar | |
Yayın durumu | Yayınlandı - 1 Nis 2015 |
Bibliyografik not
Publisher Copyright:© 2014 Elsevier Ltd. All rights reserved.
Finansman
This study is funded by TUBITAK Career Grant No. 112E195 . Taşdemir is also supported by FP7 Marie Curie Career Integration Grant IAM4MARS . Appendix A There are many different methods to determine the validity of crisp clustering (where each data point belongs to only one cluster) based on a between-clusters separation measure and a within-cluster scatter measure [42,46,47,50–55] . In this study, we focus on commonly used Davies–Bouldin index (DBI) [47] , the generalized Dunn Index (GDI) [42] , together with the Silhouette width criterion (SWC) [44] (selected the best index in [56] ), the Calinski–Harabasz variance ratio criterion (CH-VRC) [45] (selected the best among thirty indices in [53] ), and the PBM index [46] . Being proposed mainly for direct clustering, these indices may not work well for evaluation of prototype based clustering such as ASC. A recent index ConnIndex [43] , tailored for prototype based clustering methods, is also used. These indices are briefly explained below. Based on a geometric interpretation of the cluster structures, the DBI is computed by averaging the ratio of the within-cluster scatter to the between-cluster separation over all clusters. The DBI is calculated with the distances between cluster centroids ( d b _ cent ) and average distances of data points to their cluster centroid ( d w _ cent ) : (A.1) DBI = 1 K ∑ k = 1 K max m ( d w _ cent ( C k ) + d w _ cent ( C m ) d b _ cent ( C k , C m ) ) The Silhouette width criterion (SWC) [44] is based on the average distance of a data point v i to those samples within its own cluster ( d avg _ i ) and on the minimum distance of v i to those samples in other clusters ( d b _ i ). The SWC averages these values over all N samples: (A.2) SWC = 1 N ∑ i = 1 N d b _ i − d avg _ i max ( d b _ i , d avg _ i ) The variance ratio criterion of Calinski and Harabasz [45] (CH-VRC) is constructed based on the sum of squared distances between cluster centroids to the geometric center of all data points (BGSS: between group sum of squares), and on the sum of squared distances between each data point and its respective cluster centroid (WGSS: within group sum of squares): (A.3) CHVRC = BGSS / ( K − 1 ) WGSS / ( N − K ) The PBM [46] is constructed by a similar approach using three components: the average distance to the geometric center of all samples ( E 1 ); the sum of within-cluster distances of data points to their respective cluster centroid ( E K ); and the maximum distance between the centers of the K clusters ( D K ): (A.4) PBM = ( 1 K E 1 E K D K ) 2 The generalized Dunn index GDI [42] is calculated as (A.5) GDI = min m { min n { d b ( C m , C n ) max k d w ( C k ) } } where C m , C n and C k are clusters; d b is a between-cluster separation measure and d w is a within-cluster scatter measure based on user-set choices of distances. In this study, d b is selected as the centroid linkage (distance between centroids) and d w is selected as the average distance to cluster centroid. The ConnIndex [43] is constructed by two quantities: the intra-cluster connectivity ( Intra _ Conn ) as the within-cluster scatter and the complement of the inter-cluster connectivity, ( 1 − Inter _ Conn ), as the between-cluster separation measure: (A.6) ConnIndex = Intra _ Conn × ( 1 − Inter _ Conn ) On the one hand, Intra _ Conn is the average of intra-cluster connectivities which are the ratios of the number of data points in C k which have both their BMU and second BMU in C k to the total number of data points in C k . Namely, it shows the average ratio of the connections between the prototypes within the same cluster, producing a value in [0,1]. On the other hand, Inter _ Conn is the average of maximum inter-cluster connectivities of each cluster, InterC ( C k , C l ) , where InterC ( C k , C l ) is the ratio of the sum of the connectivity strengths between C k and C l , Conn ( C k , C l ) , to the sum of the connectivity strengths of those prototypes in C k which have at least one connection to a prototype in C l . Namely, let W k , l be the set of prototypes in C k which have at least one connection to a prototype in C l , then (A.7) InterC ( C k , C l ) = { 0 if W k , l = ∅ Conn ( C k , C l ) ∑ i , j { CONN ( i , j ) : w i ∈ W k , l } if W k , l ≠ ∅ Inter _ Conn ( C k , C l ) shows how similar the prototypes at the boundary of C k are to the ones at the boundary of C l in comparison to the similarity of the prototypes within C k . A greater Inter _ Conn ( C k , C l ) is an indication of a greater degree of similarity between C k and C l . Kadim Tasdemir received his B.S. degree from Bogazici University, Istanbul, Turkey, in 2001, his M.S. degree from Istanbul Technical University in 2004, and his Ph.D. degree from Rice University, Houston, TX, in 2008. Before joining AIU, he was a researcher at the European Commission Joint Research Centre (JRC, Institute for Environment and Sustainability) from 2009 to 2012, where he worked on automated control methods for monitoring agricultural resources using remote sensing imagery. Based on his research excellence and contribution, he received 2011 IES Best Young Scientist Award. Before JRC, he worked as an assistant professor at the Department of Computer Engineering, Yasar University, Izmir, Turkey, in 2008 and 2009. During 2003–2008, he was a research assistant at Rice University, where he developed visualization and clustering methods using neural computation for detailed knowledge discovery, sponsored by NASA Applied Information Systems Research Program. He was also awarded “Rice University Robert Patten Award” for his contributions to graduate life. During 2001–2003, he was a research assistant at Istanbul Technical University, where he worked on License Plate Recognition project. Dr. Tasdemir׳s research interests include detailed knowledge discovery from high-dimensional and large datasets (especially remote sensing imagery) using machine learning (self-organized learning in particular), data mining and pattern recognition. Recently, he received TUBITAK Career Grant for clustering large datasets and FP7 Marie Curie Career Integration Grant for automated methods in agriculture monitoring with remote sensing. He is a member of IAPR, IAPR-TC7 Remote Sensing and Mapping, and IAPR-TC15 Graph Based Representations, IEEE, IEEE Computational Intelligence Society and IEEE Geoscience and Remote Sensing Society. Berna Yalçin is a graduate student at the Department of Biomedical Engineering, Istanbul Technical University. She received her B.S. degree in 2011 from the Department of Electronics Engineering, Ankara University. Her research interests are unsupervised clustering, large datasets, bioinformatics, and pattern recognition. Her current research is funded by TUBITAK Carrier Grant no. 112E195. Isa Yildirim is an assistant professor at the Department of Electronics and Communication Engineering in Istanbul Technical University. He received his Ph.D. degree in 2009 in Electrical and Computer Engineering at the University of Illinois at Chicago, USA, and his M.Sc. and B.Sc. degrees in 2002 and 2004 in Satellite Communications and Remote Sensing Program and Electronics and Communication Engineering respectively at Istanbul Technical University, Turkey.
Finansörler | Finansör numarası |
---|---|
FP7 Marie Curie Career Integration | IAM4MARS |
NASA Applied Information Systems Research Program | |
TUBITAK | 112E195 |
TUBITAK Carrier | |
Rice University | |
FP7 People: Marie-Curie Actions |