TY - JOUR
T1 - Adaptive Graph K-Means
AU - Pei, Shenfei
AU - Sun, Yuanchen
AU - Nie, Feiping
AU - Jiang, Xudong
AU - Zheng, Zengwei
N1 - Publisher Copyright:
© 2024 Elsevier Ltd
PY - 2025/5
Y1 - 2025/5
N2 - Clustering large-scale datasets has received increasing attention recently. However, existing algorithms are still not efficient in scenarios with an extremely large number of clusters. To this end, Adaptive Graph K-Means (AGKM) is proposed in this work. Its idea originates from k-means, but it operates on an adaptive k-Nearest Neighbor (k-NN) graph instead of data features. First, AGKM is highly efficient for processing datasets where both the number of samples and the number of clusters are very large. Specifically, the time and space complexity are both linear w.r.t. the number of samples and, more importantly, independent of the number of clusters. Second, AGKM is designed for balanced clusters. This constraint is realized by adding a regularization term to the loss function and a simple modification of the graph in the optimization algorithm, which does not increase the computational burden. Finally, the indicator and dissimilarity matrices are learned simultaneously, so the proposed AGKM obtains the final partition directly with higher efficacy and efficiency. Experiments on several datasets validate the advantages of AGKM. In particular, over 29X and 46X speed-ups with respect to k-means are observed on the two large-scale datasets WebFace and CelebA, respectively.
AB - Clustering large-scale datasets has received increasing attention recently. However, existing algorithms are still not efficient in scenarios with an extremely large number of clusters. To this end, Adaptive Graph K-Means (AGKM) is proposed in this work. Its idea originates from k-means, but it operates on an adaptive k-Nearest Neighbor (k-NN) graph instead of data features. First, AGKM is highly efficient for processing datasets where both the number of samples and the number of clusters are very large. Specifically, the time and space complexity are both linear w.r.t. the number of samples and, more importantly, independent of the number of clusters. Second, AGKM is designed for balanced clusters. This constraint is realized by adding a regularization term to the loss function and a simple modification of the graph in the optimization algorithm, which does not increase the computational burden. Finally, the indicator and dissimilarity matrices are learned simultaneously, so the proposed AGKM obtains the final partition directly with higher efficacy and efficiency. Experiments on several datasets validate the advantages of AGKM. In particular, over 29X and 46X speed-ups with respect to k-means are observed on the two large-scale datasets WebFace and CelebA, respectively.
KW - Clustering
KW - Computational efficiency
KW - Graph-based
KW - k-means
KW - Machine learning
UR - http://www.scopus.com/inward/record.url?scp=85211575636&partnerID=8YFLogxK
U2 - 10.1016/j.patcog.2024.111226
DO - 10.1016/j.patcog.2024.111226
M3 - Article
AN - SCOPUS:85211575636
SN - 0031-3203
VL - 161
JO - Pattern Recognition
JF - Pattern Recognition
M1 - 111226
ER -