Point-M2AE: Multi-scale Masked Autoencoders for Hierarchical Point Cloud Pre-training

Renrui Zhang; Ziyu Guo; Rongyao Fang; Bin Zhao; Dong Wang; Yu Qiao; Hongsheng Li; Peng Gao

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution › peer-review

126 Scopus citations

Abstract

Masked Autoencoders (MAE) have shown great potential in self-supervised pre-training for language and 2D image transformers. However, how to exploit masked autoencoding for learning 3D representations of irregular point clouds remains an open question. In this paper, we propose Point-M2AE, a strong Multi-scale MAE pre-training framework for hierarchical self-supervised learning of 3D point clouds. Unlike the standard transformer in MAE, we modify the encoder and decoder into pyramid architectures to progressively model spatial geometries and capture both fine-grained and high-level semantics of 3D shapes. For the encoder, which downsamples point tokens by stages, we design a multi-scale masking strategy to generate consistent visible regions across scales, and adopt a local spatial self-attention mechanism during fine-tuning to focus on neighboring patterns. Through multi-scale token propagation, the lightweight decoder gradually upsamples point tokens with complementary skip connections from the encoder, which further promotes reconstruction from a global-to-local perspective. Extensive experiments demonstrate the state-of-the-art performance of Point-M2AE for 3D representation learning. With a frozen encoder after pre-training, Point-M2AE achieves 92.9% accuracy with a linear SVM on ModelNet40, even surpassing some fully trained methods. When fine-tuned on downstream tasks, Point-M2AE achieves 86.43% accuracy on ScanObjectNN, 3.36% above the second-best method, and the hierarchical pre-training scheme substantially benefits few-shot classification, part segmentation, and 3D object detection. Code is available at https://github.com/ZrrSkywalker/Point-M2AE.
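
The phrase "consistent visible regions across scales" can be illustrated with a small NumPy sketch. This is only one plausible reading of the abstract, not the authors' implementation (see the linked repository for that). The function names `knn_indices` and `multiscale_visibility`, the strided subsampling used in place of a proper point-sampling step, and the values of `k` and `mask_ratio` are all assumptions made for illustration. The idea shown: tokens are masked at random at the coarsest scale, and visibility is then back-projected so that a finer-scale token is visible only if it belongs to the neighborhood of a visible coarser token.

```python
import numpy as np


def knn_indices(centers, points, k):
    """For each center, return the indices of its k nearest points (brute force)."""
    d2 = ((centers[:, None, :] - points[None, :, :]) ** 2).sum(axis=-1)
    return np.argsort(d2, axis=1)[:, :k]


def multiscale_visibility(points_per_scale, k, mask_ratio, rng):
    """Build one boolean visibility mask per scale (hypothetical helper).

    points_per_scale: list of (N_s, 3) arrays ordered finest first, coarsest last.
    Returns a list of boolean arrays in the same order (True = visible).
    """
    n_coarse = len(points_per_scale[-1])
    n_visible = max(1, int(round(n_coarse * (1.0 - mask_ratio))))

    # Random masking happens only once, at the coarsest scale.
    visible = np.zeros(n_coarse, dtype=bool)
    visible[rng.choice(n_coarse, size=n_visible, replace=False)] = True
    masks = [visible]

    # Back-project visibility from each coarser scale to the next finer one.
    for s in range(len(points_per_scale) - 2, -1, -1):
        finer, coarser = points_per_scale[s], points_per_scale[s + 1]
        nbrs = knn_indices(coarser, finer, k)       # shape (N_coarser, k)
        finer_visible = np.zeros(len(finer), dtype=bool)
        finer_visible[nbrs[masks[0]].ravel()] = True  # neighbors of visible tokens
        masks.insert(0, finer_visible)
    return masks


# Toy three-scale hierarchy; strided subsampling is a stand-in (assumption)
# for whatever sampling scheme a real pipeline would use.
rng = np.random.default_rng(0)
fine = rng.standard_normal((64, 3))
mid = fine[::4]
coarse = mid[::4]
masks = multiscale_visibility([fine, mid, coarse], k=4, mask_ratio=0.75, rng=rng)
```

Because masking is drawn once and only propagated downward, the visible region at every finer scale is spatially nested around the visible coarse tokens, which is the consistency property the abstract refers to.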

Original language	English
Title of host publication	Advances in Neural Information Processing Systems 35 - 36th Conference on Neural Information Processing Systems, NeurIPS 2022
Editors	S. Koyejo, S. Mohamed, A. Agarwal, D. Belgrave, K. Cho, A. Oh
Publisher	Neural information processing systems foundation
ISBN (Electronic)	9781713871088
State	Published - 2022
Externally published	Yes
Event	36th Conference on Neural Information Processing Systems, NeurIPS 2022 - New Orleans, United States
Duration	28 Nov 2022 → 9 Dec 2022

Publication series

Name	Advances in Neural Information Processing Systems
Volume	35
ISSN (Print)	1049-5258

Conference

Conference	36th Conference on Neural Information Processing Systems, NeurIPS 2022
Country/Territory	United States
City	New Orleans
Period	28/11/22 → 9/12/22

Cite this

Zhang, R., Guo, Z., Fang, R., Zhao, B., Wang, D., Qiao, Y., Li, H., & Gao, P. (2022). Point-M2AE: Multi-scale Masked Autoencoders for Hierarchical Point Cloud Pre-training. In S. Koyejo, S. Mohamed, A. Agarwal, D. Belgrave, K. Cho, & A. Oh (Eds.), Advances in Neural Information Processing Systems 35 - 36th Conference on Neural Information Processing Systems, NeurIPS 2022 (Advances in Neural Information Processing Systems; Vol. 35). Neural information processing systems foundation.

Zhang, Renrui ; Guo, Ziyu ; Fang, Rongyao et al. / Point-M2AE : Multi-scale Masked Autoencoders for Hierarchical Point Cloud Pre-training. Advances in Neural Information Processing Systems 35 - 36th Conference on Neural Information Processing Systems, NeurIPS 2022. editor / S. Koyejo ; S. Mohamed ; A. Agarwal ; D. Belgrave ; K. Cho ; A. Oh. Neural information processing systems foundation, 2022. (Advances in Neural Information Processing Systems).

@inproceedings{c77af650d29447a58ce4a0110e7f857a,

title = "Point-M2AE: Multi-scale Masked Autoencoders for Hierarchical Point Cloud Pre-training",

abstract = "Masked Autoencoders (MAE) have shown great potential in self-supervised pre-training for language and 2D image transformers. However, how to exploit masked autoencoding for learning 3D representations of irregular point clouds remains an open question. In this paper, we propose Point-M2AE, a strong Multi-scale MAE pre-training framework for hierarchical self-supervised learning of 3D point clouds. Unlike the standard transformer in MAE, we modify the encoder and decoder into pyramid architectures to progressively model spatial geometries and capture both fine-grained and high-level semantics of 3D shapes. For the encoder, which downsamples point tokens by stages, we design a multi-scale masking strategy to generate consistent visible regions across scales, and adopt a local spatial self-attention mechanism during fine-tuning to focus on neighboring patterns. Through multi-scale token propagation, the lightweight decoder gradually upsamples point tokens with complementary skip connections from the encoder, which further promotes reconstruction from a global-to-local perspective. Extensive experiments demonstrate the state-of-the-art performance of Point-M2AE for 3D representation learning. With a frozen encoder after pre-training, Point-M2AE achieves 92.9% accuracy with a linear SVM on ModelNet40, even surpassing some fully trained methods. When fine-tuned on downstream tasks, Point-M2AE achieves 86.43% accuracy on ScanObjectNN, 3.36% above the second-best method, and the hierarchical pre-training scheme substantially benefits few-shot classification, part segmentation, and 3D object detection. Code is available at https://github.com/ZrrSkywalker/Point-M2AE.",

author = "Renrui Zhang and Ziyu Guo and Rongyao Fang and Bin Zhao and Dong Wang and Yu Qiao and Hongsheng Li and Peng Gao",

note = "Publisher Copyright: {\textcopyright} 2022 Neural information processing systems foundation. All rights reserved.; 36th Conference on Neural Information Processing Systems, NeurIPS 2022 ; Conference date: 28-11-2022 Through 09-12-2022",

year = "2022",

language = "English",

series = "Advances in Neural Information Processing Systems",

publisher = "Neural information processing systems foundation",

editor = "S. Koyejo and S. Mohamed and A. Agarwal and D. Belgrave and K. Cho and A. Oh",

booktitle = "Advances in Neural Information Processing Systems 35 - 36th Conference on Neural Information Processing Systems, NeurIPS 2022",

}

Zhang, R, Guo, Z, Fang, R, Zhao, B, Wang, D, Qiao, Y, Li, H & Gao, P 2022, Point-M2AE: Multi-scale Masked Autoencoders for Hierarchical Point Cloud Pre-training. in S Koyejo, S Mohamed, A Agarwal, D Belgrave, K Cho & A Oh (eds), Advances in Neural Information Processing Systems 35 - 36th Conference on Neural Information Processing Systems, NeurIPS 2022. Advances in Neural Information Processing Systems, vol. 35, Neural information processing systems foundation, 36th Conference on Neural Information Processing Systems, NeurIPS 2022, New Orleans, United States, 28/11/22.

Point-M2AE: Multi-scale Masked Autoencoders for Hierarchical Point Cloud Pre-training. / Zhang, Renrui; Guo, Ziyu; Fang, Rongyao et al.
Advances in Neural Information Processing Systems 35 - 36th Conference on Neural Information Processing Systems, NeurIPS 2022. ed. / S. Koyejo; S. Mohamed; A. Agarwal; D. Belgrave; K. Cho; A. Oh. Neural information processing systems foundation, 2022. (Advances in Neural Information Processing Systems; Vol. 35).

TY - GEN

T1 - Point-M2AE

T2 - 36th Conference on Neural Information Processing Systems, NeurIPS 2022

AU - Zhang, Renrui

AU - Guo, Ziyu

AU - Fang, Rongyao

AU - Zhao, Bin

AU - Wang, Dong

AU - Qiao, Yu

AU - Li, Hongsheng

AU - Gao, Peng

PY - 2022

Y1 - 2022

AB - Masked Autoencoders (MAE) have shown great potential in self-supervised pre-training for language and 2D image transformers. However, how to exploit masked autoencoding for learning 3D representations of irregular point clouds remains an open question. In this paper, we propose Point-M2AE, a strong Multi-scale MAE pre-training framework for hierarchical self-supervised learning of 3D point clouds. Unlike the standard transformer in MAE, we modify the encoder and decoder into pyramid architectures to progressively model spatial geometries and capture both fine-grained and high-level semantics of 3D shapes. For the encoder, which downsamples point tokens by stages, we design a multi-scale masking strategy to generate consistent visible regions across scales, and adopt a local spatial self-attention mechanism during fine-tuning to focus on neighboring patterns. Through multi-scale token propagation, the lightweight decoder gradually upsamples point tokens with complementary skip connections from the encoder, which further promotes reconstruction from a global-to-local perspective. Extensive experiments demonstrate the state-of-the-art performance of Point-M2AE for 3D representation learning. With a frozen encoder after pre-training, Point-M2AE achieves 92.9% accuracy with a linear SVM on ModelNet40, even surpassing some fully trained methods. When fine-tuned on downstream tasks, Point-M2AE achieves 86.43% accuracy on ScanObjectNN, 3.36% above the second-best method, and the hierarchical pre-training scheme substantially benefits few-shot classification, part segmentation, and 3D object detection. Code is available at https://github.com/ZrrSkywalker/Point-M2AE.

UR - http://www.scopus.com/inward/record.url?scp=85148273586&partnerID=8YFLogxK

M3 - Conference contribution

AN - SCOPUS:85148273586

T3 - Advances in Neural Information Processing Systems

BT - Advances in Neural Information Processing Systems 35 - 36th Conference on Neural Information Processing Systems, NeurIPS 2022

A2 - Koyejo, S.

A2 - Mohamed, S.

A2 - Agarwal, A.

A2 - Belgrave, D.

A2 - Cho, K.

A2 - Oh, A.

PB - Neural information processing systems foundation

Y2 - 28 November 2022 through 9 December 2022

ER -

Zhang R, Guo Z, Fang R, Zhao B, Wang D, Qiao Y et al. Point-M2AE: Multi-scale Masked Autoencoders for Hierarchical Point Cloud Pre-training. In Koyejo S, Mohamed S, Agarwal A, Belgrave D, Cho K, Oh A, editors, Advances in Neural Information Processing Systems 35 - 36th Conference on Neural Information Processing Systems, NeurIPS 2022. Neural information processing systems foundation. 2022. (Advances in Neural Information Processing Systems).
