MaskRecon: High-quality human reconstruction via masked autoencoders using a single RGB-D image

Xing Li, Yangyu Fan, Zhe Guo, Zhibo Rao, Yu Duan, Shiya Liu

Research output: Contribution to journal › Article › Peer-reviewed

Abstract

In this paper, we explore reconstructing high-quality clothed 3D humans from a single RGB-D image, assuming that a virtual human can be represented by its front-view and back-view depths. Due to the scarcity of captured real RGB-D human images, we employ rendered images to train our method. However, rendered images lack backgrounds and exhibit significant depth variation along silhouettes, leading to inaccurate and noisy shape predictions. To mitigate this issue, we introduce a pseudo-multi-task framework that incorporates a Conditional Generative Adversarial Network (CGAN) to infer back-view RGB-D images and a self-supervised Masked Autoencoder (MAE) to capture latent structural information of the human body. Additionally, we propose a Multi-scale Feature Fusion (MFF) module to effectively merge structural information and conditional features at various scales. Our method surpasses many existing techniques, as demonstrated through evaluations on the THuman, RenderPeople, and BUFF datasets. Notably, our approach excels at reconstructing high-quality human models, even under challenging conditions such as complex poses and loose clothing, on both rendered and real-world images. Code is available at https://github.com/Archaic-Atom/MaskRecon.
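To make the fusion step more concrete, below is a minimal, hypothetical PyTorch sketch of how an MFF-style block could merge structural (MAE) and conditional (CGAN) feature maps at one scale while propagating a coarser-scale result. The class name `MFFBlock`, channel counts, and layer choices are illustrative assumptions, not the authors' implementation; see the linked repository for the actual code.

```python
# Hypothetical sketch of a Multi-scale Feature Fusion (MFF) block.
# Names, shapes, and layers are assumptions for illustration only.
import torch
import torch.nn as nn
import torch.nn.functional as F

class MFFBlock(nn.Module):
    """Fuses a structural (MAE) feature map with a conditional (CGAN)
    feature map at one scale, then merges in the fused result carried
    over from the next-coarser scale."""
    def __init__(self, channels: int):
        super().__init__()
        self.fuse = nn.Sequential(
            nn.Conv2d(2 * channels, channels, kernel_size=3, padding=1),
            nn.BatchNorm2d(channels),
            nn.ReLU(inplace=True),
        )
        self.merge_coarse = nn.Conv2d(2 * channels, channels, kernel_size=1)

    def forward(self, structural, conditional, coarser=None):
        # Concatenate the two branches channel-wise and fuse them.
        x = self.fuse(torch.cat([structural, conditional], dim=1))
        if coarser is not None:
            # Upsample the coarser-scale fusion to this resolution and merge.
            up = F.interpolate(coarser, size=x.shape[-2:],
                               mode="bilinear", align_corners=False)
            x = self.merge_coarse(torch.cat([x, up], dim=1))
        return x

# Usage: fuse feature pairs at three resolutions, coarse to fine.
blocks = [MFFBlock(64) for _ in range(3)]
feats_mae = [torch.randn(1, 64, s, s) for s in (16, 32, 64)]
feats_cgan = [torch.randn(1, 64, s, s) for s in (16, 32, 64)]
fused = None
for blk, s, c in zip(blocks, feats_mae, feats_cgan):
    fused = blk(s, c, fused)
```

The coarse-to-fine loop mirrors the common pattern for multi-scale fusion: each finer scale sees both branch features at its own resolution plus the upsampled summary of everything coarser.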

Original language: English
Article number: 128487
Journal: Neurocomputing
Volume: 609
DOI
Publication status: Published - 7 Dec 2024
