Deep Non-Rigid Structure-From-Motion: A Sequence-to-Sequence Translation Perspective

Hui Deng, Tong Zhang, Yuchao Dai, Jiawei Shi, Yiran Zhong, Hongdong Li

Research output: Contribution to journal › Article › peer-review

Abstract

Directly regressing the non-rigid shape and camera pose from an individual 2D frame is ill-suited to the Non-Rigid Structure-from-Motion (NRSfM) problem. This frame-by-frame 3D reconstruction pipeline overlooks the inherent spatial-temporal nature of NRSfM, i.e., reconstructing the 3D sequence from the input 2D sequence. In this paper, we propose to solve deep sparse NRSfM from a sequence-to-sequence translation perspective, where the input 2D keypoint sequence is taken as a whole to reconstruct the corresponding 3D keypoint sequence in a self-supervised manner. First, we apply a shape-motion predictor to the input sequence to obtain an initial sequence of shapes and corresponding motions. Then, we propose the Context Layer, which enables the deep learning framework to effectively impose sequence-level constraints based on the structural characteristics of non-rigid sequences. The Context Layer builds modules that impose a self-expressiveness regularizer on non-rigid sequences, with multi-head attention (MHA) at their core, combined with temporal encoding; the two act jointly to constrain non-rigid sequences within the deep framework. Experimental results across different datasets, including Human3.6M, CMU Mocap, and InterHand, demonstrate the superiority of our framework. The code will be made publicly available.
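The abstract's Context Layer combines attention over the frame axis (a soft self-expressiveness constraint, where each frame is re-expressed as a weighted combination of all frames) with a temporal encoding that injects frame ordering. A minimal NumPy sketch of that idea follows; it is an illustration, not the authors' architecture: single-head attention stands in for MHA, and all names and sizes are hypothetical.

```python
import numpy as np

def temporal_encoding(seq_len, dim):
    """Sinusoidal temporal encoding (Transformer-style): one vector per frame."""
    pos = np.arange(seq_len)[:, None]          # frame index t
    i = np.arange(dim)[None, :]                # channel index
    angles = pos / np.power(10000.0, (2 * (i // 2)) / dim)
    return np.where(i % 2 == 0, np.sin(angles), np.cos(angles))  # (T, dim)

def self_attention(X):
    """Single-head self-attention over frames: rows of A are softmax
    weights, so each output frame is a convex combination of all input
    frames -- a soft self-expressiveness constraint on the sequence."""
    d = X.shape[-1]
    scores = X @ X.T / np.sqrt(d)              # (T, T) frame-to-frame affinities
    scores -= scores.max(axis=-1, keepdims=True)  # numerical stability
    A = np.exp(scores)
    A /= A.sum(axis=-1, keepdims=True)         # each row sums to 1
    return A @ X, A

# Toy sequence: T frames of flattened 2D keypoints (hypothetical sizes).
T, D = 8, 16
rng = np.random.default_rng(0)
seq = rng.standard_normal((T, D)) + temporal_encoding(T, D)  # add frame order
out, attn = self_attention(seq)
print(out.shape, attn.shape)
```

Without the temporal encoding, attention is permutation-invariant over frames, so shuffling the sequence would leave the constraint unchanged; adding the encoding is what lets the layer exploit temporal order.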

Original language: English
Pages (from-to): 10814-10828
Number of pages: 15
Journal: IEEE Transactions on Pattern Analysis and Machine Intelligence
Volume: 46
Issue number: 12
DOI
Publication status: Published - 2024
