Teacher-Student Learning: Efficient Hierarchical Message Aggregation Hashing for Cross-Modal Retrieval

Wentao Tan; Lei Zhu; Jingjing Li; Huaxiang Zhang; Junwei Han

doi:10.1109/TMM.2022.3177901

Teacher-Student Learning: Efficient Hierarchical Message Aggregation Hashing for Cross-Modal Retrieval

Wentao Tan, Lei Zhu, Jingjing Li, Huaxiang Zhang, Junwei Han

School of Automation

Research output: Contribution to journal › Article › peer-review

25 Scopus citations

Abstract

Inspired by the powerful representation capability of deep neural networks, deep cross-modal hashing methods have recently drawn much attention and various deep cross-modal hashing methods have been developed. However, two key problems have not been solved well yet: 1) With advanced neural network models, how to seek the multi-modal alignment space which can effectively model the intrinsic multi-modal correlations and reduce the heterogeneous modality gaps. 2) How to effectively and efficiently preserve the modelled multi-modal semantic correlations into the binary hash codes under the deep learning paradigm. In this paper, we propose a Hierarchical Message Aggregation Hashing (HMAH) method within an efficient teacher-student learning framework. Specifically, on the teacher end, we develop hierarchical message aggregation networks to construct a multi-modal complementary space by aggregating the semantic messages hierarchically across different modalities, which can better align the heterogeneous modalities and model the fine-grained multi-modal correlations. On the student end, we train a couple of student modules that learn hash functions to support cross-modal retrieval. We design a cross-modal correlation knowledge distillation strategy which seamlessly transfers the modelled fine-grained multi-modal semantic correlations from the teacher to the lightweight student modules. With the fine-grained knowledge supervision from teacher module, the semantic representation capability of hash functions can be enhanced. In addition, the whole learning framework avoids the time-consuming finetuning on the pre-trained deep models as existing methods and it is computationally efficient. Experimental results demonstrate the significant performance improvement of the proposed method on both retrieval accuracy and efficiency, compared with the state-of-the-art deep cross-modal hashing methods.

Original language	English
Pages (from-to)	4520-4532
Number of pages	13
Journal	IEEE Transactions on Multimedia
Volume	25
DOIs	https://doi.org/10.1109/TMM.2022.3177901
State	Published - 2023

Keywords

Knowledge Distillation
Lightweight
Message Aggregation
Multimodal
Supervised Hashing

Access to Document

10.1109/TMM.2022.3177901

Cite this

@article{5f01b14c4aa04519b60d2f1b252f563f,

title = "Teacher-Student Learning: Efficient Hierarchical Message Aggregation Hashing for Cross-Modal Retrieval",

abstract = "Inspired by the powerful representation capability of deep neural networks, deep cross-modal hashing methods have recently drawn much attention and various deep cross-modal hashing methods have been developed. However, two key problems have not been solved well yet: 1) With advanced neural network models, how to seek the multi-modal alignment space which can effectively model the intrinsic multi-modal correlations and reduce the heterogeneous modality gaps. 2) How to effectively and efficiently preserve the modelled multi-modal semantic correlations into the binary hash codes under the deep learning paradigm. In this paper, we propose a Hierarchical Message Aggregation Hashing (HMAH) method within an efficient teacher-student learning framework. Specifically, on the teacher end, we develop hierarchical message aggregation networks to construct a multi-modal complementary space by aggregating the semantic messages hierarchically across different modalities, which can better align the heterogeneous modalities and model the fine-grained multi-modal correlations. On the student end, we train a couple of student modules that learn hash functions to support cross-modal retrieval. We design a cross-modal correlation knowledge distillation strategy which seamlessly transfers the modelled fine-grained multi-modal semantic correlations from the teacher to the lightweight student modules. With the fine-grained knowledge supervision from teacher module, the semantic representation capability of hash functions can be enhanced. In addition, the whole learning framework avoids the time-consuming finetuning on the pre-trained deep models as existing methods and it is computationally efficient. Experimental results demonstrate the significant performance improvement of the proposed method on both retrieval accuracy and efficiency, compared with the state-of-the-art deep cross-modal hashing methods.",

keywords = "Knowledge Distillation, Lightweight, Message Aggregation, Multimodal, Supervised Hashing",

author = "Wentao Tan and Lei Zhu and Jingjing Li and Huaxiang Zhang and Junwei Han",

note = "Publisher Copyright: {\textcopyright} 1999-2012 IEEE.",

year = "2023",

doi = "10.1109/TMM.2022.3177901",

language = "英语",

volume = "25",

pages = "4520--4532",

journal = "IEEE Transactions on Multimedia",

issn = "1520-9210",

publisher = "Institute of Electrical and Electronics Engineers Inc.",

}

TY - JOUR

T1 - Teacher-Student Learning

T2 - Efficient Hierarchical Message Aggregation Hashing for Cross-Modal Retrieval

AU - Tan, Wentao

AU - Zhu, Lei

AU - Li, Jingjing

AU - Zhang, Huaxiang

AU - Han, Junwei

PY - 2023

Y1 - 2023

N2 - Inspired by the powerful representation capability of deep neural networks, deep cross-modal hashing methods have recently drawn much attention and various deep cross-modal hashing methods have been developed. However, two key problems have not been solved well yet: 1) With advanced neural network models, how to seek the multi-modal alignment space which can effectively model the intrinsic multi-modal correlations and reduce the heterogeneous modality gaps. 2) How to effectively and efficiently preserve the modelled multi-modal semantic correlations into the binary hash codes under the deep learning paradigm. In this paper, we propose a Hierarchical Message Aggregation Hashing (HMAH) method within an efficient teacher-student learning framework. Specifically, on the teacher end, we develop hierarchical message aggregation networks to construct a multi-modal complementary space by aggregating the semantic messages hierarchically across different modalities, which can better align the heterogeneous modalities and model the fine-grained multi-modal correlations. On the student end, we train a couple of student modules that learn hash functions to support cross-modal retrieval. We design a cross-modal correlation knowledge distillation strategy which seamlessly transfers the modelled fine-grained multi-modal semantic correlations from the teacher to the lightweight student modules. With the fine-grained knowledge supervision from teacher module, the semantic representation capability of hash functions can be enhanced. In addition, the whole learning framework avoids the time-consuming finetuning on the pre-trained deep models as existing methods and it is computationally efficient. Experimental results demonstrate the significant performance improvement of the proposed method on both retrieval accuracy and efficiency, compared with the state-of-the-art deep cross-modal hashing methods.

AB - Inspired by the powerful representation capability of deep neural networks, deep cross-modal hashing methods have recently drawn much attention and various deep cross-modal hashing methods have been developed. However, two key problems have not been solved well yet: 1) With advanced neural network models, how to seek the multi-modal alignment space which can effectively model the intrinsic multi-modal correlations and reduce the heterogeneous modality gaps. 2) How to effectively and efficiently preserve the modelled multi-modal semantic correlations into the binary hash codes under the deep learning paradigm. In this paper, we propose a Hierarchical Message Aggregation Hashing (HMAH) method within an efficient teacher-student learning framework. Specifically, on the teacher end, we develop hierarchical message aggregation networks to construct a multi-modal complementary space by aggregating the semantic messages hierarchically across different modalities, which can better align the heterogeneous modalities and model the fine-grained multi-modal correlations. On the student end, we train a couple of student modules that learn hash functions to support cross-modal retrieval. We design a cross-modal correlation knowledge distillation strategy which seamlessly transfers the modelled fine-grained multi-modal semantic correlations from the teacher to the lightweight student modules. With the fine-grained knowledge supervision from teacher module, the semantic representation capability of hash functions can be enhanced. In addition, the whole learning framework avoids the time-consuming finetuning on the pre-trained deep models as existing methods and it is computationally efficient. Experimental results demonstrate the significant performance improvement of the proposed method on both retrieval accuracy and efficiency, compared with the state-of-the-art deep cross-modal hashing methods.

KW - Knowledge Distillation

KW - Lightweight

KW - Message Aggregation

KW - Multimodal

KW - Supervised Hashing

UR - http://www.scopus.com/inward/record.url?scp=85175440667&partnerID=8YFLogxK

U2 - 10.1109/TMM.2022.3177901

DO - 10.1109/TMM.2022.3177901

M3 - 文章

AN - SCOPUS:85175440667

SN - 1520-9210

VL - 25

SP - 4520

EP - 4532

JO - IEEE Transactions on Multimedia

JF - IEEE Transactions on Multimedia

ER -

Teacher-Student Learning: Efficient Hierarchical Message Aggregation Hashing for Cross-Modal Retrieval

Abstract

Keywords

Access to Document

Other files and links

Fingerprint

Cite this