TY - GEN
T1 - Wi-CLIP
T2 - 3rd International Conference on Artificial Intelligence of Things and Systems, AIoTSys 2025
AU - Zhang, Haoyu
AU - Guo, Yifan
AU - Wang, Zhu
AU - Sun, Zhuo
AU - Guo, Bin
AU - Yu, Zhiwen
N1 - Publisher Copyright:
© The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd. 2026.
PY - 2026
Y1 - 2026
N2 - Wi-Fi-based gesture recognition, driven by deep learning, holds significant promise for privacy-preserving and all-weather sensing. However, current methods typically rely on large amounts of labeled data, and Wi-Fi signals vary significantly across gestures, leading to severe performance degradation when models encounter unseen gestures. To address these challenges, we explore the potential of transferring knowledge from large pre-trained language models to improve the generalization of Wi-Fi-based gesture recognition systems. To this end, we propose a zero-shot gesture recognition framework, named Wi-CLIP. Inspired by the vision-language pre-training model CLIP, our method constructs a cross-modal radio frequency-text model centered on aligning Wi-Fi signals with textual semantics. Specifically, we develop a novel Wi-Fi signal encoder and a BERT-based text encoder, aligning the two modalities within a shared semantic space using contrastive learning. Our framework achieves an average recognition accuracy of 89.12% across 6 gestures. Notably, when trained on only 5 gestures, Wi-CLIP demonstrates a remarkable zero-shot recognition accuracy of 78.79% on the sixth, previously unseen gesture. This highlights its strong generalization capability and effectiveness in cross-modal representation learning.
AB - Wi-Fi-based gesture recognition, driven by deep learning, holds significant promise for privacy-preserving and all-weather sensing. However, current methods typically rely on large amounts of labeled data, and Wi-Fi signals vary significantly across gestures, leading to severe performance degradation when models encounter unseen gestures. To address these challenges, we explore the potential of transferring knowledge from large pre-trained language models to improve the generalization of Wi-Fi-based gesture recognition systems. To this end, we propose a zero-shot gesture recognition framework, named Wi-CLIP. Inspired by the vision-language pre-training model CLIP, our method constructs a cross-modal radio frequency-text model centered on aligning Wi-Fi signals with textual semantics. Specifically, we develop a novel Wi-Fi signal encoder and a BERT-based text encoder, aligning the two modalities within a shared semantic space using contrastive learning. Our framework achieves an average recognition accuracy of 89.12% across 6 gestures. Notably, when trained on only 5 gestures, Wi-CLIP demonstrates a remarkable zero-shot recognition accuracy of 78.79% on the sixth, previously unseen gesture. This highlights its strong generalization capability and effectiveness in cross-modal representation learning.
KW - Gesture Recognition
KW - Vision Language Model
KW - Wireless Sensing
KW - Zero Shot Learning
UR - https://www.scopus.com/pages/publications/105028088974
U2 - 10.1007/978-981-95-2581-2_12
DO - 10.1007/978-981-95-2581-2_12
M3 - Conference contribution
AN - SCOPUS:105028088974
SN - 9789819525805
T3 - Communications in Computer and Information Science
SP - 174
EP - 189
BT - Artificial Intelligence of Things and Systems - 3rd International Conference, AIoTSys 2025, Proceedings
A2 - Liu, Sicong
A2 - Zheng, Xiaolong
A2 - Ma, Dong
A2 - Wu, Yuezhong
PB - Springer Science and Business Media Deutschland GmbH
Y2 - 15 August 2025 through 17 August 2025
ER -