TY - GEN
T1 - FVSF
T2 - 2026 International Conference on Electronics, Information, and Communication, ICEIC 2026
AU - Wang, Shuaitong
AU - Zhang, Kexin
AU - Qin, Zixuan
AU - Wei, Baoguo
AU - Li, Xu
AU - Lin, Wensheng
AU - Li, Lixin
N1 - Publisher Copyright: © 2026 IEEE.
PY - 2026
Y1 - 2026
N2 - Few-Shot Learning (FSL) aims to recognize novel categories from sparse data but is fundamentally challenged by high intra-class variance and ambiguous feature representations. Existing methods are often limited by unreliable unimodal visual features or fail to bridge the modality gap due to inadequate semantic priors. To address these limitations, we introduce FVSF (Fusion of Visual and Semantic Features), a framework that constructs a highly discriminative embedding space by synergizing three key components. First, a Swin Transformer-based visual fusion module captures a rich hierarchy of visual features, from fine-grained textures to high-order semantics. Second, a Large Language Model (LLM)-driven pipeline generates descriptive, paragraph-level semantic representations for each category, resolving the ambiguity of conventional class labels. Third, a self-supervised contrastive learning strategy refines the embedding space to enhance intra-class compactness. Comprehensive experiments on standard FSL benchmarks, including MiniImageNet, CIFAR-FS, and FC100, demonstrate that FVSF significantly outperforms state-of-the-art methods, establishing a new performance benchmark. Our results confirm that the systematic integration of multimodal information provides a robust solution to the learning challenges posed by data scarcity in FSL.
AB - Few-Shot Learning (FSL) aims to recognize novel categories from sparse data but is fundamentally challenged by high intra-class variance and ambiguous feature representations. Existing methods are often limited by unreliable unimodal visual features or fail to bridge the modality gap due to inadequate semantic priors. To address these limitations, we introduce FVSF (Fusion of Visual and Semantic Features), a framework that constructs a highly discriminative embedding space by synergizing three key components. First, a Swin Transformer-based visual fusion module captures a rich hierarchy of visual features, from fine-grained textures to high-order semantics. Second, a Large Language Model (LLM)-driven pipeline generates descriptive, paragraph-level semantic representations for each category, resolving the ambiguity of conventional class labels. Third, a self-supervised contrastive learning strategy refines the embedding space to enhance intra-class compactness. Comprehensive experiments on standard FSL benchmarks, including MiniImageNet, CIFAR-FS, and FC100, demonstrate that FVSF significantly outperforms state-of-the-art methods, establishing a new performance benchmark. Our results confirm that the systematic integration of multimodal information provides a robust solution to the learning challenges posed by data scarcity in FSL.
KW - Feature fusion
KW - Few-shot
KW - Self-supervised
KW - Semantic
UR - https://www.scopus.com/pages/publications/105034889865
U2 - 10.1109/ICEIC69189.2026.11386329
DO - 10.1109/ICEIC69189.2026.11386329
M3 - Conference contribution
AN - SCOPUS:105034889865
T3 - 2026 International Conference on Electronics, Information, and Communication, ICEIC 2026
BT - 2026 International Conference on Electronics, Information, and Communication, ICEIC 2026
PB - Institute of Electrical and Electronics Engineers Inc.
Y2 - 18 January 2026 through 21 January 2026
ER -