TY - GEN
T1 - The Hearing Impairment Phenomenon in Audio-Visual Sound Source Localization
AU - Liu, Tianyu
AU - Zhang, Peng
N1 - Publisher Copyright: ©2025 IEEE.
PY - 2025
Y1 - 2025
N2 - Audio-visual sound source localization (AV-SSL) leverages audio to identify the sounding object within the visual space. By mapping visual and audio modality representations into a shared space, cosine similarity-based methods have demonstrated strong localization performance. In this work, we identify a phenomenon in existing methods, termed Hearing Impairment (HI), in which the network localizes a specific object in the image regardless of the input audio. To measure the extent of HI, we construct three additional audio-visual mismatched datasets (Un-VGGSS, Un-S4, and Un-AVSS) and introduce a novel metric, which is combined with mIoU to evaluate sound source localization performance comprehensively. We trained the six latest AV-SSL methods on the VGGSound, S4, and AVSS datasets, then evaluated them on VGGSS, S4 Test, and AVSS Test. Results indicate that some methods localize well but fail to distinguish whether the visual object is producing the sound. Future work should evaluate the model’s ability to differentiate sounding objects rather than focusing only on localization accuracy.
AB - Audio-visual sound source localization (AV-SSL) leverages audio to identify the sounding object within the visual space. By mapping visual and audio modality representations into a shared space, cosine similarity-based methods have demonstrated strong localization performance. In this work, we identify a phenomenon in existing methods, termed Hearing Impairment (HI), in which the network localizes a specific object in the image regardless of the input audio. To measure the extent of HI, we construct three additional audio-visual mismatched datasets (Un-VGGSS, Un-S4, and Un-AVSS) and introduce a novel metric, which is combined with mIoU to evaluate sound source localization performance comprehensively. We trained the six latest AV-SSL methods on the VGGSound, S4, and AVSS datasets, then evaluated them on VGGSS, S4 Test, and AVSS Test. Results indicate that some methods localize well but fail to distinguish whether the visual object is producing the sound. Future work should evaluate the model’s ability to differentiate sounding objects rather than focusing only on localization accuracy.
KW - Audio-visual
KW - Hearing Impairment
KW - Sound Source Localization
KW - Sounding Object
UR - https://www.scopus.com/pages/publications/105035496884
U2 - 10.1109/ICICN67355.2025.11430447
DO - 10.1109/ICICN67355.2025.11430447
M3 - Conference contribution
AN - SCOPUS:105035496884
T3 - 2025 13th International Conference on Information and Communication Networks, ICICN 2025
SP - 39
EP - 44
BT - 2025 13th International Conference on Information and Communication Networks, ICICN 2025
PB - Institute of Electrical and Electronics Engineers Inc.
T2 - 13th International Conference on Information and Communication Networks, ICICN 2025
Y2 - 8 August 2025 through 11 August 2025
ER -