TY - GEN
T1 - Speech pattern discovery using audio-visual fusion and canonical correlation analysis
AU - Xie, Lei
AU - Xu, Yinqing
AU - Zheng, Lilei
AU - Huang, Qiang
AU - Li, Bingfeng
PY - 2012
Y1 - 2012
N2 - In this paper, we address the problem of automatic discovery of speech patterns using audio-visual information fusion. Unlike previous studies based on the audio modality alone, our work uses not only the acoustic information but also visual features extracted from the mouth region. To make effective use of the multimodal information, several audio-visual fusion strategies are employed, including feature concatenation, similarity weighting and decision fusion. Specifically, our decision fusion approach retains the reliable patterns discovered in the audio and visual modalities. Moreover, we use canonical correlation analysis (CCA) to address the temporal asynchrony between the audio and visual speech modalities, and unbounded dynamic time warping (UDTW) is adopted to search for speech patterns through audio and visual similarity matrices calculated on the aligned audio and visual sequences. Experiments on an audio-visual corpus show, for the first time, that speech pattern discovery can be improved by the use of visual information. The decision fusion approach shows superior performance compared with standard feature concatenation and similarity weighting, and CCA-based audio-visual synchronization plays an important role in the performance improvement.
KW - Audio-visual speech processing
KW - Canonical correlation analysis
KW - Dynamic time warping
KW - Speech pattern discovery
UR - http://www.scopus.com/inward/record.url?scp=84878554458&partnerID=8YFLogxK
M3 - Conference contribution
AN - SCOPUS:84878554458
SN - 9781622767595
T3 - 13th Annual Conference of the International Speech Communication Association 2012, INTERSPEECH 2012
SP - 2371
EP - 2374
BT - 13th Annual Conference of the International Speech Communication Association 2012, INTERSPEECH 2012
T2 - 13th Annual Conference of the International Speech Communication Association 2012, INTERSPEECH 2012
Y2 - 9 September 2012 through 13 September 2012
ER -