TY - GEN
T1 - Speech pattern discovery using audio-visual fusion and canonical correlation analysis
AU - Xie, Lei
AU - Xu, Yinqing
AU - Zheng, Lilei
AU - Huang, Qiang
AU - Li, Bingfeng
PY - 2012
Y1 - 2012
N2 - In this paper, we address the problem of automatic discovery of speech patterns using audio-visual information fusion. Unlike previous studies based on the audio modality alone, our work uses not only the acoustic information but also visual features extracted from the mouth region. To make effective use of the multimodal information, several audio-visual fusion strategies are employed, including feature concatenation, similarity weighting and decision fusion. Specifically, our decision fusion approach retains the reliable patterns discovered in the audio and visual modalities. Moreover, we use canonical correlation analysis (CCA) to address the temporal asynchrony between the audio and visual speech modalities, and unbounded dynamic time warping (UDTW) is adopted to search for speech patterns through audio and visual similarity matrices calculated on the aligned audio and visual sequences. Experiments on an audio-visual corpus show, for the first time, that speech pattern discovery can be improved by the use of visual information. The decision fusion approach shows superior performance compared with standard feature concatenation and similarity weighting, and CCA-based audio-visual synchronization plays an important role in the performance improvement.
KW - Audio-visual speech processing
KW - Canonical correlation analysis
KW - Dynamic time warping
KW - Speech pattern discovery
UR - http://www.scopus.com/inward/record.url?scp=84878554458&partnerID=8YFLogxK
M3 - Conference contribution
AN - SCOPUS:84878554458
SN - 9781622767595
T3 - 13th Annual Conference of the International Speech Communication Association 2012, INTERSPEECH 2012
SP - 2371
EP - 2374
BT - 13th Annual Conference of the International Speech Communication Association 2012, INTERSPEECH 2012
T2 - 13th Annual Conference of the International Speech Communication Association 2012, INTERSPEECH 2012
Y2 - 9 September 2012 through 13 September 2012
ER -