TY - CONF
T1 - Multi-talker speech recognition under ego-motion noise using missing feature theory
AU - Ince, Gökhan
AU - Nakadai, Kazuhiro
AU - Rodemann, Tobias
AU - Tsujino, Hiroshi
AU - Imura, Jun-Ichi
PY - 2010
Y1 - 2010
N2 - This paper presents a system that gives a mobile robot the ability to recognize a target speaker's speech, even while the robot performs an action and multiple speakers are talking in the room. The problems associated with this system are twofold: (1) while the robot is moving, its joint motors inevitably generate ego-motion noise, and (2) recognizing target speech against other interfering speech signals is a difficult task. Since the typical solutions to (1) and (2), motor noise suppression and sound source separation, both introduce distortion into the processed signals, the performance of automatic speech recognition (ASR) deteriorates. Instead of removing the ego-motion noise with conventional noise suppression methods, in this work we investigate methods that eliminate the unreliable parts of the audio features contaminated by the ego-motion noise. For this purpose, we model masks that filter out unreliable speech features based on the ratio of speech energy to motor noise energy. We analyze the performance of the proposed technique under various test conditions by comparing it to existing Missing Feature Theory-based ASR implementations. Finally, we propose an integration framework for two different masks, one designed to eliminate ego noise and one to filter the leakage energy of interfering sound sources. We demonstrate that the proposed methods achieve high ASR accuracy.
UR - http://www.scopus.com/inward/record.url?scp=78651479762&partnerID=8YFLogxK
U2 - 10.1109/IROS.2010.5650112
DO - 10.1109/IROS.2010.5650112
M3 - Conference contribution
AN - SCOPUS:78651479762
SN - 9781424466757
T3 - IEEE/RSJ 2010 International Conference on Intelligent Robots and Systems, IROS 2010 - Conference Proceedings
SP - 982
EP - 987
BT - IEEE/RSJ 2010 International Conference on Intelligent Robots and Systems, IROS 2010 - Conference Proceedings
T2 - 23rd IEEE/RSJ 2010 International Conference on Intelligent Robots and Systems, IROS 2010
Y2 - 18 October 2010 through 22 October 2010
ER -