Abstract
In this paper, we propose a neural end-to-end system for voice-preserving, lip-synchronous video translation. The system combines multiple component models to produce a video of the original speaker speaking the target language. The video is lip-synchronous with the target speech, yet it maintains the original speaker's emphases in speech, voice characteristics, and face video. The result is a video of a person speaking a language they do not actually know. For evaluation, we present a user study of the complete system and separate evaluations of the individual components. Since no existing dataset is available for evaluating the whole system, we collect our own test set. The results indicate that our system can generate convincing videos of the original speaker speaking the target language while preserving the original speaker's characteristics.
Original language | English |
---|---|
Title of host publication | ICASSPW 2023 - 2023 IEEE International Conference on Acoustics, Speech and Signal Processing Workshops, Proceedings |
Publisher | Institute of Electrical and Electronics Engineers Inc. |
ISBN (Electronic) | 9798350302615 |
Publication status | Published - 2023 |
Event | 2023 IEEE International Conference on Acoustics, Speech and Signal Processing Workshops, ICASSPW 2023 - Rhodes Island, Greece. Duration: 4 Jun 2023 → 10 Jun 2023 |
Publication series
Name | ICASSPW 2023 - 2023 IEEE International Conference on Acoustics, Speech and Signal Processing Workshops, Proceedings |
---|---|
Conference
Conference | 2023 IEEE International Conference on Acoustics, Speech and Signal Processing Workshops, ICASSPW 2023 |
---|---|
Country/Territory | Greece |
City | Rhodes Island |
Period | 4/06/23 → 10/06/23 |
Bibliographical note
Publisher Copyright: © 2023 IEEE.
Keywords
- end-to-end video translation
- lip generation
- speech translation
- text-to-speech
- voice conversion