Intonation Control in Speech Synthesis for Game Voiceovers
Voicing in-game lines is an important task in game development. Neural speech synthesis can remove the need to record live actors. Despite the rapid development of generative technologies, however, synthesized voices often lack the variety of intonation inherent in human speech. This is the "one-to-many" problem: a single text may have several suitable intonations. During training, the model averages over these variants, which flattens expressiveness and leads to a robotic sound.
This problem can be addressed by using prosody control to introduce variability into synthesized speech. Existing approaches often do not account for all prosodic characteristics or require manual annotation, which makes it difficult to control intonation after training. The main goal of this work is to develop a method for unsupervised prosody modeling based on acoustic and linguistic features, using discrete labels, to improve parametric speech synthesis for game voiceovers.