Semi-automated Speaker Adaptation: How to Control the Quality of Adaptation?
Since the early 1990s, speaker adaptation have become one of the intensive areas in speech recognition. State-of-the-art batch-mode adaptation algorithms assume that speech of particular speaker contains enough information about the user's voice. In this article we propose to allow the user to manually verify if the adaptation is useful. Our procedure requires the speaker to pronounce syllables containing each vowel of particular language. The algorithm contains two steps looping through all syllables. At first, LPC analysis is performed for extracted vowel and the LPC coefficients are used to synthesize the new sound (with a fixed pitch period) and play it. If this synthesized sound is not perceived by the user as an original one then the syllable should be recorded again. At the second stage, speaker is asked to produce another syllable with the same vowel to automatically verify the stability of pronunciation. If two signals are closed (in terms of the Itakura-Saito divergence) then the sounds are marked as "good" for adaptation. Otherwise both steps are repeated. In the experiment we examine a problem of vowel recognition for Russian language in our voice control system which fuses two classifiers: the CMU Sphinx with speaker-independent acoustic model and Euclidean comparison of MFCC features of model vowel and input signal frames. Our results support the statement that the proposed approach provides better accuracy and reliability in comparison with traditional MAP/MLLR techniques implemented in the CMU Sphinx.