On the application of neural network algorithms to viseme classification from face video in audiovisual speech recognition systems
This paper considers phoneme recognition from a speaker's facial expressions in voice-activated control systems. We have developed a neural network recognition algorithm based on the phonetic word decoding method and on the requirement that voice commands be pronounced as isolated syllables. The paper presents experimental results on the classification of visemes (the facial and lip positions corresponding to particular phonemes) for Russian vowels. We show how classification accuracy depends on the classifier (multilayer feed-forward network, support vector machine, k-nearest neighbor method), the image features (histogram of oriented gradients, eigenvectors, SURF local descriptors), and the type of camera (built-in camera or Kinect). The best speaker-dependent recognition accuracy is 85% for a built-in camera and 96% for Kinect depth maps, obtained when classification is performed with the histogram of oriented gradients and the support vector machine.
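To make the feature-plus-classifier pipeline concrete, the following is a minimal sketch under several assumptions and is not the implementation evaluated in the paper. It computes a simplified, single-cell histogram of oriented gradients (no cell grid or block normalization) and classifies it with a nearest-neighbor rule, one of the three classifiers compared above; the support vector machine that gave the best reported accuracy is not used here, so that the example stays dependency-free. The images are synthetic rectangles standing in for mouth regions, and the names `make_toy_mouth`, `hog_descriptor`, `classify_knn`, and the labels `wide_mouth` and `tall_mouth` are hypothetical.

```python
import math


def make_toy_mouth(width, height, size=16):
    """Synthetic stand-in for a mouth image: a dark rectangle on a light
    background (assumption for illustration, not real viseme data)."""
    image = [[1.0] * size for _ in range(size)]
    top, left = (size - height) // 2, (size - width) // 2
    for y in range(top, top + height):
        for x in range(left, left + width):
            image[y][x] = 0.0
    return image


def hog_descriptor(image, n_bins=9):
    """Simplified single-cell HOG: one magnitude-weighted histogram of
    unsigned gradient orientations (0 to 180 degrees), L2-normalized."""
    rows, cols = len(image), len(image[0])
    hist = [0.0] * n_bins
    bin_width = 180.0 / n_bins
    for y in range(1, rows - 1):
        for x in range(1, cols - 1):
            # Central differences approximate the image gradient.
            gx = image[y][x + 1] - image[y][x - 1]
            gy = image[y + 1][x] - image[y - 1][x]
            magnitude = math.hypot(gx, gy)
            if magnitude == 0.0:
                continue
            angle = math.degrees(math.atan2(gy, gx)) % 180.0
            # The final modulo guards against rounding up to the last edge.
            hist[int(angle // bin_width) % n_bins] += magnitude
    norm = math.sqrt(sum(v * v for v in hist)) or 1.0
    return [v / norm for v in hist]


def classify_knn(descriptor, training_set, k=1):
    """Majority vote among the k training descriptors that are closest
    to the query in Euclidean distance."""
    neighbours = sorted(
        training_set, key=lambda item: math.dist(descriptor, item[0])
    )[:k]
    votes = {}
    for _, label in neighbours:
        votes[label] = votes.get(label, 0) + 1
    return max(votes, key=votes.get)


# One labelled example per hypothetical class.
training_set = [
    (hog_descriptor(make_toy_mouth(10, 4)), "wide_mouth"),
    (hog_descriptor(make_toy_mouth(4, 10)), "tall_mouth"),
]

query = hog_descriptor(make_toy_mouth(12, 3))
print(classify_knn(query, training_set))  # prints "wide_mouth"
```

A wide rectangle has long horizontal edges, so most gradient energy falls in the bin around 90 degrees, while a tall rectangle concentrates it near 0 degrees; this orientation contrast is what lets the descriptor separate the two toy classes. A real system would compute the descriptor over a grid of cells on cropped mouth regions and would need many labelled frames per viseme.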