The paper considers the phoneme recognition by facial expressions of a speaker in voice-activated control systems. We have developed a neural network recognition algorithm by using the phonetic words decoding method and the requirement for isolated syllable pronunciation of voice commands. The paper presents the experimental results of viseme (facial and lip position corresponding to a particular phoneme) classification of Russian vowels. We show the dependence of the classification accuracy on the used classifier (multilayer feed-forward network, support vector machine, k-nearest neighbor method), image features (histogram of oriented gradients, eigenvectors, SURF local descriptors) and the type of camera (built-in or Kinect one). The best accuracy of speaker-dependent recognition is shown to be 85% for a built-in camera and 96% for Kinect depth maps when the classification is performed with the histogram of oriented gradients and the support vector machine.
An ensemble of classifiers has been built to solve the problem of video image recognition. The paper offers a way to estimate the a posteriori probability of an image belonging to a particular class in the case of an arbitrary distance and nearest neighbor method. The estimation is shown to be equivalent to the optimal naive Bayesian estimate given Kullback-Leibler divergence being used as proximity measure. The block diagram of a video image recognition system is presented. The system features automatic adaptation of the list of images of identical objects which is fed to the committee machine input. The system is tested in face recognition task using popular data bases (FERET, AT&T, Yale) and the results are discussed.
The goal of the study is to increase the computation efficiency of the face recognition that uses feature vectors to describe facial images on photos and videos. These high-dimensional feature vectors are nowadays produced by convolutional neural networks. The methods to aggregate the features generated for each video frame are used to process the video sequences. A novel hierarchical recognition algorithm is proposed. In contrast to known approaches our algorithm seeks the nearest neighbors only among reference images of most reliable classes selected at the preceding stage to carry out the sequential analysis of a more detailed description level (with a greater dimensionality of the feature vector). At each stage principal components are compared, the number of the components being chosen according to a given portion of explained variations. Datasets like Labeled Faces in the Wild, YouTubeFaces, IAPRA, Jenus Benchmark C and different neural-net face descriptors are used to compare the algorithm with other methods. In contrast with the conventional nearest-neighbor method, the proposed approach is shown to allow a 2- to 16-times reduction of the classifier running time.
We analyzed the way to increase computational efficiency of video-based image recognition methods with matching of high dimensional feature vectors extracted by deep convolutional neural networks. We proposed an algorithm for approximate nearest neighbor search. At the first step, for a given video frame the algorithm verifies a reference image obtained when recognizing the previous frame. After that the frame is compared with a few number of reference images. Each next examined reference image is chosen so that to maximize conditional probability density of distances to the reference instances tested at previous steps. To decrease the required memory space we beforehand calculate only distances from all the images to small number of instances (pivots). When experimenting with either face photos from Labeled Faces in the Wild and PubFig83 datasets or with video data from YouTube Faces we showed that our algorithm allows accelerating the recognition procedure by 1.4–4 times comparing with known approximate nearest neighbor methods.
We discuss the video classification problem with the matching of feature vectors extracted using deep convolutional neural networks from each frame. We propose the novel recognition method based on representation of each frame as a sequence of fuzzy sets of reference classes whose degrees of membership are defined based on asymptotic distribution of the Kullback–Leibler information divergence and its relation with the maximum likelihood method. In order to increase the classification accuracy, we perform the fuzzy intersection (product triangular norms) of these sets. Experimental study with YTF (YouTube Faces) and IJB-A (IARPA Janus Benchmark A) video datasets and VGGFace, ResFace and LightCNN descriptors shows that the proposed approach allows us to increase the accuracy of recognition by 2–6% compering with the known classification methods.
In this paper, we analyze effective methods of multi-label classification of image sets in development of visual recommender systems. We propose a two-step algorithm, which at the first step performs fine-tuning of a convolutional neural network for extraction of visual features. At the second stage, the algorithm concatenates the obtained feature vectors of each image from the input set into one descriptor using modifications of a neural aggregation module based on linear squeezing of the feature space and an attention mechanism. We perform an experimental study for the dataset Amazon Product Data solving a problem of classification of customer interests based on photos of the products they have purchased. We show that one of the highest F1-measure indicators can be achieved for a one-level attention block with squeezing of the feature vectors.
The paper considers the use of convolutional neural networks for the concurrent recognition of the gender and age of a person by video records of his face. The emphasis is on the incorporation of the approach into mobile video-recording software. We have investigated the fusion of decisions obtained during the processing of each video frame, including the use of the classifier committee based on Dempster–Shafer theory. We propose the novel age prediction method using the evaluation of the expectation of the most probable ages. We have compared existing neural-net models with a specially trained modification of the MobileNet convolution network with two outputs. The experimental results are given for such data collections as Kinect, IJB-A, Indian Movie and EmotiW. As compared with other conventional methods, our approach makes it possible to increase the age and sex recognition accuracy by 2-5% and 5-10% respectively.
The research subject is the computational complexity of the probabilistic neural network (PNN) in the pattern recognition problem for large model databases. We examined the following methods of increasing the efficiency of a neuralnetwork classifier: a parallel multithread realization, reducing the PNN to a criterion with testing of homogeneity of feature histograms of input and reference images, approximate nearestneighbor analyses (BestBin First, directed enumeration methods). The approach was tested in facialrecognition experiments with FERET dataset.
Decision support in equipment condition monitoring systems with image processing is analyzed. Long-run accumulation of information about earlier made decisions is used to realize the adaptiveness of the proposed approach. It is shown that unlike conventional classification problems, the recognition of abnormalities uses training samples supplemented with reward estimates of earlier decisions and can be tackled using reinforcement learning algorithms. We consider the basic stages of contextual multi-armed bandit algorithms during which the probabilistic distributions of each state are evaluated to evaluate the current knowledge of the states, and the decision space is explored to increase the decision-making efficiency. We propose a new decision-making method, which uses the probabilistic neural network to classify abnormal situation and the softmax rule to explore the decision space. A modelling experiment in image processing was carried out to show that our approach allows a higher accuracy of abnormality detection than other known methods, especially for small-size initial training samples.