Touching the Limits of a Dataset in Video-Based Facial Expression Recognition
In this paper, we examine the issue of video-based facial emotion recognition algorithms that show excellent performance on some benchmarks but have much worse accuracy in practical applications. For example, the typical error rate of contemporary deep neural networks on the RAVDESS dataset is less than 5%. We argue that such results are obtained only if the split of the dataset is incorrect, so that the same persons are present in both the training and test sets. We claim that it is more honest to use an actor-based split, in which the persons in the training and test sets are disjoint. It is experimentally demonstrated that a near state-of-the-art neural network model pre-trained on the AffectNet dataset achieves 99% accuracy on the conventional split of the RAVDESS dataset. However, when the dataset is split by actor, so that the training and testing sets contain only disjoint persons, the accuracy is 20-30% lower.
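The actor-based split described above can be sketched as a grouped split, where the actor identifier is the grouping key. A minimal sketch using scikit-learn's `GroupShuffleSplit`; the clip names and actor IDs below are hypothetical placeholders, not the RAVDESS file layout:

```python
# Subject-disjoint ("actor-based") split: no actor may appear in both sets.
# Clip names and actor IDs are synthetic placeholders for illustration.
from sklearn.model_selection import GroupShuffleSplit

samples = ["clip_%03d.mp4" % i for i in range(10)]
actors = [i % 5 for i in range(10)]  # actor ID per clip; 5 distinct actors

splitter = GroupShuffleSplit(n_splits=1, test_size=0.4, random_state=0)
train_idx, test_idx = next(splitter.split(samples, groups=actors))

train_actors = {actors[i] for i in train_idx}
test_actors = {actors[i] for i in test_idx}
assert train_actors.isdisjoint(test_actors)  # no person is in both sets
```

A conventional random split over clips, by contrast, almost always places clips of the same actor on both sides, which is the leakage the abstract attributes the inflated accuracy to.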
Previous work has shown that a mood congruence or trait congruence effect can be achieved (Chepenik et al., 2007; Rusting, 1998). The present study explores the effect of emotional state and dispositional joy on the effectiveness of emotion recognition from facial expressions. The experimental study was conducted in two groups of subjects; the general sample consisted of 39 participants. Participants' emotional state was measured with the PANAS self-report questionnaire. The participants' current mood was manipulated with an emotion induction procedure, which involved screening videos with "joyful" or "neutral" emotional coloring. To measure the speed of emotional information processing, a computer-based technique was used in which participants performed an emotion recognition task on facial expressions. We tested the hypothesis that there is a congruency effect in positive information processing: it was supposed that a positive emotional state and dispositional joy heighten the speed of positive information processing and do not influence the processing of stimuli with negative emotional coloring. Testing of the emotion induction procedure proved it to be partially successful. A congruency effect for dispositional joy was achieved: we found that a higher manifestation of this trait was associated with higher speed in recognizing joy from facial expressions. The influence of a positive emotional state was manifested in lower speed in recognizing joy. In sum, the results show that the congruency effect is expressed differently for the trait and the emotional state. Overall, the results of the conducted study provide information on the mechanisms of emotion recognition.
The article describes an approach for extracting user preferences based on the analysis of a gallery of photos and videos on a mobile device. We propose to first use fast SSD-based methods to detect objects of interest in offline mode directly on the mobile device. Next, we perform facial analysis of all visual data: we extract feature vectors from detected facial regions, cluster them, and select public photos and videos that do not contain faces from the large clusters belonging to the owner of the mobile device and his or her friends and relatives. At the second stage, these public images are processed on a remote server using very accurate but rather slow object detectors. An experimental study of several contemporary detectors on a specially designed subset of the MS COCO, ImageNet, and Open Images datasets is presented.
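The privacy-preserving filtering step above can be sketched as density-based clustering of face descriptors: faces of the owner and close contacts recur and form large clusters, while strangers' faces are isolated. A minimal sketch with scikit-learn's `DBSCAN`; the random embeddings, cluster-size threshold, and `eps` value are illustrative assumptions, not the paper's parameters:

```python
# Keep only photos whose faces fall outside the large "owner/friends" clusters.
# Random 128-D vectors stand in for real face descriptors.
import numpy as np
from sklearn.cluster import DBSCAN

rng = np.random.default_rng(0)
owner = rng.normal(0.0, 0.05, size=(20, 128)) + 1.0  # recurring, similar faces
strangers = rng.normal(0.0, 1.0, size=(5, 128))      # isolated faces
embeddings = np.vstack([owner, strangers])

labels = DBSCAN(eps=1.2, min_samples=3).fit_predict(embeddings)
cluster_sizes = {c: int(np.sum(labels == c)) for c in set(labels) if c != -1}
large = {c for c, n in cluster_sizes.items() if n >= 5}  # "owner" clusters

# Faces that are noise (-1) or in small clusters are treated as public.
public = [i for i, c in enumerate(labels) if c == -1 or c not in large]
```

Only the `public` images would then be uploaded to the remote server for the slow, accurate detection stage.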
We propose a novel multi-texture synthesis model based on generative adversarial networks (GANs) with a user-controllable mechanism. This control allows the user to explicitly specify the texture that the model should generate. The property follows from an encoder that learns a latent representation for each texture in the dataset. To ensure dataset coverage, we use an adversarial loss function that penalizes incorrect reproduction of a given texture. In experiments, we show that our model can learn descriptive texture manifolds for large datasets and from raw data such as a collection of high-resolution photos. We also show that our unsupervised learning pipeline may help segmentation models. Moreover, we apply our method to produce 3D textures and show that it outperforms existing baselines.
The EPiC Series in Language and Linguistics publishes high quality collections of papers in language, linguistics and related areas.
This volume presents the results of the Neural Information Processing Systems Competition track at the 2018 NeurIPS conference. The competition followed the same format as the 2017 NIPS competition track. Out of 21 submitted proposals, eight competitions were selected, spanning the areas of robotics, health, computer vision, natural language processing, systems, and physics.
Competitions have become an integral part of advancing the state of the art in artificial intelligence (AI). They exhibit one important difference from benchmarks: competitions test a system end-to-end rather than evaluating only a single component, and they assess the practicability of an algorithmic solution in addition to its feasibility.
The paper focuses on the way one's own emotional state influences the recognition of other people's emotions. Existing research indicates an effect of congruence between the emotions experienced at the moment and the evaluation of emotional stimuli. Our experimental study tested hypotheses about the influence of emotional states on two aspects of emotion recognition: accuracy and sensitivity. We hypothesized that the emotional state of the observer reduces accuracy and increases sensitivity. The study involved 69 participants divided into three groups. The baseline emotional state was assessed using a self-report measure. We used video clips with neutral, positive, and negative emotional content to induce different emotional states in each group. The accuracy and sensitivity of emotion recognition were measured using a test based on video samples of people's behavior in different situations. The results showed that the emotional state in the control group was rather "tense" and differed from neutral. However, our hypotheses were not supported: the groups with different induced emotional states did not exhibit any significant differences in the accuracy of emotion recognition. The control group demonstrated higher sensitivity. These preliminary results are discussed in the context of methodological issues in emotion recognition research (such as emotion induction, assessment of emotions, and differentiation of emotional states and traits).
It has been shown that the activations invoked by an image within the top layers of a large convolutional neural network provide a high-level descriptor of the visual content of the image. In this paper, we investigate the use of such descriptors (neural codes) within the image retrieval application. In experiments with several standard retrieval benchmarks, we establish that neural codes perform competitively even when the convolutional neural network has been trained for an unrelated classification task (e.g. Image-Net). We also evaluate the improvement in the retrieval performance of neural codes when the network is retrained on a dataset of images that are similar to images encountered at test time. We further evaluate the performance of compressed neural codes and show that a simple PCA compression provides very good short codes that give state-of-the-art accuracy on a number of datasets. In general, neural codes turn out to be much more resilient to such compression in comparison with other state-of-the-art descriptors. Finally, we show that discriminative dimensionality reduction trained on a dataset of pairs of matched photographs improves the performance of PCA-compressed neural codes even further. Overall, our quantitative experiments demonstrate the promise of neural codes as visual descriptors for image retrieval.
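The PCA compression and retrieval step can be sketched in a few lines. A minimal sketch assuming the neural codes have already been extracted; the random matrix below stands in for real top-layer activations, and the dimensionalities (4096-D codes compressed to 128-D) are illustrative:

```python
# PCA-compressed "neural codes": L2-normalize, project, renormalize, then
# retrieve by cosine similarity. Random vectors stand in for CNN activations.
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import normalize

rng = np.random.default_rng(42)
codes = rng.normal(size=(1000, 4096))  # e.g. fc-layer activations per image
codes = normalize(codes)               # L2-normalize the raw codes

pca = PCA(n_components=128, random_state=0)
short_codes = normalize(pca.fit_transform(codes))  # 128-D short codes

# Cosine similarity reduces to a dot product of L2-normalized codes.
query = short_codes[0]
ranks = np.argsort(-short_codes @ query)  # ranks[0] is the query image itself
```

The renormalization after projection keeps the dot product interpretable as cosine similarity, which is the usual choice for comparing such descriptors.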
The paper reviews age and gender recognition methods for video data based on modern deep convolutional neural networks. We present a comparative analysis of classifier fusion algorithms that aggregate decisions for individual frames. We implemented a video-based recognition system with several aggregation methods to improve age and gender identification accuracy. An experimental comparison of the proposed approach with traditional simple voting on the IJB-A, Indian Movies, and Kinect datasets is provided. It is demonstrated that the most accurate decisions are obtained using the geometric mean and the mathematical expectation of the outputs of the softmax layers of the convolutional neural networks for gender recognition and age prediction, respectively.
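The geometric-mean fusion rule for gender recognition can be sketched as follows. The per-frame softmax scores below are synthetic placeholders, not outputs of the actual networks:

```python
# Aggregate per-frame softmax outputs over a video by their geometric mean.
import numpy as np

frame_probs = np.array([  # per-frame softmax over {female, male}
    [0.7, 0.3],
    [0.6, 0.4],
    [0.2, 0.8],  # one uncertain or misclassified frame
    [0.8, 0.2],
])

# Geometric mean across frames, computed stably in log space.
geo = np.exp(np.mean(np.log(frame_probs), axis=0))
geo /= geo.sum()                   # renormalize to a distribution
video_label = int(np.argmax(geo))  # 0 = female, 1 = male
```

Unlike simple voting, which discards the per-frame confidences by hard-thresholding each frame first, this rule lets confidently classified frames outweigh uncertain ones.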