Gender domain adaptation for automatic speech recognition
This paper focuses on finetuning acoustic models for speaker adaptation to a given gender. We pretrained a Transformer baseline model on LibriSpeech-960 and conducted experiments with finetuning on gender-specific test subsets. The resulting word error rate (WER) is up to 5% lower on the male subset and 3% lower on the female subset relative to the baseline when the encoder and decoder layers are not frozen and tuning starts from the last checkpoint. Moreover, we adapted our base model on the complete L2 Arctic dataset of accented speech and finetuned it for particular speakers and for the male and female genders separately. The models trained on the gender subsets obtained 1-2% lower WER compared to the model tuned on the whole L2 Arctic dataset. Finally, we experimentally confirmed that concatenating pretrained voice embeddings (x-vectors) with the embeddings from a conventional encoder does not significantly improve speech recognition accuracy.
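As a rough illustration of the last experiment, the sketch below fuses a fixed per-utterance x-vector with frame-level encoder outputs by broadcasting and concatenation. The module name, dimensions, and projection layer are assumptions made for the example, not the configuration used in the paper.

import torch
import torch.nn as nn

class XVectorFusion(nn.Module):
    """Sketch: concatenate a fixed speaker x-vector with frame-level encoder outputs.

    encoder_out has shape (batch, time, enc_dim); the x-vector has shape
    (batch, xvec_dim) and is broadcast over the time axis before concatenation.
    """

    def __init__(self, enc_dim=512, xvec_dim=512):
        super().__init__()
        # Project the concatenated features back to the decoder's expected width.
        self.proj = nn.Linear(enc_dim + xvec_dim, enc_dim)

    def forward(self, encoder_out, xvector):
        xvec = xvector.unsqueeze(1).expand(-1, encoder_out.size(1), -1)
        fused = torch.cat([encoder_out, xvec], dim=-1)   # (B, T, enc_dim + xvec_dim)
        return self.proj(fused)

# Example with random tensors standing in for real features.
fusion = XVectorFusion()
enc = torch.randn(4, 100, 512)      # batch of 4 utterances, 100 frames each
xvec = torch.randn(4, 512)          # one x-vector per utterance
print(fusion(enc, xvec).shape)      # torch.Size([4, 100, 512])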
Since the early 1990s, speaker adaptation has been one of the most intensively studied areas in speech recognition. State-of-the-art batch-mode adaptation algorithms assume that the speech of a particular speaker contains enough information about the user's voice. In this article we propose to let the user manually verify whether the adaptation data are useful. Our procedure requires the speaker to pronounce syllables containing each vowel of a particular language. The algorithm consists of two steps looping through all syllables. First, LPC analysis is performed on the extracted vowel, and the LPC coefficients are used to synthesize a new sound (with a fixed pitch period) that is played back. If the user does not perceive this synthesized sound as the original one, the syllable must be recorded again. At the second stage, the speaker is asked to produce another syllable with the same vowel in order to automatically verify the stability of pronunciation. If the two signals are close (in terms of the Itakura-Saito divergence), the sounds are marked as "good" for adaptation; otherwise, both steps are repeated. In the experiments we examine the problem of vowel recognition for the Russian language in our voice control system, which fuses two classifiers: CMU Sphinx with a speaker-independent acoustic model and a Euclidean comparison of the MFCC features of the model vowel and the input signal frames. Our results support the claim that the proposed approach provides better accuracy and reliability than the traditional MAP/MLLR techniques implemented in CMU Sphinx.
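The stability check in the second stage can be illustrated with the following sketch: LPC coefficients are estimated for two frames of the same vowel, converted into all-pole power spectra, and compared with the Itakura-Saito divergence. The LPC order, FFT size, toy signals, and acceptance threshold are illustrative assumptions, not values from the article.

import numpy as np

def lpc(frame, order=12):
    """LPC coefficients via the autocorrelation method (Levinson-Durbin)."""
    frame = np.asarray(frame, dtype=float) * np.hamming(len(frame))
    r = np.correlate(frame, frame, mode="full")[len(frame) - 1:]
    a = np.zeros(order + 1)
    a[0] = 1.0
    err = r[0]
    for i in range(1, order + 1):
        acc = r[i] + np.dot(a[1:i], r[i - 1:0:-1])
        k = -acc / err
        a[1:i] += k * a[i - 1:0:-1]
        a[i] = k
        err *= 1.0 - k * k
    return a, err

def lpc_power_spectrum(a, gain, n_fft=512):
    """All-pole spectral envelope P(w) = gain / |A(e^{jw})|^2."""
    A = np.fft.rfft(a, n_fft)
    return gain / (np.abs(A) ** 2 + 1e-12)

def itakura_saito(p, q):
    """Itakura-Saito divergence between two power spectra."""
    ratio = p / q
    return float(np.mean(ratio - np.log(ratio) - 1.0))

# Two toy "recordings" of the same vowel with slightly shifted formants.
fs = 16000
t = np.arange(1024) / fs
frame1 = np.sin(2 * np.pi * 700 * t) + 0.5 * np.sin(2 * np.pi * 1200 * t)
frame2 = np.sin(2 * np.pi * 710 * t) + 0.5 * np.sin(2 * np.pi * 1180 * t)

d = itakura_saito(lpc_power_spectrum(*lpc(frame1)), lpc_power_spectrum(*lpc(frame2)))
print("Itakura-Saito divergence:", d)
# The pronunciation would be accepted as stable when d falls below a chosen
# threshold (the threshold value itself is not specified in the abstract).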
If the training data set in an image recognition task is not very large, feature extraction with a convolutional neural network is usually applied. Here, we focus on the nonparametric classification of the extracted feature vectors using the probabilistic neural network (PNN), which is characterized by high runtime and memory-space complexity. We propose to overcome these drawbacks by replacing the exponential activation function in the Gaussian kernel with complex exponential functions. Such complex nonlinearities make it possible to accurately approximate the unknown density function using a network whose number of neurons is proportional only to the cube root of the database size. As a result, the proposed approach decreases the runtime and memory complexity of the PNN without losing its main advantages, namely fast training and convergence to the Bayesian decision. In the experimental study, we describe a protocol for comparing recognition methods on well-known visual object category data sets in the context of the small sample size problem. It is experimentally shown that our approach quickly obtains accurate decisions compared to known classifiers, including the baseline PNN.
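For reference, a minimal NumPy sketch of the baseline (classical Gaussian-kernel) PNN used for comparison is given below; the proposed complex-exponential modification is not reproduced here, and the smoothing parameter and toy data are assumptions.

import numpy as np

class PNN:
    """Classical probabilistic neural network (Parzen window classifier)."""

    def __init__(self, sigma=1.0):
        self.sigma = sigma

    def fit(self, X, y):
        # "Training" is simply storing the patterns, one pattern neuron per sample.
        self.X, self.y = np.asarray(X, dtype=float), np.asarray(y)
        self.classes_ = np.unique(self.y)
        return self

    def predict(self, X):
        preds = []
        for x in np.asarray(X, dtype=float):
            # Gaussian kernel response of every stored pattern neuron.
            d2 = np.sum((self.X - x) ** 2, axis=1)
            k = np.exp(-d2 / (2.0 * self.sigma ** 2))
            # Summation layer: average kernel response per class; argmax decides.
            scores = [k[self.y == c].mean() for c in self.classes_]
            preds.append(self.classes_[int(np.argmax(scores))])
        return np.array(preds)

# Toy usage with random vectors standing in for CNN feature vectors.
rng = np.random.default_rng(0)
X_train = np.vstack([rng.normal(0, 1, (20, 64)), rng.normal(3, 1, (20, 64))])
y_train = np.array([0] * 20 + [1] * 20)
X_test = rng.normal(3, 1, (5, 64))
print(PNN(sigma=2.0).fit(X_train, y_train).predict(X_test))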
In this article, we focus on isolated voice command recognition for autonomous man-machine and intelligent robotic systems. We propose to create a grammar model for a small set of test commands with self-loops on each state that emit blank symbols for noise and out-of-vocabulary words. In addition, we use a single arc connecting the beginning and the end of the grammar in order to filter out unknown commands. As a result, the grammar is resistant to distortions and to unexpected words near or inside a command. We implemented the proposed approach using finite state transducers in the Kaldi framework and evaluated it on self-recorded noisy data with various signal-to-noise ratios. We compared the recognition accuracy and average decision-making time of our approach with state-of-the-art continuous speech recognition engines based on language models. It is experimentally shown that our approach achieves up to 60% higher accuracy than conventional offline speech recognition methods based on language models, and recognizes utterances 3 times faster than traditional continuous speech recognition algorithms.
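A minimal sketch of the idea is shown below: a small Python helper that emits an OpenFst text-format command grammar in which every state carries a self-loop on an <unk> label to absorb noise and out-of-vocabulary words. The label name and the overall layout are assumptions for illustration, not the exact grammar used in the paper.

def command_grammar_fst(commands, unk="<unk>"):
    """Emit an OpenFst text-format acceptor for a small command grammar.

    Each command is a whitespace-separated word sequence sharing start state 0.
    Every state gets a self-loop on the <unk> label so that noise and
    out-of-vocabulary words are absorbed instead of being forced onto
    command words.
    """
    arcs, state, final_states = [], 1, []
    for command in commands:
        prev = 0
        for word in command.split():
            arcs.append(f"{prev} {state} {word} {word}")
            prev, state = state, state + 1
        final_states.append(prev)
    # Self-loops that consume unknown words / noise at every state.
    for s in range(state):
        arcs.append(f"{s} {s} {unk} {unk}")
    return "\n".join(arcs + [str(s) for s in final_states])

print(command_grammar_fst(["turn left", "turn right", "stop"]))
# The output can be compiled with OpenFst's fstcompile, given a word symbol
# table that includes the command words and the <unk> label.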
In this paper, we consider the problem of the excessive runtime and memory-space complexity of contemporary deep convolutional neural networks in image recognition. A survey of recent compression methods and efficient neural network architectures is provided. The experimental study is focused on the visual emotion recognition problem. We compare the computational speed and memory consumption during the training and inference stages of such methods as weight matrix decomposition, binarization, and hashing. It is experimentally shown that the most efficient recognition is achieved with full network binarization and matrix decomposition.
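As an example of the matrix-decomposition family of methods, the sketch below compresses a dense layer with a truncated SVD; the layer size, rank, and synthetic weights are assumptions, and the exact decomposition schemes compared in the paper may differ.

import numpy as np

def low_rank_decompose(W, rank):
    """Approximate a dense weight matrix W (out x in) by two thin factors A, B.

    Replacing y = W @ x with y = A @ (B @ x) reduces the parameter count and
    the multiplications from out*in to rank*(out + in).
    """
    U, s, Vt = np.linalg.svd(W, full_matrices=False)
    A = U[:, :rank] * s[:rank]   # (out, rank), columns scaled by singular values
    B = Vt[:rank, :]             # (rank, in)
    return A, B

# A synthetic, nearly low-rank matrix stands in for a trained fully connected layer.
rng = np.random.default_rng(0)
W = rng.normal(size=(1024, 64)) @ rng.normal(size=(64, 512)) \
    + 0.01 * rng.normal(size=(1024, 512))

A, B = low_rank_decompose(W, rank=64)
x = rng.normal(size=512)
rel_err = np.linalg.norm(W @ x - A @ (B @ x)) / np.linalg.norm(W @ x)
print(f"params: {W.size} -> {A.size + B.size}, relative error: {rel_err:.4f}")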
Vast amounts of data are collected in the course of astronomical observations. The BSA (Big Scanning Antenna) of LPI, used in the study of impulse phenomena, logs 87.5 GB of data daily (32 TB per year). Experts classified 83,096 individual observations (covering the study period of July 2012 - October 2013). Over 75% of the sample corresponds to pulsars, scintillating sources, and fast radio transients, while all other classes of observations correspond to hardware failures, interference, and the passage of Earth satellites and aircraft. In total, 15 classes of observations were identified.
Such a sample, divided into classes, makes it possible to apply machine learning algorithms. It becomes possible to develop an automated service for short-term and long-term monitoring of various classes of radio sources (including radio transients of different nature), for monitoring the Earth's ionosphere and the interplanetary and interstellar plasma, and for the search for and monitoring of different classes of radio sources. Monitoring in this case refers to the automatic filtering and detection of previously unclassified impulse phenomena.
Currently, statistical analysis methods are used for automatic filtering. This report examines an alternative approach based on a neural network machine learning algorithm that takes raw data as input and, after processing in the hidden layer, determines the class of the impulse phenomenon at the output layer.
The neural network model, trained on this sample and used to classify previously unclassified impulse phenomena, is created with the Microsoft Azure Machine Learning Studio cloud service. A web service built on top of the model allows classifying single impulse phenomena in real time (Request/Reply) as well as data samples for a given period (batch processing).
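A minimal local sketch of the single-hidden-layer classifier described above is given below, using scikit-learn in place of Azure Machine Learning Studio; the input size, hidden-layer width, training settings, and synthetic data are assumptions for illustration only.

import numpy as np
from sklearn.neural_network import MLPClassifier

# Toy stand-in for the real data: each observation is a flattened raw profile
# (here 400 synthetic samples of 128 values) labelled with one of 15 classes.
rng = np.random.default_rng(0)
X = rng.normal(size=(400, 128))
y = rng.integers(0, 15, size=400)

# A single hidden layer, matching the architecture described above; the layer
# size and iteration count are illustrative, not taken from the report.
clf = MLPClassifier(hidden_layer_sizes=(64,), max_iter=300, random_state=0)
clf.fit(X, y)
print(clf.predict(X[:5]))   # predicted classes for the first five observations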
Proceedings of the 6th International Conference on Learning Representations (ICLR 2018)
This two-volume set LNCS 11506 and LNCS 11507 constitutes the refereed proceedings of the 15th International Work-Conference on Artificial Neural Networks, IWANN 2019, held in Gran Canaria, Spain, in June 2019. The 150 revised full papers presented in this two-volume set were carefully reviewed and selected from 210 submissions. The papers are organized in topical sections on machine learning in weather observation and forecasting; computational intelligence methods for time series; human activity recognition; new and future tendencies in brain-computer interface systems; random-weights neural networks; pattern recognition; deep learning and natural language processing; software testing and intelligent systems; data-driven intelligent transportation systems; deep learning models in healthcare and biomedicine; deep learning beyond convolution; artificial neural networks for biomedical image processing; machine learning in vision and robotics; system identification, process control, and manufacturing; image and signal processing; soft computing; mathematics for neural networks; internet modeling, communication and networking; expert systems; evolutionary and genetic algorithms; advances in computational intelligence; and computational biology and bioinformatics.