Russian Q&A Method Study: From Naive Bayes to Convolutional Neural Networks
This paper deals with automatic classification of questions in the Russian language. In contrast to previously used methods, we introduce a convolutional neural network for question classification. We took advantage of an existing corpus of 2008 questions, manually annotated in accordance with a pragmatic 14-class typology. We modified the data by reducing the typology to 13 classes, expanding the dataset and improving the representativeness of some of the question types. The training data in a combined representation of word embeddings and binary regular expression-based features was used for supervised learning to approach the task of question tagging. We tested a convolutional neural network against a state-of-the-art Russian language question classification algorithm, an SVM classifier with a linear kernel and questions represented as word trigram counts, as the baseline model (60.22% accuracy on the new dataset). We also tested several widely-used machine learning methods (logistic regression, Bernoulli Naïve Bayes) trained on the new question representation. The best result of 72.38% accuracy (micro) was achieved with the CNN model. We also ran experiments on pertinent feature selection with a simple Multinomial Naïve Bayes classifier, using word features only, Add-1 smoothing and no strategy for out-of-vocabulary words. Surprisingly, the setting with top-1200 informative word features (by PPMI) and equal priors achieved only slightly lower accuracy, 70.72%, which also beats the baseline by a large margin.