Morpheme Segmentation for Russian: Evaluation of Convolutional Neural Network Models
This paper is aimed at evaluating the performance of existing models of morphemic analysis for Russian based on convolutional neural networks. The models were trained on a relatively small amount of annotated training data (38,368 words). We tuned the hyperparameters to accommodate the harder task setting, which helped improve the accuracy of the model. In addition to testing 15 different configurations on the available test set, a new sample of 800 words containing roots that are missing in the training sample (e.g. neologisms and recent loan words) was manually created and annotated for morphemic structure (the new dataset is made available to the community). The effectiveness of the models was evaluated on this sample, and it turned out that the performance of the CNN models was much worse on this set (an almost 30% drop in word accuracy). We performed a classification of errors made by the best model both on the standard test set and the new one.