Supplementary Proceedings of the 8th International Conference on Analysis of Images, Social Networks and Texts (AIST-SUP 2019), Kazan, Russia, July 17–19, 2019, Communications in Computer and Information Science
This volume contains the refereed proceedings of the 8th International Conference on Analysis of Images, Social Networks, and Texts (AIST 2019). The previous conferences, held in 2012–2018, attracted a significant number of data scientists: students, researchers, academics, and engineers working on interdisciplinary data analysis of images, texts, and social networks.
Recent work on learning representations for graph structures has proposed methods both for representing the nodes and edges of large graphs and for representing graphs as a whole. This paper considers the popular graph2vec approach, which performs well on ordinary graphs. In natural language processing, however, a graph structure called a dependency tree is often used to express the connections between words in a sentence. We show that applying graph2vec to dependency trees yields unsatisfactory results, which is due to its underlying Weisfeiler-Lehman (WL) kernel. In this paper, we propose an adaptation of this kernel for dependency trees, as well as three other kernels that take the specific features of dependency trees into account. The resulting vector representation can be used in NLP tasks where modeling syntax is important (e.g. authorship attribution, intention labeling, and targeted sentiment analysis). Universal Dependencies treebanks were clustered to demonstrate the consistency and validity of the proposed tree representation methods.
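The WL kernel at the heart of graph2vec iteratively relabels each node by combining its own label with the labels of its neighbours, so that after a few iterations a node's label encodes its local subtree. A minimal sketch of one such relabeling step on a toy dependency tree, with hypothetical dependency-relation labels (this illustrates the standard WL step, not the paper's adapted kernels):

```python
# Sketch of one Weisfeiler-Lehman (WL) relabeling step on a toy
# dependency tree. Node labels stand in for dependency relations;
# the tree and labels below are invented for illustration.
from hashlib import md5

def wl_step(adj, labels):
    """One WL iteration: each node's new label is a hash of its own
    label concatenated with the sorted labels of its neighbours."""
    new_labels = {}
    for node, neighbours in adj.items():
        signature = labels[node] + "|" + ",".join(
            sorted(labels[n] for n in neighbours))
        new_labels[node] = md5(signature.encode()).hexdigest()[:8]
    return new_labels

# Toy dependency tree for "She reads books":
# "reads" governs both "She" (nsubj) and "books" (obj).
adj = {"reads": ["She", "books"], "She": ["reads"], "books": ["reads"]}
labels = {"reads": "root", "She": "nsubj", "books": "obj"}
refined = wl_step(adj, labels)
```

Because "She" and "books" carry different relation labels, the refined labels distinguish them even though both are leaves attached to the same head, which is the kind of structural signal the kernel aggregates into graph-level features.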
This paper evaluates the performance of existing convolutional neural network models of morphemic analysis for Russian. The models were trained on a relatively small amount of annotated data (38,368 words). We tuned the hyperparameters to accommodate this harder task setting, which improved model accuracy. In addition to testing 15 different configurations on the available test set, we manually created and annotated for morphemic structure a new sample of 800 words whose roots are absent from the training data (e.g. neologisms and recent loanwords); the new dataset is made available to the community. Evaluated on this sample, the CNN models performed much worse (an almost 30% drop in word accuracy). We classify the errors made by the best model on both the standard test set and the new one.
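Word accuracy, the metric behind the reported 30% drop, counts a word as correct only if its entire morpheme segmentation is predicted exactly; a single boundary or label error anywhere makes the whole word wrong. A minimal sketch with invented segmentations (English examples for readability):

```python
# Sketch of word-level accuracy for morpheme segmentation: a word
# counts as correct only when its full segmentation matches the gold.
def word_accuracy(gold, pred):
    """Fraction of words whose predicted morpheme list equals the gold
    morpheme list exactly."""
    correct = sum(g == p for g, p in zip(gold, pred))
    return correct / len(gold)

# Invented example: the second word is mis-segmented, so 2 of 3 match.
gold = [["un", "break", "able"], ["cat", "s"], ["walk", "ed"]]
pred = [["un", "break", "able"], ["cats"], ["walk", "ed"]]
acc = word_accuracy(gold, pred)
```

This all-or-nothing criterion explains why models that look strong on per-morpheme measures can still lose much of their word accuracy on out-of-vocabulary roots.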
In this paper, we study deep learning methods for a new multiclass text classification problem: identifying user interests from text messages. We used an original dataset of almost 90 thousand forum messages labeled with ten interest categories. We experimented with modern neural network architectures, both recurrent and convolutional, as well as simpler feedforward networks, and evaluated classification accuracy across architectures, text representations, and hyperparameter settings.
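The simplest text representation typically compared in such experiments is a bag-of-words count vector over a fixed vocabulary, which feedforward baselines consume directly. A minimal standard-library sketch with an invented toy vocabulary (not the paper's actual feature set):

```python
# Sketch of a bag-of-words representation: map a message to a
# fixed-length count vector over a fixed vocabulary. The vocabulary
# below is invented for illustration.
from collections import Counter

def bag_of_words(message, vocab):
    """Count vocabulary tokens in a whitespace-tokenized, lowercased
    message; out-of-vocabulary tokens are ignored."""
    counts = Counter(message.lower().split())
    return [counts[w] for w in vocab]

vocab = ["game", "music", "travel", "code"]
vec = bag_of_words("I love music and more music", vocab)  # [0, 2, 0, 0]
```

Recurrent and convolutional models instead operate on sequences of word embeddings, which is why the choice of text representation is evaluated jointly with the architecture.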