Automatic Linguistic Annotation of Chinese Texts Containing Loanwords: Word Segmentation, Transcription, PoS Tagging
This article addresses the problems of linguistic annotation of the Chinese texts in the Russian-Chinese Parallel Corpus of the RNC (hereafter, our corpus) and ways to solve them. Particular attention is paid to Russian loanwords in the texts, which are abundant in our corpus and are of interest as instances of both the out-of-vocabulary and the code-switching problems. We describe our experiments on three tasks: word segmentation, grapheme-to-phoneme conversion, and PoS tagging. To test the algorithms on our specific data, we created our own datasets based on the corpus; these can be valuable for further research on processing non-standard Chinese texts. Since the main aim of the research is to improve the quality of annotation in our corpus, we plan to integrate the results into the preprocessing pipeline for new texts added to the corpus.