Automatic Linguistic Annotation of Chinese Texts Containing Loanwords: Word Segmentation, Transcription, PoS-Tagging

Pp. 1081–1094.

Konovalova A., Volf E., Semenov K. I., Korotkova Y.

The article addresses the problems of linguistic annotation of the Chinese texts in the Russian-Chinese Parallel Corpus of the RNC (hereafter, our corpus) and ways to solve them. Particular attention is paid to Russian loanwords in the texts: they are abundant in our corpus and are of interest as instances of both the out-of-vocabulary and the code-switching problems. We describe our experiments in three areas: word segmentation, grapheme-to-phoneme conversion, and PoS-tagging. To test the algorithms on our specific data, we created our own datasets based on the corpus, which can be valuable for further research on processing non-standard Chinese texts. Since the main aim of the research is to improve the quality of annotation in our corpus, we plan to integrate the results of our work into the preprocessing pipeline for new texts in the corpus.

Language: Russian

Keywords: Chinese word segmentation; grapheme-to-phoneme conversion (G2P); PoS-tagging; out-of-vocabulary (OOV) problem; code-switching detection; automatic segmentation; automatic transcription; morphological annotation

In book

Computational Linguistics and Intellectual Technologies: Papers from the Annual International Conference “Dialogue” (2021)

Issue 20: Main volume. 2021.