Automatic Summarization of Parental Chats in WhatsApp
Automatic text summarization is one of the main tasks of natural language processing (NLP); it consists in
creating a shorter version of the source text. As the amount of information people consume keeps growing,
the task of summarization receives more and more attention. There are two main approaches to automatic
text summarization: extractive and abstractive. The latter involves automatic creation of a
summary text that may contain words and phrases not present in the source. This approach usually requires
AI models, which creates a demand for large datasets labeled in a specific way. Despite significant advances
in the summarization of scientific and news articles, the methods and datasets applied to monologue documents
are not always suitable for dialogue summarization. Moreover, although a considerable number of
English-language summarization datasets exist, the number available in Russian is not yet sufficient.
The paper is devoted to the labeling
and description of a Russian-language dataset for group chat message summarization and to the fine-tuning
of models for abstractive summarization in Russian on a custom dialogue dataset. A parental chat with a teacher
in WhatsApp was used as material for the dataset. Manual labeling consisted in dividing
the entire group chat into separate dialogues, writing a summary, and adding topic labels for each of them. As a result,
a dataset was created that includes 616 dialogues with a total of 3380 messages. The ruT5, mT5 and RuGPT
models were selected for fine-tuning; the ruT5 and RuGPT models had been pre-trained on a Russian-language
dataset for automatic news summarization. The ROUGE-1, ROUGE-2, ROUGE-L, BLEU and BERTScore metrics were
used to evaluate the quality of the models. The ruT5 model fine-tuned on the custom dataset outperformed
the baseline model on all five metrics.
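
To make the labeling scheme and one of the metrics concrete, the following is a minimal Python sketch, not the authors' code. The `LabeledDialogue` class and its field names are hypothetical, the example dialogue is invented for illustration and is not taken from the dataset, and `rouge1_f1` is a simplified ROUGE-1 F1 that tokenizes on whitespace only (standard ROUGE implementations add their own tokenization and optional stemming).

```python
from collections import Counter
from dataclasses import dataclass, field
from typing import List


@dataclass
class LabeledDialogue:
    """One labeled unit: a dialogue cut from the group chat (hypothetical schema)."""
    messages: List[str]                               # chat messages in order
    summary: str                                      # manually written summary
    topics: List[str] = field(default_factory=list)   # manually assigned topic labels


def rouge1_f1(candidate: str, reference: str) -> float:
    """Simplified ROUGE-1 F1: clipped unigram overlap on whitespace tokens."""
    cand = Counter(candidate.lower().split())
    ref = Counter(reference.lower().split())
    # Counter intersection keeps the minimum count per token (clipping).
    overlap = sum((cand & ref).values())
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)


# Invented example, for illustration only.
example = LabeledDialogue(
    messages=[
        "Good evening, is the parent meeting still on Friday?",
        "Yes, Friday at 18:00 in room 12.",
    ],
    summary="The parent meeting is on Friday at 18:00 in room 12.",
    topics=["parent meeting"],
)

# Score a model-generated summary against the manual one.
score = rouge1_f1("The meeting is on Friday at 18:00.", example.summary)
```

In practice, an established metric library would be a safer choice than this hand-rolled function, since tokenization choices noticeably affect ROUGE scores for Russian text.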