Unsupervised learning of general-purpose embeddings for code changes

Pravilov M.; Bogomolov E.; Golubev Y.; Bryksin T.

doi:10.1145/3472674.3473979

Ch. 171275. P. 7–12.

Applying machine learning to tasks that operate on code changes requires a numerical representation of those changes. In this work, we propose an approach for obtaining such representations during pre-training and evaluate them on two downstream tasks: applying changes to code and commit message generation. During pre-training, the model learns to apply a given code change correctly. This task requires only the code changes themselves, which makes it unsupervised. In the task of applying code changes, our model outperforms baseline models by 5.9 percentage points in accuracy. In commit message generation, our model matches the results of supervised models trained specifically for that task, which indicates that it encodes code changes well and could be improved further by pre-training on a larger dataset of easily gathered code changes. © 2021 ACM.

Keywords: unsupervised learning, code changes, commit message generation

In book

MaLTESQuE 2021: Proceedings of the 5th International Workshop on Machine Learning Techniques for Software Quality Evolution

ACM, 2021.

Topological Metric for Unsupervised Embedding Quality Evaluation

Shestov A., Klenitskiy A., Denisova D. et al., in: Advances in Information Retrieval: 48th European Conference on Information Retrieval, ECIR 2026, Delft, The Netherlands, March 29 – April 2, 2026, Proceedings, Part II (LNCS, volume 16484). Cham: Springer Publishing Company, 2026. P. 596–605.

Modern representation learning increasingly relies on unsupervised and self-supervised methods trained on large-scale unlabeled data. While these approaches achieve impressive generalization across tasks and domains, evaluating embedding quality without labels remains an open challenge. In this work, we propose Persistence, a topology-aware metric based on persistent homology that quantifies the geometric structure and topological richness ...

Added: June 18, 2026

Learning to hear broken motors: Signature-guided data augmentation for induction motor diagnostics

Ali S., Khizhik A., Svirin S. et al., Engineering Applications of Artificial Intelligence 2025 Vol. 170 Article 114137

The application of machine learning algorithms in the intelligent diagnosis of three-phase engines has the potential to significantly enhance diagnostic performance and accuracy. Traditional methods largely rely on signature analysis, which, despite being a standard practice, can benefit from the integration of advanced machine learning techniques. In our study, we innovate by combining machine learning ...

Added: February 16, 2026

From Patterns to Predictions: A Shapelet-Based Framework for Directional Forecasting in Noisy Financial Markets

Kim J., Lee H., Jeon H. et al., in: CIKM '25: Proceedings of the 34th ACM International Conference on Information and Knowledge Management. ACM, 2025. P. 1344–1353.

Directional forecasting in financial markets requires both accuracy and interpretability. Before the advent of deep learning, interpretable approaches based on human-defined patterns were prevalent, but their structural vagueness and scale ambiguity hindered generalization. In contrast, deep learning models can effectively capture complex dynamics, yet often offer limited transparency. To bridge this gap, we propose a ...

Added: November 21, 2025

Leveraging Recursive Gumbel-Max Trick for Approximate Inference in Combinatorial Spaces

Struminsky K., Gadetsky A., Rakitin D. et al., in: Advances in Neural Information Processing Systems 34 (NeurIPS 2021). Curran Associates, Inc., 2021. P. 10999–11011.

Structured latent variables allow incorporating meaningful prior knowledge into deep learning models. However, learning with such variables remains challenging because of their discrete nature. Nowadays, the standard learning approach is to define a latent variable as a perturbed algorithm output and to use a differentiable surrogate for training. In general, the surrogate puts additional constraints ...

Added: March 14, 2022

Formal Concept Analysis: 16th International Conference, ICFCA 2021, Strasbourg, France, June 29 – July 2, 2021, Proceedings

Springer, 2021.

This book constitutes the proceedings of the 16th International Conference on Formal Concept Analysis, ICFCA 2021, held in Strasbourg, France, in June/July 2021. The 14 full papers and 5 short papers presented in this volume were carefully reviewed and selected from 32 submissions. The book also contains four invited contributions in full paper length. The research part ...

Added: July 10, 2021

A density-based statistical analysis of graph clustering algorithm performance

Miasnikof P., Shestopaloff A. Y., Bonner A. J. et al., Journal of Complex Networks 2020 Vol. 8 No. 3 P. 1–33

We introduce graph clustering quality measures based on comparisons of global, intra- and inter-cluster densities, an accompanying statistical significance test and a step-by-step routine for clustering quality assessment. Our work is centred on the idea that well-clustered graphs will display a mean intra-cluster density that is higher than global density and mean inter-cluster density. We ...

Added: August 4, 2020

A Simple Method to Evaluate Support Size and Non-uniformity of a Decoder-Based Generative Model

Struminsky K., Vetrov D., Lecture Notes in Computer Science 2019 Vol. 11832 P. 81–93

Theoretical analysis in [1] suggested that adversarially trained generative models are naturally inclined to learn distribution with low support. In particular, this effect is caused by the limited capacity of the discriminator network. To verify this claim, [2] proposed a statistical test based on the birthday paradox that partially confirmed the analysis. In this paper, ...

Added: April 23, 2020

Variational Autoencoder with Arbitrary Conditioning

Vetrov D., Ivanov O., in: Proceedings of the 7th International Conference on Learning Representations (ICLR 2019). ICLR, 2019. P. 1–25.

We propose a single neural probabilistic model based on variational autoencoder that can be conditioned on an arbitrary subset of observed features and then sample the remaining features in "one shot". The features may be both real-valued and categorical. Training of the model is performed by stochastic variational Bayes. The experimental evaluation on synthetic data, ...

Added: March 13, 2020

Towards Automatic Manipulation of Arbitrary Structures in Connectivist Paradigm with Tensor Product Variable Binding

Demidovskij A., in: Advances in Neural Computation, Machine Learning, and Cognitive Research III. Springer, 2020. P. 375–383.

Building a bridge between the symbolic and connectionist levels of computation requires constructing a full pipeline that accepts symbolic structures as input, translates them into a distributed representation, performs manipulations on this representation that are equivalent to symbolic manipulations, and translates the result back into a symbolic structure. This work proposes a neural architecture that is capable of joining two ...

Added: October 27, 2019

Использование метода главных компонент для анализа надежности цепей поставок [Using principal component analysis to analyze supply chain reliability]

Kuznetsov V. O., Логистика и управление цепями поставок 2018 No. 4 (87) P. 27–33

One option for a more flexible approach to analyzing the reliability of supply chains is principal component analysis (PCA). With a large number of variables describing a supply chain, it is difficult to analyze the structure of the variables in two-dimensional space. Within the analysis of variable dependencies, PCA allows to ...

Added: November 29, 2018

Mining convex polygon patterns with Formal Concept Analysis

Belfodil A., Kuznetsov S., Robardet C. et al., in: Proceedings of the Twenty-Sixth International Joint Conference on Artificial Intelligence, IJCAI 2017, Melbourne, Australia, 19–25 August 2017. Melbourne: International Joint Conferences on Artificial Intelligence, 2017. P. 1425–1432.

Pattern mining is an important task in AI for eliciting hypotheses from the data. When it comes to spatial data, the geo-coordinates are often considered independently as two different attributes. Consequently, rectangular shapes are searched for. Such an arbitrary form is not able to capture interesting regions in general. We thus introduce convex polygons, a ...

Added: December 6, 2017

Устойчивый к шуму метод обучения вариационного автокодировщика [A noise-robust method for training a variational autoencoder]

Figurnov M., Struminsky K., Vetrov D., Интеллектуальные системы. Теория и приложения 2017 Vol. 21 No. 2 P. 90–109

Variational autoencoder (VAE) is a probabilistic unsupervised method that uses deep learning. We propose a robust approach to the training of VAE using a modified likelihood function. We propose and analyze two variational lower bound objectives. The effectiveness of the method is experimentally shown by artificially introducing noise objects. ...

Added: October 18, 2017

Лексическая сочетаемость как ключевой компонент в процессе формирования коммуникативной компетенции [Lexical collocability as a key component in the development of communicative competence]

Shemyakina V. I., in: Коммуникация в современном поликультурном мире: диалог культур: Сборник научно-практических трудов. Вып. 2. Moscow: Pearson Education Limited (Russian representative office), 2014. P. 568–579.

Improving English language teaching aims at developing communicative competence within integration processes. Collocations are essential for developing communicative competence. Collocations and different forms of unsupervised acquisition are compulsory components of IELTS preparation. ...

Added: March 5, 2015

On Hölder fields clustering

Cadre B., Paris Q., TEST 2012 Vol. 21 No. 2 P. 301–316

Based on n randomly drawn vectors in a Hilbert space, we study the k-means clustering scheme. Here, clustering is performed by computing the Voronoi partition associated with centers that minimize an empirical criterion, called distortion. The performance of the method is evaluated by comparing the theoretical distortion of empirical optimal centers to the theoretical optimal distortion. Our first ...

Added: December 20, 2014