Empirical Study of Transformers for Source Code

N. Chirkova; S. Troshin

doi:10.1145/3468264.3468611

P. 703–715.

Chirkova N., Troshin S.

Initially developed for natural language processing (NLP), Transformers are now widely used for source code processing, due to the format similarity between source code and text. In contrast to natural language, source code is strictly structured, i.e., it follows the syntax of the programming language. Several recent works develop Transformer modifications for capturing syntactic information in source code. A drawback of these works is that they are not compared with one another and are evaluated on different tasks. In this work, we conduct a thorough empirical study of the capabilities of Transformers to utilize syntactic information in different tasks. We consider three tasks (code completion, function naming and bug fixing) and re-implement different syntax-capturing modifications in a unified framework. We show that Transformers are able to make meaningful predictions based purely on syntactic information, and we identify best practices for taking syntactic information into account to improve model performance.

Language: English

Keywords: Transformer, source code processing

In book

ESEC/FSE 2021: Proceedings of the 29th ACM Joint Meeting on European Software Engineering Conference and Symposium on the Foundations of Software Engineering

Association for Computing Machinery (ACM), 2021.

A Simple Approach for Handling Out-of-Vocabulary Identifiers in Deep Learning for Source Code

Chirkova N., Troshin S., in: 2021 Annual Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (NAACL 2021). Association for Computational Linguistics, 2021. P. 278–288.

Added: August 31, 2021

Empirical Study of Transformers for Source Code

Chirkova N., Troshin S. / Series arXiv "CS". 2020.

Added: October 19, 2020

On the Embeddings of Variables in Recurrent Neural Networks for Source Code

Chirkova N., in: 2021 Annual Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (NAACL 2021). Association for Computational Linguistics, 2021. P. 2679–2689.

Source code processing heavily relies on the methods widely used in natural language processing (NLP), but involves specifics that need to be taken into account to achieve higher quality. An example of this specificity is that the semantics of a variable is defined not only by its name but also by the contexts in which ...

Added: August 31, 2021

LIORI at the FinCausal 2020 Shared task

Gordeev D., Davletov A., Rey A. et al., in: Proceedings of the 1st Joint Workshop on Financial Narrative Processing and MultiLing Financial Summarisation. COLING, 2020. P. 45–49.

In this paper, we describe the results of team LIORI at the FinCausal 2020 Shared task held as a part of the 1st Joint Workshop on Financial Narrative Processing and MultiLingual Financial Summarisation. The shared task consisted of two subtasks: classifying whether a sentence contains any causality and labelling phrases that indicate causes and consequences. ...

Added: December 7, 2020

Gorynych Transformer at SemEval-2020 Task 6: Multi-task Learning for Definition Extraction

Davletov A., Arefyev N., Shatilov A. et al., in: Proceedings of the 14th International Workshop on Semantic Evaluation (SemEval-2020). Association for Computational Linguistics, 2020. P. 487–493.

This paper describes our approach to the “DeftEval: Extracting Definitions from Free Text in Textbooks” competition, held as part of SemEval 2020. The task was devoted to finding and labeling definitions in texts. DeftEval was split into three subtasks: sentence classification, sequence labeling and relation classification. Our solution ranked 5th in the first subtask and ...

Added: December 7, 2020