Adapting the Graph2Vec Approach to Dependency Trees for NLP Tasks
Recent work on learning representations for graph structures has proposed methods both for representing the nodes and edges of large graphs and for representing graphs as a whole. This paper considers the popular graph2vec approach, which shows good results on ordinary graphs. In natural language processing, however, a graph structure called a dependency tree is often used to express the connections between words in a sentence. We show that the graph2vec approach applied to dependency trees is unsatisfactory, and that this is due to the Weisfeiler-Lehman (WL) kernel it relies on. In this paper, we propose an adaptation of this kernel for dependency trees, as well as three other kernels that take the specific features of dependency trees into account. The resulting vector representations can be used in NLP tasks where modeling syntax is important (e.g., authorship attribution, intent labeling, targeted sentiment analysis). Universal Dependencies treebanks were clustered to demonstrate the consistency and validity of the proposed tree representation methods.
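Since the abstract attributes graph2vec's weakness on dependency trees to the WL kernel, a minimal sketch of the standard WL relabeling step that graph2vec builds on may help make the mechanism concrete. The toy tree, part-of-speech labels, and function name below are illustrative assumptions, not taken from the paper; note that standard WL treats edges as undirected and ignores dependency relation types.

```python
def wl_relabel(adj, labels):
    """One WL iteration: each node's new label is its old label
    concatenated with the sorted multiset of its neighbors' labels."""
    new_labels = {}
    for node in adj:
        neighbor_labels = sorted(labels[n] for n in adj[node])
        new_labels[node] = labels[node] + "(" + ",".join(neighbor_labels) + ")"
    return new_labels

# Toy dependency tree for "She reads books":
# reads -> She (nsubj), reads -> books (obj), stored undirected as WL sees it.
adj = {"reads": ["She", "books"], "She": ["reads"], "books": ["reads"]}
labels = {"reads": "VERB", "She": "PRON", "books": "NOUN"}

refined = wl_relabel(adj, labels)
print(refined["reads"])  # -> VERB(NOUN,PRON)
```

In graph2vec, the refined labels collected over several such iterations serve as the "vocabulary" of rooted subgraphs fed to a doc2vec-style model; because the relabeling discards edge direction and relation labels, much of the syntactic information in a dependency tree is lost at this step.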