Semantic Feature Aggregation for Gender Identification in Russian Facebook
The goal of the current work is to evaluate semantic feature aggregation techniques for gender classification of public social media texts in Russian. We collect Facebook posts of Russian-speaking users and use them as a dataset for two topic modelling techniques and a distributional clustering approach. The output of these algorithms serves as a feature aggregation method for gender classification on a smaller Facebook sample. The classification performance of the best model compares favorably against the lemma baseline and against state-of-the-art results reported for other genres and languages. We exemplify the most successful features and discuss how the three techniques differ in classification performance and feature content, with the best technique clearly outperforming the others.
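The aggregation step can be illustrated with a minimal sketch. The lemma-to-cluster mapping below is invented for illustration; in the study such mappings come from topic modelling or distributional clustering. The idea is simply to collapse sparse lemma counts into dense per-cluster counts that serve as classifier features.

```python
from collections import Counter

# Hypothetical mapping from lemmas to semantic clusters
# (in practice produced by a topic model or distributional clustering).
lemma_to_topic = {
    "car": "vehicles", "engine": "vehicles",
    "dress": "fashion", "shoes": "fashion",
}

def aggregate_features(lemmas):
    """Collapse a bag of lemmas into per-cluster counts used as features."""
    topics = Counter()
    for lemma in lemmas:
        if lemma in lemma_to_topic:
            topics[lemma_to_topic[lemma]] += 1
    return dict(topics)

post = ["car", "engine", "dress", "car"]
features = aggregate_features(post)
# features == {"vehicles": 3, "fashion": 1}
```

The resulting feature vectors are far lower-dimensional than raw lemma counts, which is what makes classification on a small labelled sample feasible.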
This study explores the relationship between the Internet and the Russian national elections of 2011-2012. In contrast to other studies, we focus on the blogosphere as a political factor. Our conclusions are based on a study of the LiveJournal blogging platform, represented by a sample of political posts from the top 2,000 bloggers over 13 week-long periods. Sampling from the population of about 180,000 posts was performed automatically with a topic modelling algorithm, while the analysis of the resulting 3,690 texts was carried out manually by five coders. We found that the most influential Russian blogs perform the role of a media “stronghold” of the political opposition. Moreover, we established a relationship between the weekly pre-election ratings of the opposition parties and presidential candidates and the indicators of political activity in the blogosphere. Our results cautiously suggest that political activity on the Internet is not simply an online projection of offline political activity: it can itself provoke activity in offline political life.
Distributed vector representations of natural language vocabulary have attracted much attention in contemporary computational linguistics. This paper summarizes our experience of applying neural network language models to the task of calculating semantic similarity for Russian. The experiments were performed in the course of the Russian Semantic Similarity Evaluation track, where our models took 2nd to 5th positions, depending on the task. We introduce the tools and corpora used, comment on the nature of the evaluation track, and describe the achieved results. We found that Continuous Skip-gram and Continuous Bag-of-Words models, previously applied successfully to English material, can be used for semantic modeling of Russian as well. Moreover, we show that texts in the Russian National Corpus (RNC) provide excellent training material for such models, outperforming other, much larger corpora. This is especially true for semantic relatedness tasks (although stacking models trained on larger corpora on top of RNC models improves performance even further). High-quality semantic vectors learned in this way can be used in a variety of linguistic tasks and promise an exciting field for further study.
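The core operation in such similarity evaluations is ranking candidate words by cosine similarity between their embeddings. A minimal sketch with toy three-dimensional vectors (real Skip-gram/CBOW vectors have hundreds of dimensions and are learned from a corpus; the words and values here are invented):

```python
import math

def cosine(u, v):
    """Cosine similarity between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

# Toy "embeddings" for illustration only.
vectors = {
    "king":  [0.9, 0.8, 0.1],
    "queen": [0.85, 0.75, 0.2],
    "apple": [0.1, 0.2, 0.9],
}

def most_similar(word):
    """Return the nearest neighbour of `word` by cosine similarity."""
    return max((w for w in vectors if w != word),
               key=lambda w: cosine(vectors[word], vectors[w]))
```

Here `most_similar("king")` returns `"queen"`, because its vector points in nearly the same direction, while `"apple"` is nearly orthogonal.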
The paper considers the problem of visual cluster analysis for multidimensional textual datasets. To analyze clusters in the original data space, elastic maps are used as a method of mapping the original data points onto embedded manifolds of lower dimensionality. By decreasing the elasticity parameters, one can construct a map surface that approximates the multidimensional textual dataset in question much more closely. The elastic map approach requires no a priori information about the data and does not depend on the data's nature or origin. The probabilistic algorithm t-SNE (t-distributed stochastic neighbor embedding) has similar properties and is conceptually close to elastic maps. The paper describes the results of applying both approaches (elastic maps and t-SNE) to visual cluster analysis of multidimensional textual datasets. For elastic maps, a «Quasi-Zoom» technique is proposed, which improves the results of cluster analysis in regions where data points concentrate. The presented results illustrate the efficiency and applicability of both approaches to cluster analysis of natural language terms.
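The first step of t-SNE, converting high-dimensional distances into pairwise affinities, can be sketched as follows. This is a simplified illustration, not the paper's implementation: real t-SNE chooses a per-point bandwidth via a perplexity search, while here a single fixed `sigma` is used for brevity.

```python
import math

def gaussian_affinities(points, sigma=1.0):
    """Symmetrized high-dimensional similarities used as input to t-SNE:
    p_ij proportional to exp(-||x_i - x_j||^2 / (2 * sigma^2)),
    normalized so that all affinities sum to 1."""
    n = len(points)
    p = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            if i != j:
                d2 = sum((a - b) ** 2 for a, b in zip(points[i], points[j]))
                p[i][j] = math.exp(-d2 / (2 * sigma ** 2))
    total = sum(sum(row) for row in p)
    return [[v / total for v in row] for row in p]
```

Nearby points receive much larger affinities than distant ones; the low-dimensional embedding is then optimized so that a Student-t distribution over map distances matches these affinities.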
The present paper deals with word sense induction from lexical co-occurrence graphs. We construct such graphs on large Russian corpora and then apply the data to cluster the results of Mail.ru search according to the meanings of the query. We compare different methods of performing such clustering and different source corpora, and describe models for applying distributional semantics to big linguistic data.
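The graph-based induction idea can be sketched in a few lines. This is a deliberately simplified illustration (real systems weight edges and use more robust graph clustering): words that co-occur with an ambiguous target are linked, and the connected components of the resulting context graph approximate the target's senses.

```python
from collections import defaultdict
from itertools import combinations

def cooccurrence_graph(sentences, target):
    """Link words that co-occur with `target` within the same sentence."""
    graph = defaultdict(set)
    for sent in sentences:
        if target in sent:
            context = [w for w in sent if w != target]
            for a, b in combinations(set(context), 2):
                graph[a].add(b)
                graph[b].add(a)
    return graph

def connected_components(graph):
    """Each component of the context graph approximates one induced sense."""
    seen, components = set(), []
    for node in graph:
        if node in seen:
            continue
        stack, comp = [node], set()
        while stack:
            cur = stack.pop()
            if cur in comp:
                continue
            comp.add(cur)
            stack.extend(graph[cur] - comp)
        seen |= comp
        components.append(comp)
    return components
```

For a toy corpus where "jaguar" appears once with {fast, car} and once with {jungle, cat}, the graph splits into two components, one per sense, and search results can then be grouped by which component their context words fall into.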
In natural language processing, distributional semantic models are known as an efficient data-driven approach to word and text representation, which computes word meanings directly from large text corpora in the form of embeddings in a vector space. This paper addresses the role of linguistic preprocessing in enhancing the performance of distributional models, and particularly studies pronominal anaphora resolution as a way to exploit more co-occurrence data without directly increasing the size of the training corpus. We replace three different types of anaphoric pronouns with their antecedents in the training corpus and evaluate the extent to which this affects the performance of the resulting models in lexical similarity tasks. CBOW and SkipGram distributed models trained on the Russian National Corpus are the focus of our research, although the results are potentially applicable to other distributional semantic frameworks and languages as well. The trained models are evaluated against the RUSSE '15 and SimLex-999 gold-standard datasets. We find that models trained on corpora with pronominal anaphora resolved perform significantly better than their counterparts trained on the baseline corpora.
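The preprocessing idea is simple to sketch: before training, each anaphoric pronoun is replaced by its antecedent, so the antecedent word gains extra co-occurrence contexts. In this illustration (English example, hand-specified antecedent link) the resolver's output is modelled as a position-to-antecedent mapping; the paper uses an actual anaphora resolver on Russian text.

```python
def resolve_anaphora(tokens, antecedents):
    """Replace pronouns with their antecedents in a tokenized sentence.
    `antecedents` maps token positions of pronouns to antecedent lemmas."""
    return [antecedents.get(i, tok) for i, tok in enumerate(tokens)]

tokens = ["the", "cat", "slept", "because", "it", "was", "tired"]
resolved = resolve_anaphora(tokens, {4: "cat"})
# resolved == ["the", "cat", "slept", "because", "cat", "was", "tired"]
```

After this substitution, "cat" co-occurs with "tired" within a typical context window, a pairing the embedding model would otherwise never observe from this sentence.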
Purpose – The paper addresses the question of what drives the formation of latent discussion communities, if any, in the blogosphere: the topical composition of posts or their authorship? The purpose of this paper is to contribute to knowledge about the structure of co-commenting.
Design/methodology/approach – The research is based on a dataset of 17,386 full-text posts written by the top 2,000 LiveJournal bloggers and over 520,000 comments that result in about 4.5 million edges in the network of co-commenting, where posts are vertices. The Louvain algorithm is used to detect communities of co-commenting. Cosine similarity and topic modeling based on latent Dirichlet allocation are applied to study topical coherence within these communities.
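The construction of the co-commenting network can be sketched as follows. This is an illustrative simplification with invented user and post identifiers: posts are vertices, and two posts are linked with a weight equal to the number of commenters they share; community detection (e.g. Louvain) is then run on this weighted graph.

```python
from collections import defaultdict
from itertools import combinations

def cocommenting_edges(comments):
    """Build weighted co-commenting edges from (commenter, post_id) pairs.
    Two posts gain one unit of edge weight for every commenter they share."""
    posts_by_user = defaultdict(set)
    for user, post in comments:
        posts_by_user[user].add(post)
    weights = defaultdict(int)
    for posts in posts_by_user.values():
        for a, b in combinations(sorted(posts), 2):
            weights[(a, b)] += 1
    return dict(weights)

comments = [("u1", "p1"), ("u1", "p2"),
            ("u2", "p1"), ("u2", "p2"),
            ("u3", "p3")]
edges = cocommenting_edges(comments)
# edges == {("p1", "p2"): 2}
```

Note how the number of edges grows quadratically in the number of posts each user comments on, which is why half a million comments can yield millions of edges.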
Findings – Bloggers unite into moderately manifest communities by commenting on roughly the same sets of posts. The graph of co-commenting is sparse and connected by a minority of active non-top commenters. Communities are centered mainly around blog authors as opinion leaders and, to a lesser extent, around a shared topic or topics.
Research limitations/implications – The research has to be replicated on other datasets with more thorough hand coding to ensure the reliability of results and to reveal average proportions of topic-centered communities.
Practical implications – Knowledge about the factors around which co-commenting communities emerge, in particular the clustered opinion leaders that often attract such communities, can be used by policy makers in marketing and/or political campaigning when individual leadership is not enough or not applicable.
Originality/value – The research contributes to the social studies of online communities. It is the first study of communities based on co-commenting that combines examination of the content of commented posts and their topics.