Semantic Clustering of Russian Web Search Results: Possibilities and Problems
The present paper deals with word sense induction from lexical co-occurrence graphs. We construct such graphs on large Russian corpora and then apply the data to cluster the results of Mail.ru search according to meanings in the query. We compare different methods of performing such clustering and different source corpora. Models of applying distributional semantics to big linguistic data are described.
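As a rough illustration of word sense induction from a co-occurrence graph, the sketch below builds a graph over the words that co-occur with an ambiguous query word, treats each connected component as one induced sense, and assigns a search snippet to the sense it overlaps most. This is a minimal sketch under stated assumptions: the abstract does not say which clustering algorithm is used, so connected components stand in for it, and the function names, the toy English contexts and the `min_count` parameter are all hypothetical.

```python
from collections import defaultdict
from itertools import combinations


def cooccurrence_graph(contexts, target, min_count=1):
    """Build an undirected graph over words that co-occur with `target`.

    `contexts` is a list of token lists, each assumed to contain `target`.
    The target itself is left out so that its senses fall apart into clusters.
    """
    counts = defaultdict(int)
    for tokens in contexts:
        words = sorted(set(tokens) - {target})
        for a, b in combinations(words, 2):
            counts[(a, b)] += 1
    graph = defaultdict(set)
    for (a, b), n in counts.items():
        if n >= min_count:
            graph[a].add(b)
            graph[b].add(a)
    return graph


def induce_senses(graph):
    """Treat every connected component of the graph as one induced sense."""
    seen, senses = set(), []
    for start in sorted(graph):
        if start in seen:
            continue
        component, stack = set(), [start]
        while stack:
            node = stack.pop()
            if node in component:
                continue
            component.add(node)
            stack.extend(graph[node] - component)
        seen |= component
        senses.append(component)
    return senses


def assign_sense(snippet_tokens, senses):
    """Return the index of the sense sharing most words with the snippet."""
    overlaps = [len(set(snippet_tokens) & sense) for sense in senses]
    best = max(range(len(senses)), key=overlaps.__getitem__)
    return best if overlaps[best] > 0 else None


# Toy corpus contexts for the ambiguous query "jaguar" (illustrative only).
contexts = [
    ["jaguar", "car", "engine"],
    ["jaguar", "car", "dealer"],
    ["jaguar", "cat", "jungle"],
    ["jaguar", "jungle", "prey"],
]
senses = induce_senses(cooccurrence_graph(contexts, "jaguar"))
car_sense = assign_sense(["jaguar", "engine", "price"], senses)
animal_sense = assign_sense(["prey", "in", "the", "jungle"], senses)
```

On a real corpus the graph would be far denser, so some edge weighting or thresholding (here only the crude `min_count`) would be needed before any clustering step produces usable senses.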
This volume contains the papers selected for presentation at the 2014 IEEE/WIC/ACM International Conference on Web Intelligence (WI'14), held as part of the 2014 Web Intelligence Congress (WIC'14) at the University of Warsaw, Warsaw, Poland, from 11 to 14 August 2014. The conference was sponsored and co-organized by the IEEE Computer Society, the Web Intelligence Consortium (WIC), the Association for Computing Machinery (ACM), the University of Warsaw, the Polish Mathematical Society and the Warsaw University of Technology.
The series of Web Intelligence conferences was started in Japan in 2001. Since then, it has been held yearly in several countries, including Canada, China, France, USA, Australia and Italy. It is recognized as the world's leading forum focusing on the role of Web Intelligence as one of the most important directions for scientific research and for the development of solutions that contribute to the creation of the Knowledge-based Society. In 2014, WI visited Poland as a special event commemorating the 25th anniversary of the Web.
WI'14 received 242 paper submissions, in the areas of foundations of Web Intelligence, semantic aspects of Web Intelligence, World Wide Wisdom Web, Web search and recommendation, Web mining and warehousing, Human-Web interaction, as well as Web Intelligence technologies and applications. After a rigorous evaluation process, 85 papers were selected as regular contributions, giving an acceptance rate of 35.1%.
The first five sections of this volume include 40 regular contributions. Additionally, the first paper in the first section corresponds to one of WIC'14 keynotes. The last four sections of this volume contain 23 papers selected for oral presentations in WI'14 workshops. The remaining 45 regular contributions and 25 papers accepted to WI'14 special sessions are published in another volume of WI’14 proceedings.
Formal Concept Analysis (FCA) is a mathematical technique that has been extensively applied to Boolean data in knowledge discovery, information retrieval, web mining and other applications. In recent years, research on extending FCA theory to cope with imprecise and incomplete information has made significant progress. In this paper, we give a systematic overview of the more than 120 papers published between 2003 and 2011 on FCA with fuzzy attributes and rough FCA. We applied traditional FCA as a text-mining instrument to 1072 papers mentioning FCA in the abstract. These papers were stored as PDF files, and using a thesaurus with terms referring to research topics, we transformed them into concept lattices. These lattices were used to analyze and explore the most prominent research topics within the FCA with fuzzy attributes and rough FCA research communities. FCA turned out to be an ideal metatechnique for representing large volumes of unstructured texts.
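To make the step from a Boolean paper-by-topic table to a concept lattice concrete, the sketch below enumerates the formal concepts of a small context. It is a minimal sketch, not the authors' tooling: it uses the naive approach of closing the object intents under intersection, and the function name `formal_concepts`, the paper identifiers and the topic terms are hypothetical.

```python
def formal_concepts(context):
    """Enumerate all formal concepts of a Boolean context.

    `context` maps each object (e.g. a paper) to its set of attributes
    (e.g. thesaurus topics). Every intent is an intersection of object
    intents, or the full attribute set, so we close under intersection.
    """
    attributes = set().union(*context.values())
    intents = {frozenset(attributes)}
    for obj_attrs in context.values():
        intents |= {frozenset(obj_attrs) & intent for intent in intents}

    concepts = []
    for intent in intents:
        # The extent is every object that has all attributes of the intent.
        extent = frozenset(o for o, attrs in context.items() if intent <= attrs)
        concepts.append((extent, intent))
    # Most general concepts (smallest intents) first.
    return sorted(concepts, key=lambda c: (len(c[1]), sorted(c[1])))


# Toy context: three papers tagged with thesaurus topics (illustrative only).
papers = {
    "paper1": {"fuzzy", "lattice"},
    "paper2": {"rough", "lattice"},
    "paper3": {"fuzzy", "rough", "lattice"},
}
concepts = formal_concepts(papers)
for extent, intent in concepts:
    print(sorted(extent), sorted(intent))
```

Ordering the resulting concepts by inclusion of their extents gives the lattice itself. The naive enumeration is fine for a toy context but grows quickly with the number of objects, so a dedicated algorithm would be needed at the scale of a thousand papers.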
Models for effective term extraction can depend on the type of terminological resource under construction. In this paper we study term extraction models for real working information-retrieval thesauri. The first thesaurus is the English version of the EuroVoc thesaurus; the second is the Russian Banking thesaurus. We study single-word and two-word term extraction separately to reveal the best features and feature combinations, and compare the best models for the two thesauri. In particular, we found that for this type of terminological resource the use of association measures does not improve the quality of two-word term extraction based on combining multiple features.
The goal of the expert search task is to find knowledgeable persons within an enterprise. In this paper we focus on its distinctions from other information retrieval tasks. We review the existing approaches and propose a new term weighting scheme based on an analysis of communication patterns between people. The effectiveness of the proposed approach is evaluated on a collection of e-mails from an organization of approximately 1500 people. Results show that it is possible to take communication structure into account in the process of term weighting, effectively combining communication-based and document-based approaches to expert finding.
This book constitutes the thoroughly refereed proceedings of the 8th Russian Summer School on Information Retrieval, RuSSIR 2014, held in Nizhniy Novgorod, Russia, in August 2014.
The 14 papers presented were selected from various submissions. The papers focus on visualization for information retrieval along with other topics related to information retrieval.
Web spam is one of the key problems facing modern web search engines. In this paper we investigate the efficiency of various dimensionality reduction methods applied to the spam classifier of the go.mail.ru search system. Effective use of such techniques can significantly increase the number of features and the quality of the classifier without loss of training and classification speed. We conducted a series of experiments with the PCA (Principal Component Analysis) and RP (Random Projection) dimensionality reduction methods. Unfortunately, these methods proved ineffective for this task, mainly because the feature space is low-dimensional. However, this experiment revealed the need for a detailed analysis of the features participating in the training process. For this analysis, we chose the MRMR (Minimum Redundancy Maximum Relevance) criterion. Applying this criterion allowed us to detect redundant features and to estimate the efficiency of each feature participating in the training process. This research allowed us to significantly increase the quality of our web spam classifier without increasing the number of features. The paper demonstrates the efficiency of feature selection criteria in practice and once again emphasizes the importance of a detailed analysis of the data and of the informative features selected for training.
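The Minimum Redundancy Maximum Relevance idea can be sketched as a greedy loop: at each step, pick the feature whose mutual information with the class label is high and whose average mutual information with the already selected features is low. The code below is a minimal sketch under stated assumptions, not the classifier pipeline from the paper: it uses the difference form of the criterion on discrete features, and the feature names and toy values are hypothetical.

```python
from collections import Counter
from math import log2


def mutual_information(xs, ys):
    """Mutual information (in bits) between two discrete sequences."""
    n = len(xs)
    px, py, pxy = Counter(xs), Counter(ys), Counter(zip(xs, ys))
    return sum(
        (c / n) * log2(c * n / (px[x] * py[y])) for (x, y), c in pxy.items()
    )


def mrmr(features, labels, k):
    """Greedily select k feature names by relevance minus mean redundancy.

    `features` maps a feature name to its list of discrete values.
    """
    relevance = {f: mutual_information(v, labels) for f, v in features.items()}
    selected, remaining = [], sorted(features)

    def score(f):
        if not selected:
            return relevance[f]
        redundancy = sum(
            mutual_information(features[f], features[s]) for s in selected
        ) / len(selected)
        return relevance[f] - redundancy

    while remaining and len(selected) < k:
        best = max(remaining, key=score)
        selected.append(best)
        remaining.remove(best)
    return selected


# Toy spam data: 1 = spam, 0 = not spam (illustrative values only).
labels = [1, 1, 1, 1, 0, 0, 0, 0]
features = {
    "link_ratio": [1, 1, 1, 1, 0, 0, 0, 1],
    "link_ratio_copy": [1, 1, 1, 1, 0, 0, 0, 1],  # exact duplicate, fully redundant
    "keyword_density": [1, 1, 0, 1, 0, 0, 1, 0],
}
chosen = mrmr(features, labels, 2)
```

In this toy run the duplicated feature is as relevant as the original, but its redundancy with the first pick outweighs that relevance, so the weaker yet complementary feature is chosen second. Continuous features would need discretization or a different mutual information estimator.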
Proceedings of the 9th International Symposium on Intelligent Distributed Computing – IDC'2015, Guimarães, Portugal, October 2015