An Experimental Study of Term Extraction for Real Information-Retrieval Thesauri
Models for effective term extraction can depend on the type of terminological resource under construction. In this paper we study term extraction models for real, working information-retrieval thesauri. The first thesaurus is the English version of the EuroVoc thesaurus; the second is the Russian Banking thesaurus. We study single-word and two-word term extraction separately to reveal the best features and feature combinations, and compare the best models for the two thesauri. In particular, we found that for this type of terminological resource the use of association measures does not improve the quality of two-word term extraction based on combining multiple features.
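The abstract does not name the specific association measures used; as a rough illustration of the kind of measure typically considered for two-word candidates, the following minimal sketch computes pointwise mutual information (PMI) from corpus frequency counts. The candidate list and all counts below are hypothetical, not the paper's data.

import math
from collections import Counter

def pmi(bigram_count, w1_count, w2_count, total):
    """Pointwise mutual information of a two-word candidate:
    PMI = log2( P(w1,w2) / (P(w1) * P(w2)) ), estimated from counts."""
    p_joint = bigram_count / total
    p_w1 = w1_count / total
    p_w2 = w2_count / total
    return math.log2(p_joint / (p_w1 * p_w2))

# Hypothetical corpus counts for illustration only.
unigrams = Counter({"monetary": 120, "policy": 300, "central": 150, "bank": 200})
bigrams = Counter({("monetary", "policy"): 80, ("central", "bank"): 95})
total_tokens = 100_000

for (w1, w2), c in bigrams.items():
    print(w1, w2, round(pmi(c, unigrams[w1], unigrams[w2], total_tokens), 2))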
This volume contains the papers selected for presentation at the 2014 IEEE/WIC/ACM International Conference on Web Intelligence (WI'14), held as part of the 2014 Web Intelligence Congress (WIC'14) at the University of Warsaw, Warsaw, Poland, from 11 to 14 August 2014. The conference was sponsored and co-organized by the IEEE Computer Society, the Web Intelligence Consortium (WIC), the Association for Computing Machinery (ACM), the University of Warsaw, the Polish Mathematical Society, and the Warsaw University of Technology.
The series of Web Intelligence conferences was started in Japan in 2001. Since then, it has been held yearly in several countries, including Canada, China, France, the USA, Australia, and Italy. It is recognized as the world's leading forum on the role of Web Intelligence as one of the most important directions for scientific research and for the development of solutions that contribute to the creation of a knowledge-based society. In 2014, WI visited Poland as a special event commemorating the 25th anniversary of the Web.
WI'14 received 242 paper submissions, in the areas of foundations of Web Intelligence, semantic aspects of Web Intelligence, World Wide Wisdom Web, Web search and recommendation, Web mining and warehousing, Human-Web interaction, as well as Web Intelligence technologies and applications. After a rigorous evaluation process, 85 papers were selected as regular contributions, giving an acceptance rate of 35.1%.
The first five sections of this volume include 40 regular contributions. Additionally, the first paper in the first section corresponds to one of the WIC'14 keynotes. The last four sections of this volume contain 23 papers selected for oral presentation in WI'14 workshops. The remaining 45 regular contributions and 25 papers accepted to WI'14 special sessions are published in another volume of the WI'14 proceedings.
Formal Concept Analysis (FCA) is a mathematical technique that has been extensively applied to Boolean data in applications such as knowledge discovery, information retrieval, and web mining. In recent years, research on extending FCA theory to cope with imprecise and incomplete information has made significant progress. In this paper, we give a systematic overview of the more than 120 papers published between 2003 and 2011 on FCA with fuzzy attributes and rough FCA. We applied traditional FCA as a text-mining instrument to 1072 papers mentioning FCA in the abstract. These papers, available as PDF files, were transformed into concept lattices using a thesaurus of terms referring to research topics. The lattices were then used to analyze and explore the most prominent research topics within the FCA-with-fuzzy-attributes and rough-FCA research communities. FCA turned out to be an ideal meta-technique for representing large volumes of unstructured text.
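For readers unfamiliar with the basic FCA machinery behind such concept lattices, a minimal sketch of a formal context (papers as objects, thesaurus terms as attributes) and the derivation operators that yield formal concepts might look as follows; the toy incidence data are invented for illustration.

# Toy formal context: objects are papers, attributes are thesaurus terms.
context = {
    "paper1": {"fuzzy FCA", "information retrieval"},
    "paper2": {"fuzzy FCA", "rough sets"},
    "paper3": {"rough sets", "web mining"},
}

def intent(objects):
    """Attributes shared by all given objects (the ' operator on objects)."""
    sets = [context[o] for o in objects]
    return set.intersection(*sets) if sets else set()

def extent(attributes):
    """Objects having all given attributes (the ' operator on attributes)."""
    return {o for o, attrs in context.items() if attributes <= attrs}

# A formal concept is a pair (A, B) with extent(B) == A and intent(A) == B.
B = {"fuzzy FCA"}
A = extent(B)
print(A, intent(A))   # {'paper1', 'paper2'} {'fuzzy FCA'}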
The article describes the implementation of a service that automates the collection of structured information from unstructured web documents. The service provides a unified solution for a variety of data domains by means of an explicit ontological description of the task. In addition, no changes to the program code are required to increase the number of sources, because the information sources are also described by an ontology.
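The abstract gives no implementation details, but the core idea of describing sources declaratively so that the extraction code never changes can be sketched roughly as follows; the field names, patterns, and data are hypothetical and do not reflect the paper's actual ontology format.

import re

# Hypothetical declarative descriptions of two sources; adding a source
# means adding an entry here, not changing the extraction code below.
SOURCES = [
    {"name": "shop_a", "price_pattern": r"Price:\s*(\d+)"},
    {"name": "shop_b", "price_pattern": r"Cost\s*=\s*(\d+)"},
]

def extract_prices(pages):
    """Apply every source description to its page; this loop stays fixed."""
    results = {}
    for src in SOURCES:
        match = re.search(src["price_pattern"], pages.get(src["name"], ""))
        if match:
            results[src["name"]] = int(match.group(1))
    return results

print(extract_prices({"shop_a": "Price: 42", "shop_b": "Cost = 37"}))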
This book constitutes the thoroughly refereed proceedings of the 8th Russian Summer School on Information Retrieval, RuSSIR 2014, held in Nizhniy Novgorod, Russia, in August 2014.
The 14 papers presented were selected from various submissions. The papers focus on visualization for information retrieval along with other topics related to information retrieval.
The paper presents a framework for fast text analytics developed during the Texterra project. Texterra is a technology for multilingual text mining based on novel text processing methods that exploit knowledge extracted from user-generated content. It delivers a fast, scalable solution for text mining without expensive customization. Depending on the use case, Texterra can be utilized as a library, an extendable framework, or a scalable cloud-based service. This paper describes the details of the project, its use cases, and the results of evaluation for all developed tools. Texterra utilizes Wikipedia as a primary knowledge source to facilitate text mining in arbitrary documents (news, blogs, etc.). We mine the graph of Wikipedia's links to compute semantic relatedness between all concepts described in Wikipedia. As a result, we build a semantic graph with more than 5 million concepts. This graph is exploited to interpret the meanings and relationships of terms in text documents. Despite its large size, Wikipedia does not contain information about many domain-specific concepts. In order to increase the applicability of the technology, we developed several automatic knowledge extraction tools. These tools include systems for knowledge extraction from MediaWiki resources and Linked Data resources, as well as a system for knowledge-base extension with concepts described in arbitrary text documents using original information extraction techniques. In addition, the use of information from Wikipedia allows Texterra to be easily extended to support new natural languages. The paper presents an evaluation of Texterra applied to different text processing tasks (part-of-speech tagging, word sense disambiguation, keyword extraction, and sentiment analysis) for English and Russian.
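The abstract does not specify which relatedness formula is computed over the link graph; one widely used measure for Wikipedia concepts is the Milne-Witten normalized link distance, sketched below on an invented toy set of inlinks purely for illustration.

import math

# Toy inlink sets: which articles link to each concept (invented data).
inlinks = {
    "Bank": {"Finance", "Money", "Loan", "Economy"},
    "Credit": {"Finance", "Loan", "Debt"},
}
WIKI_SIZE = 5_000_000  # rough order of the number of concepts in the semantic graph

def relatedness(a, b):
    """Milne-Witten style relatedness: 1 minus the normalized link distance."""
    A, B = inlinks[a], inlinks[b]
    common = A & B
    if not common:
        return 0.0
    dist = (math.log(max(len(A), len(B))) - math.log(len(common))) / \
           (math.log(WIKI_SIZE) - math.log(min(len(A), len(B))))
    return max(0.0, 1.0 - dist)

print(round(relatedness("Bank", "Credit"), 3))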
Today web spam is one of the key problems of modern web search engines. In this paper we investigate the efficiency of various dimensionality reduction methods applied to the spam classifier of the go.mail.ru search system. Effective utilization of such techniques can significantly increase the number of features and the quality of the classifier without loss of training and classification speed. We have conducted a series of experiments with the PCA (Principal Component Analysis) and RP (Random Projection) dimensionality reduction methods. Unfortunately, these methods turn out to be ineffective for this task, basically because of the low-dimensional feature space. However, this experiment led to the need for a detailed analysis of the features participating in the training process. For this analysis, we have chosen the MRMR (Minimum Redundancy Maximum Relevance) criterion. Application of this criterion has allowed us to detect redundant features and to estimate the contribution of each feature participating in training. This research has allowed us to significantly increase the quality of our web spam classifier without increasing the number of features. This paper demonstrates the efficiency of feature selection criteria in practice, and once again emphasizes the importance of a detailed analysis of the data and of the informative features selected for training.
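The exact MRMR formulation used is not given in the abstract; a simplified greedy variant, using absolute Pearson correlation as a stand-in for the mutual-information estimates of the original criterion, is sketched below on synthetic data purely as an illustration of the relevance-minus-redundancy idea.

import numpy as np

def greedy_mrmr(X, y, k):
    """Greedily pick k features maximizing relevance to y minus the mean
    redundancy with already selected features (correlation-based proxy)."""
    n_features = X.shape[1]
    relevance = np.array([abs(np.corrcoef(X[:, j], y)[0, 1]) for j in range(n_features)])
    selected = [int(np.argmax(relevance))]
    while len(selected) < k:
        best_j, best_score = None, -np.inf
        for j in range(n_features):
            if j in selected:
                continue
            redundancy = np.mean([abs(np.corrcoef(X[:, j], X[:, s])[0, 1]) for s in selected])
            score = relevance[j] - redundancy
            if score > best_score:
                best_j, best_score = j, score
        selected.append(best_j)
    return selected

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 10))
y = (X[:, 0] + 0.5 * X[:, 3] > 0).astype(float)   # synthetic labels
print(greedy_mrmr(X, y, 3))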
The present paper deals with word sense induction from lexical co-occurrence graphs. We construct such graphs on large Russian corpora and then use them to cluster the results of Mail.ru search according to the meanings of the query. We compare different methods of performing such clustering and different source corpora. Models of applying distributional semantics to big linguistic data are described.
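A minimal illustration of graph-based word sense induction (not the paper's exact method): build a co-occurrence graph over the ambiguous query word's neighbours and treat the connected clusters of that ego network as induced senses. The toy contexts below are invented.

from collections import defaultdict
from itertools import combinations

# Toy contexts for the ambiguous query word "jaguar" (invented data).
contexts = [
    ["jaguar", "car", "engine", "speed"],
    ["jaguar", "engine", "dealer", "car"],
    ["jaguar", "cat", "jungle", "prey"],
    ["jaguar", "prey", "cat", "habitat"],
]

# Build the co-occurrence graph of the target word's neighbours.
graph = defaultdict(set)
for ctx in contexts:
    words = [w for w in ctx if w != "jaguar"]
    for a, b in combinations(set(words), 2):
        graph[a].add(b)
        graph[b].add(a)

def components(g):
    """Connected components of the ego network serve as induced senses."""
    seen, comps = set(), []
    for node in g:
        if node in seen:
            continue
        stack, comp = [node], set()
        while stack:
            n = stack.pop()
            if n in comp:
                continue
            comp.add(n)
            stack.extend(g[n] - comp)
        seen |= comp
        comps.append(comp)
    return comps

print(components(graph))  # two clusters: car-related and animal-related senses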
A method for fuzzy full-text search is proposed. The method follows a popular two-stage scheme with a novel second stage: a preliminary search stage using an n-gram inverted index and, at the second stage, relevance checking between the query and documents using frequency-annotated suffix trees (ASTs). The ASTs are built for all documents of the collection off-line. The method is compared with two popular fuzzy text retrieval techniques, one using n-gram inverted indexing with Levenshtein distance checking and signature hashing, and the other being Lemur, a popular toolkit for language modelling and information retrieval. For computational experiments we use the "Reuters 21578" text collection and a collection of USPTO patents. Our AST-based method generally leads to accuracy scores that are similar to those obtained by the winner, the Levenshtein distance-based method. However, our method significantly outperforms the Levenshtein distance-based method in terms of speed. Therefore, when both criteria, accuracy and speed, are considered simultaneously, the AST-based method shows significant advantages.
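As a rough sketch of the first (candidate-retrieval) stage described above, the following builds a character trigram inverted index and ranks documents by the number of n-grams shared with a possibly misspelled query; the AST-based relevance checking of the second stage is not reproduced here, and the toy collection is invented.

from collections import defaultdict, Counter

def ngrams(text, n=3):
    text = f"${text}$"          # pad so word boundaries also form n-grams
    return {text[i:i + n] for i in range(len(text) - n + 1)}

# Build the n-gram inverted index over a toy document collection.
docs = {1: "patent retrieval", 2: "fuzzy text search", 3: "suffix trees"}
index = defaultdict(set)
for doc_id, text in docs.items():
    for g in ngrams(text):
        index[g].add(doc_id)

def candidates(query, top=2):
    """Stage one: rank documents by the number of shared query n-grams."""
    votes = Counter()
    for g in ngrams(query):
        for doc_id in index[g]:
            votes[doc_id] += 1
    return votes.most_common(top)

print(candidates("fuzy serch"))   # the misspelled query still hits document 2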
Proceedings of the 9th International Symposium on Intelligent Distributed Computing – IDC'2015, Guimarães, Portugal, October 2015