Роль общей и специфической лексики при извлечении информации из текста на примере анализа события «Ввод новых технологий»
This paper discusses approaches to the selection of keywords, used for information extraction of event frames. In particular, the innovation event is associated with different lexical items in different areas of knowledge. The paper evaluated the contribution of general and specific vocabulary in the representation of the frame in a particular subject area.
Formal Concept Analysis (FCA) is a mathematically well-founded theory aimed at data analysis and classication, introduced and detailed in the book of Bernhard Ganter and Rudolf Wille, \Formal Concept Analysis", Springer 1999. The area came into being in the early 1980s and has since then spawned over 10000 scientic publications and a variety of practically deployed tools. FCA allows one to build from a data table with objects in rows and attributes in columns a taxonomic data structure called concept lattice, which can be used for many purposes, especially for Knowledge Discovery and Information Retrieval. The \Formal Concept Analysis Meets Information Retrieval" (FCAIR) workshop collocated with the 35th European Conference on Information Retrieval (ECIR 2013) was intended, on the one hand, to attract researchers from FCA community to a broad discussion of FCA-based research on information retrieval, and, on the other hand, to promote ideas, models, and methods of FCA in the community of Information Retrieval. This volume contains 11 contributions to FCAIR workshop (including 3 abstracts for invited talks and tutorial) held in Moscow, on March 24, 2013. All submissions were assessed by at least two reviewers from the program committee of the workshop to which we express our gratitude. We would also like to thank the co-organizers and sponsors of the FCAIR workshop: Russian Foundation for Basic Research, National Research University Higher School of Economics, and Yandex.
This book constitutes the refereed proceedings of the 20th International Symposium on String Processing and Information Retrieval, SPIRE 2013, held in Jerusalem, Israel, in October 2013. The 18 full papers, 10 short papers were carefully reviewed and selected from 60 submissions. The program also featured 4 keynote speeches. The following topics are covered: fundamentals algorithms in string processing and information retrieval; SP and IR techniques as applied to areas such as computational biology, DNA sequencing, and Web mining.
Many semantic text analysis problems employ string-to-text relevance measures. Research paper annotation problem is no exception. In general, research papers are annotated according to a system of topics, organized as a taxonomy, a hierarchy of topics (or concepts). For example the papers, published in journals of the international Association of Computing Machinery (ACM), the most influential organization in the Computer Science world, are annotated according to the Computing Classification System taxonomy (ACM CCS).
String-to-text relevance measures should be used to automate the research paper annotation procedure since taxonomy topics are strings ant research papers or any of their constituents are texts. A relevance measure maps a string–text pair to a real number. The meaning of the mapping depends on the relevance model under consideration. Under any model, the higher the relevance value, the stronger the association between the string and the text.
This paper explores the use of phrase-to-text relevance measures to annotate research papers in Computer Science by key phrases taken from the ACM Computing Classification System. Three phrase-to-text relevance measures are experimentally compared in this setting. The measures are: (a) cosine relevance score between conventional vector space representations of the texts coded with tf-idf weighting; (b) a popular characteristic of the probability of “elite” term generation BM25; and (c) a characteristic of the symbol conditional probability averaged over matching fragments in suffix trees representing texts and phrases, CPAMF, introduced by the authors. Our experiment is conducted over a set of texts published in journals of the ACM and manually annotated by their authors using topics from the ACM CCS. Applying any of the relevance measures to an article results in a list of taxonomy topics sorted in the descending order of their relevance values. The results are evaluated by comparing these sorted lists and lists of topics assigned to articles manually. The higher a manually assigned topic is placed in a relevance based sorted list of topics, the more accurate the sorted list is. The accuracy of the computational annotations is scored by using three different scoring functions: a) MAP, b) nDCG, c) Intersection at k, where (a) and (b) are taken from the literature, and (c) is introduced by the authors. It appears, CPAMF outperforms both the cosine measure and BM25 by a wide margin over all three scoring functions.
A vast amount of documents in the Web have duplicates, which is a challenge for developing efficient methods that would compute clusters of similar documents. In this paper we use an approach based on computing (closed) sets of attributes having large support (large extent) as clusters of similar documents. The method is tested in a series of computer experiments on large public collections of web documents and compared to other established methods and software, such as biclustering, on same datasets. Practical efficiency of different algorithms for computing frequent closed sets of attributes is compared.
Formal Concept Analysis (FCA) is an unsupervised clustering technique and many scientific papers are devoted to applying FCA in Information Retrieval (IR) research. We collected 103 papers published between 2003-2009 which mention FCA and information retrieval in the abstract, title or keywords. Using a prototype of our FCA-based toolset CORDIET, we converted the pdf-files containing the papers to plain text, indexed them with Lucene using a thesaurus containing terms related to FCA research and then created the concept lattice shown in this paper. We visualized, analyzed and explored the literature with concept lattices and discovered multiple interesting research streams in IR of which we give an extensive overview. The core contributions of this paper are the innovative application of FCA to the text mining of scientific papers and the survey of the FCA-based IR research.
The algorithm to adapt lexical complexity in the news article which can be used as materials for learning language presented in the paper. We consider words substitution retrieval according to wordnet-based and corpus-based semantic relatedness. Two corpus-based similarity measures empirically tested: Vector Space Model and Distributional Semantic Model. This language processing algorithm has created as a client-server application. It retrieves appropriate text from Web-resource. Next it performs adaptation procedure
The article discusses the strategy of «mixing» methods, particularly prevalent in the Western research tradition. Covers the methods of text analysis, demonstrated the difference between formal or approach on the example of the study of the image of modern Russia in the texts of the American edition of «New York Times», where attention is paid to algorithms work with texts. It is shown that for the study of such phenomena as the image of the country, the combination of formal or approaches to the analysis of the text is a necessary and natural research phenomenon.
Doctoral students were invited to the Doctoral Consortium held in conjunction with the main conference of ECIR 2013. The Doctoral Consortium aimed to provide a constructive setting for presentations and discussions of doctoral students’ research projects with senior researchers and other participating students. The two main goals of the Doctoral Consortium were: 1) to advise students regarding current critical issues in their research; and 2) to make students aware of the strengths and weakness of their research as viewed from different perspectives. The Doctoral Consortium was aimed for students in the middle of their thesis projects; at minimum, students ought to have formulated their research problem, theoretical framework and suggested methods, and at maximum, students ought to have just initiated data analysis. The Doctoral Consortium took place on Sunday, March 24, 2013, at the ECIR 2013 venue, and participation is by invitation only. The format was designed as follows: The doctoral students presents summaries of their work to other participating doctoral students and the senior researchers. Each presentation was followed by a plenary discussion, and individual discussion with one senior advising researcher. The discussions in the group and with the advisors were intended to help the doctoral student to reflect on and carry on with their thesis work.