Tools for Developing Information Extraction Systems for Russian-Language Texts
Semantic web technologies promise to bring companies closer to their customers and to deliver more relevant content to consumers than ever before. Two technologies in particular will help build a sustainable advantage for the investor relations team: the first is natural language processing and the second is content enhancement. Intuitively, semantic content should help establish a higher quality of communication between information providers and consumers. This chapter describes the state of the art in digital text information extraction, specifically the application of semantic technology to the challenges facing the investor relations department. We discuss the roots of human language technology and ontology-driven information extraction, and how the extracted semantic metadata can be used for better decision making, market monitoring and competitor intelligence. We consider ontology as a sound semantic platform for defining the meaning of content and consequently supporting prudent access to data for business intelligence. Examples are given of dynamic hypertext views, a solution that links different web pages together based on their semantic meaning. The proposed solution rests on an ontology-driven information extraction approach, a framework that merges identical entities and stores the semantic metadata in a knowledge base. This framework supports the complete transformation process, including web page crawling, knowledge extraction, the creation of unique identifiers, and presentations offering access to the portal. In this context, we describe how these technologies are being used in real customer scenarios and compare the classical search approach with a more intelligent approach based on ontology and information extraction. In particular, we describe semantic indexing and the building of a knowledge base from various sources, and give an introduction to creating a domain ontology based on customer queries.
Then we tackle the issues of merging information from text with semi-structured information from the Web, highlighting the relation to Linked Data using standards such as RDF/XML. Finally, we present possible user interfaces that display the aggregated semantic metadata inside a portal and in other third-party software tools. The chapter concludes by looking beyond the current solution to how semantic technology will add more information in the near future, including a short survey of recent thinking that offers potential extensions to today’s model.
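The steps named in this abstract (merging identical entities, creating unique identifiers, storing metadata in a knowledge base, and linking pages by shared meaning) can be illustrated with a minimal sketch. It is not the chapter's framework: the namespace `http://example.org/entity/`, the helper names, the crude string normalization used as a merge criterion, and the sample mentions are all assumptions made for illustration. Triples are kept as plain tuples; serializing them to RDF/XML would need an RDF library and is left out.

```python
import hashlib
from collections import defaultdict

BASE_URI = "http://example.org/entity/"  # hypothetical namespace
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"
RDFS_LABEL = "http://www.w3.org/2000/01/rdf-schema#label"


def normalize(name):
    """Crude merge key: drop dots and commas, lowercase, collapse whitespace."""
    for ch in ".,":
        name = name.replace(ch, " ")
    return " ".join(name.lower().split())


def entity_uri(name, etype):
    """Derive a stable identifier from the entity type and normalized name."""
    key = f"{etype}:{normalize(name)}".encode("utf-8")
    return BASE_URI + hashlib.sha1(key).hexdigest()[:12]


def build_knowledge_base(mentions):
    """Merge mentions that share an identifier and record their source pages."""
    kb = defaultdict(lambda: {"type": None, "labels": set(), "pages": set()})
    for page, name, etype in mentions:
        entry = kb[entity_uri(name, etype)]
        entry["type"] = etype
        entry["labels"].add(name)
        entry["pages"].add(page)
    return dict(kb)


def to_triples(kb):
    """Express the knowledge base as (subject, predicate, object) tuples."""
    triples = []
    for uri, entry in sorted(kb.items()):
        triples.append((uri, RDF_TYPE, BASE_URI + "type/" + entry["type"]))
        for label in sorted(entry["labels"]):
            triples.append((uri, RDFS_LABEL, label))
    return triples


def related_pages(kb, page):
    """Pages that mention at least one entity also mentioned on `page`."""
    related = set()
    for entry in kb.values():
        if page in entry["pages"]:
            related |= entry["pages"]
    related.discard(page)
    return related


# Illustrative mentions: (source page, surface form, entity type).
mentions = [
    ("page1.html", "Acme Corp.", "Organization"),
    ("page2.html", "acme corp", "Organization"),
    ("page2.html", "Jane Doe", "Person"),
]
kb = build_knowledge_base(mentions)
triples = to_triples(kb)
```

Here the two spellings of the organization collapse into one entity, so `related_pages(kb, "page1.html")` links the first page to the second. A real system would replace the string normalization with ontology-aware entity resolution.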
This paper presents a rule-based approach to the Information Extraction (IE) task within the FactRuEval-2016 competition. Our system is based on ABBYY Compreno technology. The technology uses the results of deep syntactic-semantic analysis, which significantly reduces the number of rules needed and keeps them concise. The evaluation was conducted on the FactRuEval dataset. FactRuEval is an open evaluation of IE systems in which participants could take part in three tracks. The first track required detecting the boundaries and types of named entities in a text. The second track required extracting normalized attributes and performing local identification of named entities. The third track required extracting facts of certain types from a text. We took part in all three tracks under the nickname violet. Our method proved successful: we achieved high F-measures in the Named Entity Recognition tracks and the highest F-measure in the Fact Extraction track.
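For readers unfamiliar with the metric, the F-measure reported above is the harmonic mean of precision and recall. The sketch below computes it for named entity spans under an exact-match assumption: a predicted span counts only if its boundaries and type both agree with a gold span. The official FactRuEval scoring procedure may differ from this simplification, and the spans and type labels are invented for illustration.

```python
def precision_recall_f1(gold, predicted):
    """Exact-match precision, recall and F1 over sets of (start, end, type) spans."""
    gold, predicted = set(gold), set(predicted)
    true_positives = len(gold & predicted)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Illustrative data: three gold entities, four predictions, two exact matches.
gold = {(0, 5, "Person"), (10, 15, "Organization"), (20, 26, "Location")}
predicted = {
    (0, 5, "Person"),
    (10, 15, "Location"),   # right boundaries, wrong type: not a match
    (20, 26, "Location"),
    (30, 34, "Organization"),
}
precision, recall, f1 = precision_recall_f1(gold, predicted)
```

With two matches out of four predictions and three gold spans, precision is 1/2, recall is 2/3, and F1 is 4/7.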