Development of a Model Based on Knowledge about Authors for Search Applications
We propose a new technology for a broad range of search applications over natural-language texts. Its application to the expert search task is considered in detail, using the TREC Enterprise track as an example. The vocabulary is treated statistically, but, in contrast to the standard TF-IDF metric, two special metrics are used. They incorporate information about how authors use the lexicon and about communications between authors. Calculating the connection cardinality between an author and the lexicon reveals terms that are characteristic of that author, so the author can be found by means of these terms. Lexicon weighting allows us to extract from the whole collection a small portion of the vocabulary, which we call significant. The significant lexicon enables effective search within a thematically specialized field of knowledge. Thus, our search engine minimizes the lexicon needed to answer a query by extracting its most important part. The ranking function takes into account term usage statistics among authors in order to raise the role of significant terms relative to other, noisier ones. We demonstrate that effective expertise retrieval is possible thanks to several rationally constructed heuristic rating indicators. First, we achieve an expert search effectiveness comparable to that of the most effective modern information retrieval engines. Second, the chosen indicators make it possible to distinguish between “good” and “bad” queries, which is essential for further optimization of our engine. We discuss the possibility of applying our engine to other search and analytic scenarios, such as plagiarism detection, information gap retrieval, and others.
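The abstract does not define the two metrics, so the following is only an illustrative sketch of the general idea of an author-to-lexicon connection, a significant lexicon, and a ranking that favors significant terms. Everything in it is an assumption made for illustration: the function names, the use of an author's share of a term's occurrences as the connection strength, the rule that a term is significant when at most `max_authors` authors use it, and the `noise_weight` value. It is not the authors' actual formulation.

```python
from collections import Counter, defaultdict


def author_term_strength(docs):
    """Map each term to {author: share of that term's occurrences}.

    `docs` is an iterable of (author, text) pairs. The "share" is a
    hypothetical stand-in for the author-lexicon connection metric.
    """
    counts = defaultdict(Counter)
    for author, text in docs:
        for term in text.lower().split():
            counts[term][author] += 1
    strength = {}
    for term, by_author in counts.items():
        total = sum(by_author.values())
        strength[term] = {a: c / total for a, c in by_author.items()}
    return strength


def significant_terms(strength, max_authors=1):
    """Assumed rule: a term is significant if few authors use it."""
    return {t for t, shares in strength.items() if len(shares) <= max_authors}


def rank_experts(query, strength, significant, noise_weight=0.1):
    """Rank authors for a query, down-weighting non-significant terms."""
    scores = Counter()
    for term in query.lower().split():
        weight = 1.0 if term in significant else noise_weight
        for author, share in strength.get(term, {}).items():
            scores[author] += weight * share
    return [author for author, _ in scores.most_common()]
```

In this sketch, a term shared by many authors contributes little to the ranking, while a term used by a single author points almost directly to that author, which mirrors the role the abstract assigns to the significant lexicon.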