### Book chapter

## Term Weighting in Expert Search Task: Analyzing Communication Patterns

The goal of the expert search task is finding knowledgeable persons within the enterprise. In this paper we focus on its distinctions from the other information retrieval tasks. We review the existing approaches and propose a new term weighting scheme which is based on analysis of communication patterns between people. The effectiveness of the proposed approach is evaluated on a collection of e-mails from an organization of approximately 1500 people. Results show that it is possible to take into account communication structure in the process of term weighting, effectively combining communication-based and document-based approaches to expert finding.
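As a rough illustration of the idea, here is a toy scheme (our own sketch with hypothetical data, not the paper's actual formula) that boosts a TF-IDF-style document score by a communication factor counting how many distinct contacts a person discussed the term with:

```python
import math
from collections import defaultdict

def expert_scores(emails, query_terms):
    """Score people for a query by a TF-IDF-style weight boosted by a
    communication factor: a term counts more for a person when it
    occurs in mail exchanged with many distinct contacts.
    Each email is a dict: {"from": str, "to": [str], "text": str}."""
    df = defaultdict(int)                              # term -> document frequency
    tf = defaultdict(lambda: defaultdict(int))         # person -> term -> count
    partners = defaultdict(lambda: defaultdict(set))   # person -> term -> contacts
    for mail in emails:
        people = [mail["from"]] + mail["to"]
        terms = set(mail["text"].lower().split())
        for t in terms:
            df[t] += 1
        for p in people:
            others = set(people) - {p}
            for t in terms:
                tf[p][t] += 1
                partners[p][t] |= others
    scores = {}
    for p in tf:
        s = 0.0
        for t in query_terms:
            if tf[p][t] == 0:
                continue
            idf = math.log((len(emails) + 1) / (df[t] + 1))
            comm = 1.0 + math.log(1 + len(partners[p][t]))  # communication factor
            s += tf[p][t] * idf * comm
        scores[p] = s
    return scores

# Hypothetical corpus: "alice" discusses "kernel" with two distinct contacts.
emails = [
    {"from": "alice", "to": ["bob"], "text": "kernel scheduler patch"},
    {"from": "alice", "to": ["carol"], "text": "kernel bug report"},
    {"from": "bob", "to": ["carol"], "text": "lunch plans"},
]
scores = expert_scores(emails, ["kernel"])
```

Here "alice" outranks the others on the query term both through term frequency and through the wider circle of contacts she used the term with.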

### In book

Let $G = (V,E)$ be an undirected graph, $T \subseteq V$ be a set of terminals. Then a natural combinatorial problem consists in finding the maximum number of vertex-disjoint paths connecting distinct terminals. For this problem, a clever construction suggested by Gallai reduces it to computing a maximum non-bipartite matching and thus gives an $O(m \sqrt{n} \log(n^2/m) / \log n)$-time algorithm (hereinafter $n := |V|$, $m := |E|$). Now let us consider the fractional relaxation, i.e. allow T-path packings with arbitrary nonnegative real weights. It is known that there always exists a half-integral solution, that is, one only needs to assign weights $0, 1/2, 1$ to maximize the total weight of T-paths. It is also known that an optimum half-integral packing can be found in strongly polynomial time, but the actual time bounds are far from satisfactory. In this paper we present a novel algorithm that solves the half-integral problem within $O(m \sqrt{n} \log(n^2/m) / \log n)$ time, thus matching the complexities of the integral and half-integral versions.
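To see the integrality gap concretely, a brute-force check on the smallest interesting instance (the triangle with all three vertices as terminals and vertex capacities 1; with internal terminals disallowed the T-paths are exactly the three edges) confirms that half-integral weights strictly beat integral ones. This is an illustrative sketch, not the paper's algorithm:

```python
from itertools import product
from fractions import Fraction

# Smallest interesting instance: the triangle K3 with all three
# vertices as terminals and vertex capacities 1.  With internal
# terminals disallowed, the T-paths are exactly the three edges.
paths = [(0, 1), (0, 2), (1, 2)]

def best_value(allowed):
    """Brute-force the best packing with path weights drawn from
    `allowed`, subject to total weight <= 1 through each vertex."""
    best = Fraction(0)
    for w in product(allowed, repeat=len(paths)):
        load = {v: Fraction(0) for v in (0, 1, 2)}
        for p, wp in zip(paths, w):
            for v in p:
                load[v] += wp
        if all(l <= 1 for l in load.values()):
            best = max(best, sum(w))
    return best

integral = best_value([Fraction(0), Fraction(1)])
half_integral = best_value([Fraction(0), Fraction(1, 2), Fraction(1)])
print(integral, half_integral)  # -> 1 3/2
```

Any two edges of the triangle share a vertex, so an integral packing can use only one of them; giving every edge weight 1/2 loads each vertex exactly to capacity and achieves value 3/2, which is optimal since each path consumes two units of total capacity out of three.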

This book constitutes the proceedings of the 15th International Computer Science Symposium in Russia, CSR 2020, held in Yekaterinburg, Russia, in June 2020.

The 25 full papers and 6 invited papers were carefully reviewed and selected from 49 submissions. The papers cover a broad range of topics, such as: algorithms and data structures; computational complexity, including hardness of approximation and parameterized complexity; randomness in computing, approximation algorithms, fixed-parameter algorithms; combinatorial optimization, constraint satisfaction, operations research; computational geometry; string algorithms; formal languages and automata, including applications to computational linguistics; codes and cryptography; combinatorics in computer science; computational biology; applications of logic to computer science, proof complexity; database theory; distributed computing; fundamentals of machine learning, including learning theory, grammatical inference and neural computing; computational social choice; quantum computing and quantum cryptography; theoretical aspects of big data.

The conference was cancelled as an in-person event due to the COVID-19 pandemic.

We consider an undirected graph $G = (VG, EG)$ with a set $T \subseteq VG$ of terminals, and with nonnegative integer capacities $c(v)$ and costs $a(v)$ of nodes $v\in VG$. A path in $G$ is a \emph{$T$-path} if its ends are distinct terminals. By a \emph{multiflow} we mean a function $F$ assigning to each $T$-path $P$ a nonnegative rational \emph{weight} $F(P)$, and a multiflow is called \emph{feasible} if the sum of weights of $T$-paths through each node $v$ does not exceed $c(v)$. The \emph{value} of $F$ is the sum of weights $F(P)$, and the \emph{cost} of $F$ is the sum of $F(P)$ times the cost of $P$ w.r.t. $a$, over all $T$-paths $P$. Generalizing known results on edge-capacitated multiflows, we show that the problem of finding a minimum cost multiflow among the feasible multiflows of maximum possible value admits \emph{half-integer} optimal primal and dual solutions. Moreover, we devise a strongly polynomial algorithm for finding such optimal solutions.
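The definitions of feasibility, value, and cost can be made concrete with a small checker (an illustrative sketch over hypothetical node names, not the paper's algorithm):

```python
from fractions import Fraction

def check_multiflow(flow, cap, cost):
    """`flow` maps each T-path (a tuple of nodes) to a nonnegative
    weight.  Feasible iff the total weight of the paths through each
    node is at most its capacity; value = sum of weights; cost = sum
    over paths of weight times the path's total node cost."""
    load = {v: Fraction(0) for v in cap}
    value = Fraction(0)
    total_cost = Fraction(0)
    for path, w in flow.items():
        for v in path:
            load[v] += w
        value += w
        total_cost += w * sum(cost[v] for v in path)
    feasible = all(load[v] <= cap[v] for v in cap)
    return feasible, value, total_cost

# Half-integral example: two T-paths between terminals a and b,
# routed through intermediate nodes c and d, each at weight 1/2.
half = Fraction(1, 2)
flow = {("a", "c", "b"): half, ("a", "d", "b"): half}
cap = {"a": 1, "b": 1, "c": 1, "d": 1}
cost = {"a": 0, "b": 0, "c": 2, "d": 1}
feasible, value, total_cost = check_multiflow(flow, cap, cost)
```

In this toy instance both terminals are loaded exactly to capacity, the multiflow has value 1, and its cost is 1/2 · 2 + 1/2 · 1 = 3/2.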

We present the results of applying our enterprise expert search system to the tasks introduced at the Text Retrieval Conference (TREC) in 2005–2007. The expert search system is based on the analysis of content and communication topology in an enterprise information space. In the performed experiments, an optimal set of weighting coefficients for three query-candidate associating algorithms is selected to achieve the best search efficiency on a specified corpus. The obtained performance proved to be better than that of most TREC participants. We propose the hypothesis that efficiency can be further improved by means of query classification.
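The coefficient-selection step can be sketched as a simple grid search over a linear blend of the three association scores (the blend form, the precision-at-1 objective, and all data below are illustrative assumptions; the actual system's combination rule and metric may differ):

```python
from itertools import product

def blend(scores, w):
    # Linear combination of the three per-algorithm association
    # scores (an assumed form of the query-candidate association).
    return sum(wi * si for wi, si in zip(w, scores))

def tune_weights(queries, step=0.25):
    """Grid-search weights (w1, w2, w3) with w1 + w2 + w3 = 1,
    maximizing precision@1.  `queries` is a list of pairs
    (candidate -> (s1, s2, s3), set of relevant candidates)."""
    grid = [i * step for i in range(int(round(1 / step)) + 1)]
    best_w, best_p1 = None, -1.0
    for w1, w2 in product(grid, repeat=2):
        w3 = 1.0 - w1 - w2
        if w3 < -1e-9:
            continue
        hits = sum(
            1 for cand, relevant in queries
            if max(cand, key=lambda c: blend(cand[c], (w1, w2, w3))) in relevant
        )
        p1 = hits / len(queries)
        if p1 > best_p1:
            best_w, best_p1 = (w1, w2, w3), p1
    return best_w, best_p1

# Two toy queries in which no single algorithm ranks the true expert
# "ann" first on both, but a blend of algorithms 1 and 3 does.
queries = [
    ({"ann": (1, 0, 0), "bob": (0, 1, 0.5)}, {"ann"}),
    ({"ann": (0, 0, 1), "bob": (0.2, 1, 0.5)}, {"ann"}),
]
best_w, best_p1 = tune_weights(queries)
```

A coarse grid like this is cheap because the weights live on a two-dimensional simplex; a finer `step` refines the optimum at quadratic cost in the number of grid points.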

Given a digraph $G = (VG,AG)$, an even factor $M \subseteq AG$ is a set formed by node-disjoint paths and even cycles. Even factors in digraphs were introduced by Geelen and Cunningham and generalize path matchings in undirected graphs. Finding an even factor of maximum cardinality in a general digraph is known to be NP-hard but for the class of odd-cycle symmetric digraphs the problem is polynomially solvable. So far the only combinatorial algorithm known for this task is due to Pap; its running time is $O(n^4)$ (hereinafter $n$ denotes the number of nodes in $G$ and $m$ denotes the number of arcs or edges). In this paper we introduce a novel sparse recovery technique and devise an $O(n^3 \log n)$-time algorithm for finding a maximum cardinality even factor in an odd-cycle symmetric digraph. Our technique also applies to other similar problems, e.g. finding a maximum cardinality square-free simple $b$-matching.
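The defining property of an even factor is easy to verify mechanically; here is a small checker (an illustrative sketch, not part of the algorithm in the paper):

```python
def is_even_factor(arcs):
    """Check that the arc set `arcs` of a digraph forms node-disjoint
    paths and even cycles: every node has at most one outgoing and one
    incoming arc in the set, and every cycle it closes is even."""
    succ, pred = {}, {}
    for u, v in arcs:
        if u in succ or v in pred:
            return False          # node reused: components not node-disjoint
        succ[u] = v
        pred[v] = u
    seen = set()
    for u in succ:
        if u in seen or u in pred:
            continue              # start walks only at path starts
        while u in succ:          # mark every node on the path
            seen.add(u)
            u = succ[u]
        seen.add(u)
    for u in succ:                # remaining arcs form cycles
        if u in seen:
            continue
        length, v = 0, u
        while True:
            seen.add(v)
            v = succ[v]
            length += 1
            if v == u:
                break
        if length % 2 == 1:
            return False          # odd cycle: not an even factor
    return True
```

For example, the path 0→1→2 and the 4-cycle 0→1→2→3→0 pass the check, while the 3-cycle 0→1→2→0 fails it.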

In this chapter, the results of applying a model enterprise expert search system to the tasks introduced at the Text Retrieval Conference (TREC) are presented. Two specific indicators are used to treat the lexicon statistically. Calculating the lexicon-candidate connection power reveals distinctive terms that are characteristic of a candidate, so this candidate can be found by such terms. Calculating the weight of the lexicon allows a small portion of the vocabulary, identified as significant, to be extracted from the whole collection. The significant lexicon enables effective search in thematically specialised knowledge fields. Thus, the search engine minimises the lexicon necessary for answering a query by extracting its most important part. The ranking function takes term-usage statistics among candidates into account to raise the role of significant terms relative to other, noisier ones. In describing the application of the model presented, the possibility of effective expertise retrieval by merging several heuristic ranking metrics into a single weighting model is demonstrated. To enhance search efficiency, the model is optimised over its free parameters. The achieved efficiency is better than that of most TREC participant models. A further efficiency improvement by means of query classification is proposed.
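A minimal sketch of the two indicators, with illustrative formulas of our own choosing (the chapter's exact definitions differ): connection power as the share of a term's occurrences belonging to one candidate, and lexicon weight as an inverse-candidate-frequency score, so terms concentrated on few candidates count as significant:

```python
import math
from collections import defaultdict

def significant_lexicon(candidate_docs, keep=0.5):
    """Illustrative versions of the two indicators: the connection
    power of term t for candidate c is the share of t's occurrences
    that belong to c; the lexicon weight of t is log(number of
    candidates / number of candidates using t).  Returns the top
    `keep` fraction of terms by weight, plus the power table."""
    term_total = defaultdict(int)
    term_by_cand = defaultdict(lambda: defaultdict(int))
    for cand, docs in candidate_docs.items():
        for doc in docs:
            for t in doc.lower().split():
                term_total[t] += 1
                term_by_cand[t][cand] += 1
    n_cand = len(candidate_docs)
    weight = {t: math.log(n_cand / len(term_by_cand[t])) for t in term_total}
    power = {(t, c): term_by_cand[t][c] / term_total[t]
             for t in term_total for c in term_by_cand[t]}
    k = max(1, int(len(weight) * keep))
    significant = set(sorted(weight, key=weight.get, reverse=True)[:k])
    return significant, power

# Hypothetical mini-collection: "quantum" is specific to one candidate,
# "lunch" and "computing" are spread across candidates.
docs = {
    "ann": ["quantum quantum computing"],
    "bob": ["lunch menu"],
    "eve": ["lunch computing"],
}
significant, power = significant_lexicon(docs, keep=0.5)
```

On this toy data the significant lexicon keeps the candidate-specific terms, and the connection power of "quantum" for "ann" is 1, so a query containing it retrieves her directly.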


This book constitutes the refereed post-conference proceedings of the 29th International Workshop on Combinatorial Algorithms, IWOCA 2018, held in Singapore, Singapore, in July 2018. The 31 regular papers presented in this volume were carefully reviewed and selected from 69 submissions. They cover diverse areas of combinatorial algorithms, complexity theory, graph theory and combinatorics, combinatorial optimization, cryptography and information security, algorithms on strings and graphs, graph drawing and labelling, computational algebra and geometry, computational biology, probabilistic and randomised algorithms, algorithms for big data analytics, and new paradigms of computation.