Abstract. Research and development (R&D) involves not only researchers but also many other specialists from different areas. All of them solve a variety of tasks that require comprehensive information and analytical support. This chapter discusses the major tasks arising in R&D: studying the state of the art in a given research area, assessing the prospects of research fields and forecasting their development, assessing the quality of scientific publications including plagiarism detection, and automated examination of proposed R&D projects. A number of information and analytical systems have been developed to address these tasks. The main goal of this chapter is to review the R&D support functions of well-known and widely used search and analytical systems and to discuss the information retrieval methods behind these functions. Keywords: Full-text search, information retrieval, R&D support, scientific publication, citation databases, scientometrics, exploratory search.
In recent years there have been a number of important improvements in exact color-based maximum clique solvers, which have considerably enhanced their performance. Initial vertex ordering is one strategy known to have a significant impact on the size of the search tree. Typically, a degenerate sorting by minimum degree is used; the literature also reports different tiebreaking strategies. A systematic study of the impact of initial sorting in the light of new cutting-edge ideas (e.g. recoloring, selective coloring, ILS initial lower bound computation [15, 16], or MaxSAT-based pruning) is, however, lacking. This paper presents a new initial sorting procedure and relates its performance to the aforementioned new variants implemented in the leading solver BBMC [9, 10].
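The "degenerate sorting by minimum degree" mentioned above is the standard degeneracy ordering: repeatedly remove a vertex of minimum degree in the remaining graph. A minimal sketch (the adjacency-map representation and id-based tiebreaking are illustrative choices, not taken from the paper):

```python
def degeneracy_ordering(adj):
    """Degenerate sorting by minimum degree: repeatedly remove a vertex of
    minimum degree in the remaining graph. `adj` maps vertex -> set of neighbors.
    Returns the removal order, a common initial vertex ordering for clique solvers."""
    adj = {v: set(nbrs) for v, nbrs in adj.items()}  # work on a copy
    order = []
    while adj:
        v = min(adj, key=lambda u: (len(adj[u]), u))  # break degree ties by vertex id
        order.append(v)
        for u in adj[v]:
            adj[u].discard(v)  # v is removed from the remaining graph
        del adj[v]
    return order
```

Different tiebreaking rules (the `key` above) are precisely the variations the abstract refers to.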
In this paper we address the challenging problem of incorporating preferences on the possible shapes of an object into a binary image segmentation framework. We extend the well-known conditional random field model by adding new variables that are responsible for the shape of the object. We describe the shape via a flexible graph augmented with vertex positions and edge widths. We derive exact and approximate algorithms for MAP estimation of the label and shape variables given an image. An original learning procedure for tuning the parameters of our model, based on unlabeled images with only shape descriptions given, is also presented. Experiments confirm that our model improves segmentation quality on hard-to-segment images by taking into account knowledge about the typical shapes of the object.
The ORD corpus is a representative resource of everyday spoken Russian that contains about 1000 h of long-term audio recordings of daily communication made in real settings by research volunteers. ORD macro episodes are large communication episodes unified by the setting/scene of communication, the social roles of the participants, and their general activity. The paper describes the annotation principles used for tagging macro episodes, provides current statistics on the communication situations presented in the corpus, and reveals their most common types. The annotation of communication situations allows these codes to be used as filters for selecting audio data, thereby making it possible to study Russian everyday speech in different communication situations and to determine and describe various registers of spoken Russian. As an example, several high-frequency word lists referring to different communication situations are compared. The annotation of macro episodes made for the ORD corpus is a prerequisite for its further pragmatic annotation.
The open source C++ class library GridMD for distributed computing is reviewed, including its architecture, functionality, and use cases. The library is intended to facilitate the development of distributed applications that can be run on contemporary supercomputing clusters and standalone servers managed by Grid or cluster task scheduling middleware. The GridMD library was originally targeted at molecular dynamics and Monte Carlo simulations, but at present it can serve as a universal tool for developing distributed computing applications as well as for creating task management codes. In both cases the distributed application is represented by a single client-side executable built from compact C++ code. The library is primarily targeted at developing complex applications that contain many computation stages with possible data dependencies between them, which can be run efficiently in a distributed environment.
The paper presents a new geometrically motivated method for non-linear regression based on Manifold Learning techniques. The regression problem is to construct a predictive function which estimates an unknown smooth mapping f from q-dimensional inputs to m-dimensional outputs based on a training data set consisting of given ‘input-output’ pairs. The unknown mapping f determines a q-dimensional manifold M(f) consisting of all the ‘input-output’ vectors, which is embedded in the (q+m)-dimensional space and covered by a single chart; the training data set determines a sample from this manifold. Modern Manifold Learning methods allow constructing an estimator M* from the manifold-valued sample which accurately approximates the manifold. The proposed method, called Manifold Learning Regression (MLR), finds the predictive function fMLR that ensures the equality M(fMLR) = M*. MLR simultaneously estimates the m×q Jacobian matrix of the mapping f.
In this paper, we analyze a new approach to demand prediction in retail. One of the significant gaps in demand prediction by machine learning methods is unaccounted-for censorship in sales data. Econometric approaches to modeling censored demand are used to obtain consistent and unbiased estimates of parameters. These approaches can also be transferred to different classes of machine learning models to reduce the prediction error of sales volume. In this study we build two ensemble models to predict demand with and without demand censorship, aggregating predictions from machine learning methods such as linear regression, ridge regression, LASSO, and random forest. Having estimated the predictive properties of both models, we show that accounting for the censored nature of demand yields the better predictive power.
Structured-output learning is a challenging problem, particularly because of the difficulty in obtaining large datasets of fully labelled instances for training. In this paper we try to overcome this difficulty by presenting a multi-utility learning framework for structured prediction that can learn from training instances with different forms of supervision. We propose a unified technique for inferring the loss functions most suitable for quantifying the consistency of solutions with the given weak annotation. We demonstrate the effectiveness of our framework on the challenging semantic image segmentation problem, for which a wide variety of annotations can be used. For instance, the popular training datasets for semantic segmentation are composed of images with hard-to-generate full pixel labellings, as well as images with easy-to-obtain weak annotations, such as bounding boxes around objects, or image-level labels that specify which object categories are present in an image. Experimental evaluation shows that the use of annotation-specific loss functions dramatically improves segmentation accuracy compared to the baseline system, where only one type of weak annotation is used.
We propose a prototype of a near-duplicate detection system for web-shop owners. It is a typical situation for such online businesses to buy descriptions of their goods from so-called copywriters. A copywriter may cheat from time to time and provide the owner with nearly identical descriptions for different items. In this paper we demonstrate how FCA can be used for fast clustering and revealing such duplicates in a real online perfume shop's dataset.
A hybrid approach to automated identification and monitoring of technology trends is presented. The approach combines methods of ontology-based information extraction (OBIE) with statistical methods for processing OBIE results. The key point of the approach is the so-called ‘black box’ principle: trends are identified on the basis of heuristics stemming from an elaborate ontology of a technology trend.
Conventional image recognition methods usually include dividing the keypoint neighborhood (for local features) or the whole object (for global features) into a grid of blocks, computing the gradient magnitude and orientation at each image sample point, and uniting the orientation histograms of all blocks into a single descriptor. The query image is recognized by matching its descriptors with the descriptors of reference images. The matching is usually done by summing the distances between the descriptors of corresponding blocks. Unfortunately, such an approach does not yield a correct distance between vectors of points (the histograms of each block) if the popular squared Euclidean distance is used as a discrimination measure. To calculate the correct discrimination, we propose to sum the square roots (or, more generally, an appropriate nonlinear transformation) of the distances between block histograms. This approach is experimentally examined in a face recognition problem with the FERET and AT&T datasets. The results support the statement that the proposed approach provides higher accuracy (up to 5.5%) than state-of-the-art methods, not only for the Euclidean distance but also for other popular similarity measures (L1, Kullback-Leibler, Jensen-Shannon, chi-squared, and the homogeneity-testing probabilistic neural network).
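The proposed aggregation can be stated compactly: instead of summing per-block squared Euclidean distances directly, pass each block distance through a nonlinear transform (square root by default) before summation. A minimal sketch, assuming descriptors are given as lists of equal-length block histograms (the function and argument names are illustrative):

```python
import math

def block_distance(desc_a, desc_b, transform=math.sqrt):
    """Aggregate per-block histogram distances with a nonlinear transform.

    desc_a, desc_b: lists of equal-length block histograms.
    The conventional approach sums the squared Euclidean distances directly;
    here each block's distance is passed through `transform` (square root
    by default, as the abstract proposes) before summation.
    """
    total = 0.0
    for ha, hb in zip(desc_a, desc_b):
        sq = sum((a - b) ** 2 for a, b in zip(ha, hb))  # squared Euclidean per block
        total += transform(sq)
    return total
```

Other similarity measures named in the abstract (L1, Kullback-Leibler, etc.) would replace the inner squared-Euclidean sum.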
Several notions of links between contexts – intensionally related concepts, shared intents, and bonds, as well as interrelations thereof – are considered. The algorithmic complexity of the problems related to the respective closure operators is studied. An expression of bonds in terms of shared intents is given.
Lexicographically minimal and lexicographically maximal suffixes of a string are fundamental notions of stringology. It is well known that the lexicographically minimal and maximal suffixes of a given string S can be computed in linear time and space by constructing a suffix tree or a suffix array of S. Here we consider the case when S is a substring of another string T of length n. We propose two linear-space data structures for T which allow computing the minimal suffix of S in O(log^{1+ε} n) time (for any fixed ε > 0) and the maximal suffix of S in O(log n) time. Both data structures take O(n) time to construct.
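The data structures themselves are not reproduced in the abstract; for reference, a brute-force baseline that computes the same answers in O(n^2) time (the polylogarithmic query structures of the paper answer these queries for arbitrary substrings of T):

```python
def minimal_suffix(s):
    """Lexicographically minimal suffix of s, by scanning all suffixes.
    Naive O(n^2) baseline; the paper's data structures answer this in
    O(log^{1+eps} n) time per substring query after O(n) preprocessing."""
    return min(s[i:] for i in range(len(s)))

def maximal_suffix(s):
    """Lexicographically maximal suffix of s (O(log n) per query in the paper)."""
    return max(s[i:] for i in range(len(s)))
```

For example, among the suffixes of "banana", the minimal is "a" and the maximal is "nana".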
In this paper we propose a woven block code construction based on two convolutional codes. We also propose a soft-input decoder that allows this construction to achieve better error correction performance than turbo codes with a conventional decoder. Computer simulation has shown a 0.1 dB energy gain relative to the LTE turbo code. Asymptotically, the proposed code has a distance greater than the product of the free distances of the component codes.
We construct a mathematical model of anti-virus protection of local area networks. The model belongs to the class of regenerative processes. To protect the network from external virus attacks and the spread of viruses within the network, we apply two methods: updating antivirus signatures and reinstalling operating systems (OS). Operating systems are reinstalled in the case of failure of any of the computers (unscheduled emergency reinstallation) or at scheduled time moments. We consider the problem of maximizing the average income per unit time. The cumulative distribution function (CDF) of the scheduled intervals between complete OS reinstallations is considered as the control. We prove that the optimal CDF has to be degenerate, i.e., concentrated at a single point τ.
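In other words (restating the abstract's conclusion in standard notation; F denotes the CDF of the scheduled interval and τ the optimal point):

```latex
F^{*}(t) \;=\; \mathbb{1}\{t \ge \tau\} \;=\;
\begin{cases}
0, & t < \tau,\\
1, & t \ge \tau,
\end{cases}
```

i.e., it is optimal to perform complete OS reinstallations deterministically, every τ time units, rather than at randomized intervals.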
Gaussian graphical model selection is the statistical problem of identifying the Gaussian graphical model from observations. Existing Gaussian graphical model selection methods focus on the error rate for incorrect edge inclusion. However, when comparing statistical procedures, it is also important to take into account the error rate for incorrect edge exclusion. To handle this issue we consider the graphical model selection problem in the framework of multiple decision theory. We show that the statistical procedure based on simultaneous inference with UMPU individual tests is optimal in the class of unbiased procedures.