### Working paper

## Entropy analysis of n-grams and estimation of the number of meaningful language texts. Cyber security applications

This book constitutes the refereed proceedings of the 5th International Castle Meeting on Coding Theory and Applications, ICMCTA 2017, held in Vihula, Estonia, in August 2017.

The 24 full papers presented were carefully reviewed and selected for inclusion in this volume. The papers cover relevant research areas in modern coding theory, including codes and combinatorial structures, algebraic geometric codes, group codes, convolutional codes, network coding, other applications to communications, and applications of coding theory in cryptography.

We investigate one possible generalization of locally recoverable codes (LRC) with all-symbol locality and availability when recovering sets can intersect in a small number of coordinates. This feature allows us to increase the achievable code rate and still meet load balancing requirements. In this paper we derive an upper bound for the rate of such codes and give explicit constructions of codes with such a property. These constructions utilize LRC codes developed by Wang et al.

Understanding the relation between (sensory) stimuli and the activity of neurons (i.e., "the neural code") lies at the heart of understanding the computational properties of the brain. However, quantifying the information between a stimulus and a spike train has proven to be challenging. We propose a new (in vitro) method to measure how much information a single neuron transfers from the input it receives to its output spike train. The input is generated by an artificial neural network that responds to a randomly appearing and disappearing "sensory stimulus": the hidden state. The sum of this network activity is injected as current input into the neuron under investigation. The mutual information between the hidden state on the one hand and the spike trains of the artificial network or the recorded spike train on the other hand can easily be estimated because the hidden state is binary. The characteristics of the input current, such as the time constant resulting from the (dis)appearance rate of the hidden state or the amplitude of the input current (the firing frequency of the neurons in the artificial network), can be varied independently. As an example, we apply this method to pyramidal neurons in the CA1 of mouse hippocampi and compare the recorded spike trains to the optimal response of the "Bayesian neuron" (BN). We conclude that, as in the BN, information transfer in hippocampal pyramidal cells is non-linear and amplifying: the information loss between the artificial input and the output spike train is high if the input to the neuron (the firing of the artificial network) is not very informative about the hidden state. If the input to the neuron does contain a lot of information about the hidden state, the information loss is low. Moreover, neurons increase their firing rates when the (dis)appearance rate is high, so that the (relative) amount of transferred information stays constant.
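
To illustrate why a binary hidden state makes the mutual information easy to estimate, the following minimal sketch computes a generic plug-in estimate from paired samples. It is not the estimator used in the study: the time binning, the function name `mutual_information_bits`, and the toy sequences are assumptions made only for illustration.

```python
import math
from collections import Counter


def mutual_information_bits(xs, ys):
    """Plug-in estimate of I(X; Y) in bits from paired discrete samples.

    xs: hidden-state samples (e.g. 0/1 per time bin, an assumed encoding).
    ys: response samples (e.g. 0/1 for "no spike"/"spike" in the same bin).
    """
    if len(xs) != len(ys) or not xs:
        raise ValueError("xs and ys must be non-empty and of equal length")
    n = len(xs)
    joint = Counter(zip(xs, ys))   # empirical joint counts
    px = Counter(xs)               # empirical marginal counts of X
    py = Counter(ys)               # empirical marginal counts of Y
    mi = 0.0
    for (x, y), c in joint.items():
        # p(x, y) * log2( p(x, y) / (p(x) * p(y)) ), written with counts
        mi += (c / n) * math.log2(c * n / (px[x] * py[y]))
    return mi


# Toy data: a response that copies the hidden state carries 1 bit,
# a response that is independent of it carries none.
hidden = [0, 1, 0, 1]
copying_response = [0, 1, 0, 1]
independent_response = [0, 0, 1, 1]
print(mutual_information_bits(hidden, copying_response))
print(mutual_information_bits(hidden, independent_response))
```

With short recordings a plug-in estimate of this kind is biased upwards, so in practice it would need a bias correction or a comparison against shuffled data.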

We address the problem of constructing coding schemes for channels with high-order modulations. It is known that non-binary LDPC codes are especially good for such channels and significantly outperform their binary counterparts. Unfortunately, their decoding complexity is still large. In order to reduce the decoding complexity, we consider multilevel coding schemes based on non-binary LDPC codes (NB-LDPC-MLC schemes) over smaller fields. The use of such schemes gives a reasonable gain in complexity. At the same time, the performance of NB-LDPC-MLC schemes is practically the same as the performance of LDPC codes over the field matching the modulation order. In particular, by means of simulations, we show that the performance of NB-LDPC-MLC schemes over GF(16) is the same as the performance of non-binary LDPC codes over GF(64) and GF(256) in the AWGN channel with QAM 64 and QAM 256, respectively. We also perform a comparison with bit-interleaved coded modulation based on binary LDPC codes.

Consider the Bayesian problem of estimating the probability of success in a series of trials with binary outcomes. We study the asymptotic behaviour of the weighted differential entropy of the posterior probability density function (PDF) conditional on x successes after n trials, as n → ∞. Suppose that one wants to know whether a coin is fair and, for large n, is interested in the true frequency. In other words, one wants to emphasize the parameter value p = 1/2. To do so, the concept of weighted differential entropy introduced in [1968] is used when it is necessary to emphasize the frequency γ. It was found that a weight of the suggested form does not change the asymptotic form of the Shannon, Renyi, Tsallis and Fisher entropies, but changes the constants. The leading term in the weighted Fisher information is changed by a constant which depends on the distance between the true frequency and the value we want to emphasize.
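
The quantities involved can be explored numerically. The sketch below assumes a uniform prior, so that the posterior after x successes in n trials is Beta(x + 1, n − x + 1), and evaluates the weighted differential entropy −∫ φ(p) f(p) ln f(p) dp with a midpoint rule. The prior, the grid size, and the example weight are assumptions for illustration; the abstract does not specify the form of the weight, so the bump used here is hypothetical. The Gaussian limit ½ ln(2πe p̂(1 − p̂)/n) is the standard large-n approximation of the unweighted (Shannon) case.

```python
import math


def log_beta_pdf(p, a, b):
    """Logarithm of the Beta(a, b) density at p, for 0 < p < 1."""
    log_norm = math.lgamma(a) + math.lgamma(b) - math.lgamma(a + b)
    return (a - 1) * math.log(p) + (b - 1) * math.log(1 - p) - log_norm


def weighted_entropy(x, n, weight=lambda p: 1.0, grid=20000):
    """Midpoint-rule estimate of -integral weight(p) f(p) ln f(p) dp.

    f is the Beta(x + 1, n - x + 1) posterior (uniform prior assumed).
    With the default weight this is the ordinary differential entropy.
    """
    a, b = x + 1, n - x + 1
    h = 1.0 / grid
    total = 0.0
    for i in range(grid):
        p = (i + 0.5) * h                  # midpoint of the i-th cell
        log_f = log_beta_pdf(p, a, b)
        total -= weight(p) * math.exp(log_f) * log_f
    return total * h


def gaussian_limit(x, n):
    """Large-n approximation of the unweighted posterior entropy."""
    p_hat = x / n
    return 0.5 * math.log(2 * math.pi * math.e * p_hat * (1 - p_hat) / n)


# Hypothetical weight emphasizing gamma = 1/2 (not the form used in the paper).
def bump(p, gamma=0.5, width=0.1):
    return math.exp(-((p - gamma) ** 2) / (2 * width ** 2))


print(weighted_entropy(500, 1000), gaussian_limit(500, 1000))
print(weighted_entropy(500, 1000, weight=bump))
```

For n = 1000 the numerical entropy and the Gaussian limit already agree to about two decimal places; the grid must stay much finer than the posterior width, which shrinks like 1/√n.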

A method for the phonetic decoding of words in automatic speech recognition is considered. The properties of the Kullback–Leibler divergence are used to derive an estimate of the distribution of the divergence between minimal speech units (e.g., single phonemes) within a single class. It is demonstrated that the minimum variance of the intraphonemic divergence is reached when the phonetic database is tuned to the voice of a single speaker. The estimates are confirmed by experimental results on the recognition of vowel sounds and isolated words of the Russian language.

We establish a new upper bound for the Kullback–Leibler divergence of two discrete probability distributions which are close in the sense that typically the ratio of probabilities is nearly one and the number of outliers is small.
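
For orientation, the sketch below computes the Kullback–Leibler divergence of two discrete distributions and compares it with a classical upper bound, D(p‖q) ≤ ln(1 + χ²(p‖q)). This is not the new bound from the abstract, which is not stated here; the classical bound is swapped in only to show what such an inequality looks like for nearby distributions. The function names and the example distributions are illustrative assumptions.

```python
import math


def kl_divergence(p, q):
    """D(p || q) in nats for two discrete distributions on the same support."""
    if len(p) != len(q):
        raise ValueError("p and q must have the same length")
    total = 0.0
    for pi, qi in zip(p, q):
        if pi == 0.0:
            continue            # 0 * ln(0 / q) is taken as 0
        if qi == 0.0:
            return math.inf     # p puts mass where q has none
        total += pi * math.log(pi / qi)
    return total


def chi_square_bound(p, q):
    """Classical bound ln(1 + chi^2(p || q)) >= D(p || q), assuming all q_i > 0."""
    chi2 = sum((pi - qi) ** 2 / qi for pi, qi in zip(p, q))
    return math.log1p(chi2)


# Two nearby distributions: every probability ratio is close to one.
p = [0.50, 0.30, 0.20]
q = [0.45, 0.35, 0.20]
print(kl_divergence(p, q), chi_square_bound(p, q))
```

The χ² bound follows from Jensen's inequality applied to the logarithm of the probability ratio.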

We consider a model for organizing cargo transportation between two node stations connected by a railway line with a certain number of intermediate stations. Cargo moves in one direction only. Such a situation may occur, for example, if one of the node stations is located in a region that produces raw material for a manufacturing industry located in another region, where the other node station is situated. The organization of freight traffic is performed by means of a number of technologies. These technologies determine the rules for taking on cargo at the initial node station, the rules of interaction between neighboring stations, and the rule for distributing cargo to the final node stations. The cargo transportation process is governed by a set control rule. For such a model, one must determine the possible modes of cargo transportation and describe their properties. The model is described by a finite-dimensional system of differential equations with nonlocal linear restrictions. The class of solutions satisfying the nonlocal linear restrictions is extremely narrow. This makes it necessary to “correctly” extend solutions of the system of differential equations to a class of quasi-solutions whose distinctive feature is gaps at a countable number of points. Using the fourth-order Runge–Kutta method, we were able to construct these quasi-solutions numerically and determine their rate of growth. The main technical difficulty was obtaining quasi-solutions that satisfy the nonlocal linear restrictions. Furthermore, we investigated how the quasi-solutions and, in particular, the sizes of the gaps (jumps) of the solutions depend on a number of model parameters characterizing the control rule, the cargo transportation technologies, and the intensity of cargo supply at a node station.
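
The numerical building block mentioned above, the classical fourth-order Runge–Kutta scheme, can be sketched as follows. The abstract does not give the transportation model's equations or its nonlocal restrictions, so the right-hand side below (simple exponential decay) is a placeholder assumption, and the sketch does not attempt to construct quasi-solutions with jumps.

```python
import math


def rk4_step(f, t, y, h):
    """One classical fourth-order Runge-Kutta step for y' = f(t, y).

    y is a list of floats; f returns a list of the same length.
    """
    k1 = f(t, y)
    k2 = f(t + h / 2, [yi + h / 2 * ki for yi, ki in zip(y, k1)])
    k3 = f(t + h / 2, [yi + h / 2 * ki for yi, ki in zip(y, k2)])
    k4 = f(t + h, [yi + h * ki for yi, ki in zip(y, k3)])
    return [
        yi + h / 6 * (a + 2 * b + 2 * c + d)
        for yi, a, b, c, d in zip(y, k1, k2, k3, k4)
    ]


def integrate(f, t0, y0, t_end, steps):
    """Integrate y' = f(t, y) from t0 to t_end with a fixed number of steps."""
    h = (t_end - t0) / steps
    y = list(y0)
    for i in range(steps):
        y = rk4_step(f, t0 + i * h, y, h)
    return y


# Placeholder system (an assumption, not the model from the paper): y' = -y.
def decay(t, y):
    return [-yi for yi in y]


approx = integrate(decay, 0.0, [1.0], 1.0, steps=100)
print(approx[0], math.exp(-1.0))
```

For a solution with jumps, one plausible approach is to integrate piecewise between the jump points and apply the jump condition at each of them, although the abstract does not say how the authors handled this.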

Event logs collected by modern information and technical systems usually contain enough data for automated discovery of process models. A variety of algorithms have been developed for process model discovery, conformance checking, log-to-model alignment, comparison of process models, etc.; nevertheless, quick analysis of ad-hoc selected parts of a journal still has not received a full-fledged implementation. This paper describes a ROLAP-based method of multidimensional event log storage for process mining. The result of the analysis of the journal is visualized as a directed graph representing the union of all possible event sequences, ranked by their occurrence probability. Our implementation allows the analyst to discover process models for sublogs defined by an ad-hoc selection of criteria and a value of occurrence probability.
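
The idea of a directed graph over event sequences, filtered by ad-hoc criteria and by occurrence probability, can be sketched in plain Python. This is not the ROLAP-based implementation described above: the record layout (`case`, `activity`, `time`, `dept`), the function name `directly_follows`, and the choice to normalize each edge by the outgoing transitions of its source activity are all assumptions made for illustration.

```python
from collections import Counter, defaultdict


def directly_follows(events, keep=lambda e: True, min_prob=0.0):
    """Build a directly-follows graph with edge probabilities.

    events:   iterable of dicts with keys "case", "activity", "time".
    keep:     ad-hoc selection predicate defining the sublog.
    min_prob: edges with a lower probability are dropped.
    Returns {(source, target): probability}, where the probability is the
    share of transitions leaving `source` that go to `target`.
    """
    traces = defaultdict(list)
    for e in sorted(events, key=lambda e: e["time"]):
        if keep(e):
            traces[e["case"]].append(e["activity"])

    counts = Counter()
    for activities in traces.values():
        counts.update(zip(activities, activities[1:]))  # consecutive pairs

    outgoing = Counter()
    for (source, _), c in counts.items():
        outgoing[source] += c

    graph = {edge: c / outgoing[edge[0]] for edge, c in counts.items()}
    return {edge: p for edge, p in graph.items() if p >= min_prob}


# Toy log with a hypothetical "dept" attribute used as a selection criterion.
log = [
    {"case": 1, "activity": "A", "time": 1, "dept": "x"},
    {"case": 1, "activity": "B", "time": 2, "dept": "x"},
    {"case": 1, "activity": "C", "time": 3, "dept": "x"},
    {"case": 2, "activity": "A", "time": 4, "dept": "y"},
    {"case": 2, "activity": "C", "time": 5, "dept": "y"},
    {"case": 3, "activity": "A", "time": 6, "dept": "x"},
    {"case": 3, "activity": "B", "time": 7, "dept": "x"},
    {"case": 3, "activity": "C", "time": 8, "dept": "x"},
]
print(directly_follows(log))
print(directly_follows(log, keep=lambda e: e["dept"] == "x", min_prob=0.5))
```

In a relational setting the same grouping and counting would typically be expressed as aggregate queries over a fact table of events, with the selection criteria as dimension filters.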

The geographic information system (GIS) is based on the first and only Russian Imperial Census of 1897 and the First All-Union Census of the Soviet Union of 1926. The GIS features vector data (shapefiles) of all provinces of the two states. For the 1897 census, there is information about linguistic, religious, and social estate groups. The part based on the 1926 census features nationality. Both shapefiles include information on gender and on the rural and urban population. The GIS allows for producing any necessary maps for individual studies of the period which require the administrative boundaries and demographic information.

Existing approaches suggest that IT strategy should be a reflection of business strategy. In practice, however, organisations often do not follow their business strategy even if it is formally declared. Under these conditions, IT strategy can be viewed not as a plan, but as an organisational shared view on the role of information systems. This approach generally reflects only a top-down perspective of IT strategy. It can therefore be supplemented by a strategic behaviour pattern (i.e., a more or less standard response to changes that is formed as a result of previous experience) to implement a bottom-up approach. Two components that can help to establish an effective reaction to new initiatives in IT are proposed here: a model of IT-related decision making, and an efficiency measurement metric to estimate the maturity of business processes and the appropriate IT. The use of the proposed tools is demonstrated in practical cases.