A software module for the control system of imported products of animal origin
This paper presents the results of developing a decision support system (DSS) for the Russian customs control information system. The system helps inspectors at the border veterinary checkpoints (BVC) of the Russian Federation decide whether passing cargo requires veterinary control. At present, a DSS developed with the participation of the authors is used for this task; its operation is based on the rules of fuzzy logic. This approach was chosen because the inspector must verbally state the reason for delaying a cargo, and not every approach (neural networks, for example) provides a detailed explanation of its decision. To express the level of veterinary hazard of countries and enterprises, a coefficient called 'riskiness' is used: a numerical value assigned to each enterprise, country, and type of product registered in the system. Responsible employees can change these values to reflect the current situation in the respective regions. A decision can then be explained, for example, as follows: 'IF the riskiness of the enterprise is 86%, we send the goods for veterinary inspection'. However, this approach has the following drawbacks: the number of derivable fuzzy-logic rules is small; the decision does not cover all the factors available for accounting; the accuracy of the proposed decisions is not monitored; and statistical information on inspection results is not taken into account. It was therefore necessary to improve the existing decision support system for imported goods so that these shortcomings are addressed. The following system requirements were formulated:
1) Ensure the possibility of entering statistical information about enterprises, countries, and the goods themselves; in addition, the number of inspections of such goods and the reasons for them must be taken into account.
2) Ensure that information can be entered on a new consignment for which a decision has not yet been made.
3) Ensure that an explanation of the decision taken can be obtained from the system. This requirement is common to all decision support systems.
4) Ensure that the system can be trained by examining the consequences of earlier decisions (for example, a 'clean' cargo may have been checked, or a cargo that, as it later turned out, had a violation may have been missed).
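As a minimal sketch of the rule-based explanation described earlier (the 86% riskiness example), the threshold check can be expressed as follows; the function name, threshold value, and wording are illustrative assumptions, not the actual production rules:

```python
# Hypothetical sketch of the riskiness-threshold rule described above.
# The threshold and names are illustrative, not the real DSS rules.

def inspection_decision(enterprise_riskiness: float,
                        threshold: float = 0.86) -> str:
    """Return a human-readable decision with its justification."""
    if enterprise_riskiness >= threshold:
        return (f"IF the riskiness of the enterprise is "
                f"{enterprise_riskiness:.0%}, "
                f"we send the goods for veterinary inspection")
    return "Pass without inspection"

print(inspection_decision(0.86))
```

The verbal form of the returned string is precisely what lets the inspector cite the reason for delaying a cargo.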
To build the DSS, it was decided to use methods of statistical data analysis. Such methods are complicated by the variety of forms of statistical regularities and by the complexity of the statistical research process itself. Several approaches exist for such problems; when choosing an algorithm, one should consider accuracy, training time, linearity, the number of influencing factors, and the number of features. For multiclass classification problems it is customary to use data analysis methods based on decision forests, logistic regression, neural networks, and the 'one against all' method. From the point of view of accuracy, the decision forest is preferable. However, every decision must be justified and presented to the inspector in an appropriate report. This requirement is met only by logistic regression: a method for constructing a linear classifier that estimates the a posteriori probabilities of objects belonging to classes. This method was therefore chosen for the decision support model. To explain to the inspector the reason for a decision, one can rely on the weight coefficients that the model assigns to each input parameter during training. Most examples of machine learning systems that process large amounts of data are written in R or Python. The latter is usually chosen for such problems, mainly because of the simplicity of writing programs and the availability of fast mathematical libraries that make it possible to create models and keep large amounts of data in RAM during development. The NumPy library for Python was used to load, process, and store the data in RAM.
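The approach above can be sketched with scikit-learn: a logistic-regression classifier is trained, and its weight coefficients are read out to justify a decision to the inspector. The feature names and the synthetic data below are illustrative assumptions, not the actual DSS inputs:

```python
# Sketch: logistic regression whose learned weights can explain a decision.
# Feature names and data are illustrative assumptions.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
feature_names = ["enterprise_riskiness", "country_riskiness",
                 "product_riskiness"]
X = rng.random((200, 3))
# Synthetic target: inspect when the combined riskiness is high.
y = (X @ np.array([0.6, 0.3, 0.1]) > 0.5).astype(int)

model = LogisticRegression()
model.fit(X, y)

# A posteriori probability of the "inspect" class for a new consignment.
proba = model.predict_proba([[0.9, 0.5, 0.2]])[0, 1]

# The weight coefficients show how strongly each factor drives the decision,
# which is the basis for the report shown to the inspector.
for name, w in zip(feature_names, model.coef_[0]):
    print(f"{name}: {w:+.2f}")
```

A larger positive weight on a factor means that increasing it pushes the predicted probability toward "send for inspection", which is exactly the kind of justification a linear classifier makes available.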
It extends the language's capabilities for working with arrays, adds support for large multidimensional matrices, and includes a number of fast high-level mathematical functions for operating on these arrays. The sklearn library was used to train the model. While working on the mathematical model of the decision support module, Jupyter Notebook, an interactive environment for creating informative analytical reports, was used. During model development, a set of factors for analysis was formed and a mathematical model was constructed. When logistic regression is used as the learning algorithm, the following transformation is advisable: for each categorical attribute, add a new column and put 1 in the records that belong to this category, with the remaining records receiving the value 0. In practice, this significantly improves the accuracy of the model. The decision support model was tested during its development. After the data transformation algorithms were applied, the percentage of discrepancy between the values predicted by the model and the actual values was calculated. On test data previously subjected to the necessary transformations, the model was able to predict the inspector's decision with an accuracy of 95.1%. This is an acceptable result, since the final decision on the imported cargo still rests with the inspector. For the final assessment of the model's accuracy, one more sample was used: a validation sample, needed to rule out the case where the model had been fitted to the specific test data. Such overfitting can occur when the selected characteristics improve prediction accuracy only in the specific cases encountered in the test data. The validation sample, 1000 records in size, consisted of records of incoming cargoes created later than the records of the training sample.
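The categorical transformation described above (each category becomes its own 0/1 column, commonly known as one-hot encoding) can be sketched with sklearn's `OneHotEncoder`; the country values here are illustrative assumptions:

```python
# Sketch of the described transformation: one 0/1 column per category.
# The category values are illustrative assumptions.
import numpy as np
from sklearn.preprocessing import OneHotEncoder

countries = np.array([["Brazil"], ["China"], ["Brazil"], ["Germany"]])

encoder = OneHotEncoder(dtype=int)
encoded = encoder.fit_transform(countries).toarray()

print(encoder.categories_[0])  # categories in sorted order
print(encoded)                 # one row per record, 1 marks its category
```

After this transformation each categorical attribute contributes one weight per category to the logistic-regression model, which is what allows per-category effects to be learned and reported.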
After the quality of the model was checked on these data, the accuracy was again 95.1%. These results indicate that the algorithm has not been overfitted and that the selected features are assessed adequately. Thus, the developed decision support module fully meets the stated requirements. Further work on improving the DSS will be aimed at developing methods for preliminary data processing and at searching for ways to increase the model's accuracy. It should be noted that the problem of feature correlation was not addressed during the development of the current algorithm. Eliminating dependencies between the input parameters would reduce their number, simplify the model, and increase its accuracy.
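A first step toward the dependency elimination mentioned above is to flag strongly correlated input features; a minimal sketch with NumPy, on synthetic data, could look as follows:

```python
# Sketch: detecting correlated features as a first step toward removing
# redundant input parameters. The data below is synthetic.
import numpy as np

rng = np.random.default_rng(1)
a = rng.random(100)
b = 2 * a + rng.normal(0, 0.01, 100)   # nearly a linear copy of `a`
c = rng.random(100)                     # independent feature

corr = np.corrcoef(np.vstack([a, b, c]))
# An off-diagonal |value| close to 1 flags a redundant feature pair,
# a candidate for removal before retraining the model.
print(np.round(corr, 2))
```

Dropping one feature of each highly correlated pair shrinks the parameter set, which is exactly the simplification the concluding remark anticipates.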