Enhancement of the Data Analysis Subsystem in the Task-Efficiency Monitoring System HPC TaskMaster for the cHARISMa Supercomputer Complex at HSE University
Detecting computational tasks that use high-performance computing (HPC) resources inefficiently is one of the major problems facing supercomputer centers. Such tasks can block valuable computational resources and slow down other users' computations. HPC TaskMaster, a task-performance monitoring system developed at the Higher School of Economics, addresses this issue by analyzing task metrics, aggregating them, calculating indicator values, assigning tags, and automatically generating inferences about task performance. In this paper, we describe an enhancement of the HPC TaskMaster task-efficiency analysis subsystem that introduces a new entity: parameters. This extension enables the detection of new types of problems, such as an incorrect choice of the type and number of computational resources. It also allows the inferences generated by the system to account for the variability of parameters.
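The analysis chain summarized above (metrics, aggregation, indicators, tags, inferences, plus the new parameters entity) can be illustrated with a minimal sketch. This is not the HPC TaskMaster implementation: the class `Task`, the functions `indicators`, `tags`, and `inference`, the tag names, and the 10% threshold are all hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass
from statistics import mean
from typing import Dict, List


@dataclass
class Task:
    # Parameters: what the user requested (hypothetical field names).
    requested_gpus: int
    requested_cores: int
    # Metrics: utilization samples collected while the task ran, in percent.
    cpu_util: List[float]
    gpu_util: List[float]


def indicators(task: Task) -> Dict[str, float]:
    """Aggregate raw metric samples into one indicator value per metric."""
    return {
        "avg_cpu_util": mean(task.cpu_util) if task.cpu_util else 0.0,
        "avg_gpu_util": mean(task.gpu_util) if task.gpu_util else 0.0,
    }


def tags(task: Task, ind: Dict[str, float], low: float = 10.0) -> List[str]:
    """Assign tags from indicators; `low` is an assumed threshold in percent."""
    result = []
    if ind["avg_cpu_util"] < low:
        result.append("low_cpu_utilization")
    # Parameter-aware rule: GPUs were requested but stayed almost idle,
    # which may point to an incorrect choice of resource type.
    if task.requested_gpus > 0 and ind["avg_gpu_util"] < low:
        result.append("gpus_requested_but_idle")
    return result


def inference(task_tags: List[str]) -> str:
    """Turn the assigned tags into a human-readable conclusion."""
    if not task_tags:
        return "No efficiency problems detected."
    return "Possible problems: " + ", ".join(task_tags)


if __name__ == "__main__":
    task = Task(requested_gpus=2, requested_cores=8,
                cpu_util=[80.0, 90.0], gpu_util=[0.0, 2.0])
    ind = indicators(task)
    print(inference(tags(task, ind)))
```

In this sketch the second rule can only be expressed because the requested resources (the parameters) are available next to the measured metrics; a rule based on metrics alone could not distinguish an idle GPU that was requested from a node that simply has no GPU workload.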