• A
  • A
  • A
  • ABC
  • ABC
  • ABC
  • А
  • А
  • А
  • А
  • А
Regular version of the site

Article

Автоматизация обнаружения и анализа ошибок в гиперконвергентных системах

The paper is devoted to the problem of early error detection and analysis in hyperconverged systems. One approach to organizing hyperconverged systems is to install on each physical server a separate instance of an operating system (OS) that carries virtualization tools and tools for administering and using a distributed data warehouse. Errors can occur both at the level of a single OS instance and at the level of the entire cluster. For example, incorrect control element commands from one infrastructure node can cause software failure on another node. In addition, errors from the subsystems of the cluster can provoke abnormal situations inside virtual machines. The complexity of the architecture of hyperconverged systems makes it difficult to analyze the errors that occur in them. To simplify such an analysis and increase its effectiveness, it is necessary to automate the process of detecting problems and collecting data necessary for their study and correction. Existing approaches for automation of error detection are described and various improvements are suggested to adopt them for systems where distributed storage and virtualization technologies are actively used. Improvements include log collection from the whole cluster just after the error occurred, additional analysis of guest operating system behaviour inside virtual machines, usage of a knowledge base for automated crash recovery and duplicate detection. Finally, a real-life scenario of error handling process in Virtuozzo company products is described starting from error detection and ending with fix deployment.