?
Анализ влияния обфускации входных данных на эффективность языковых моделей в обнаружении инъекции подсказок
The article addresses the issue of prompt obfuscation as a means of circumventing protective mechanisms in large language models (LLMs) designed to detect prompt injections. Prompt injections represent a method of attack in which malicious actors manipulate input data to alter the model's behavior and cause it to perform undesirable or harmful actions. Obfuscation involves various methods of changing the structure and content of text, such as replacing words with synonyms, scrambling letters in words, inserting random characters, and others. The purpose of obfuscation is to complicate the analysis and classification of text in order to bypass filters and protective mechanisms built into language models. The study conducts an analysis of the effectiveness of various obfuscation methods in bypassing models trained for text classification tasks. Particular attention is paid to assessing the potential implications of obfuscation for security and data protection. The research utilizes different text obfuscation methods applied to prompts from the AdvBench dataset. The effectiveness of the methods is evaluated using three classifier models trained to detect prompt injections. The scientific novelty of the research lies in analyzing the impact of prompt obfuscation on the effectiveness of language models in detecting prompt injections. During the study, it was found that the application of complex obfuscation methods increases the proportion of requests classified as injections, highlighting the need for a thorough approach to testing the security of large language models. The conclusions of the research indicate the importance of balancing the complexity of the obfuscation method with its effectiveness in the context of attacks on models. Excessively complex obfuscation methods may increase the likelihood of injection detection, which requires further investigation to optimize approaches to ensuring the security of language models. The results underline the need for the continuous improvement of protective mechanisms and the development of new methods for detecting and preventing attacks on large language models.