An Ensemble of Modern Computer Vision Models for Deepfake Detection
This article examines how well modern computer vision architectures perform on deepfake detection. Five architectures are considered: EfficientNet, Vision Transformer (ViT), VisionLSTM (ViL), Vision KAN, and Mamba Vision. The novelty of the approach lies in applying and comparing these architectures and in combining them into two-model ensembles to improve detection accuracy. In the experiment, each architecture was evaluated both on its own and as part of a two-model ensemble. The dataset was built from frames extracted from videos containing deepfakes, and the frames were subjected to various augmentations. The results show that ensembles of modern architectures can improve deepfake detection accuracy: the ensemble of ViT and VisionLSTM achieved an F1-score of 97.68%, higher than either architecture achieved individually. Not every ensemble improved the metrics, however. For example, combining Mamba Vision with VisionLSTM lowered the F1-score to 95.78% compared with Mamba Vision alone. The findings are relevant to practitioners in computer vision, cybersecurity, and multimedia content analysis. The architectures studied and their ensembles can be applied to detecting deepfakes and other forms of fake content, which is important for protection against information threats.
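To make the idea of a two-model ensemble and its F1-score concrete, here is a minimal sketch. The abstract does not say how the two models' outputs are combined, so averaging the per-frame "fake" probabilities (soft voting) is an assumption, as are the 0.5 decision threshold, the function names, and the toy scores, none of which come from the study.

```python
from typing import List, Sequence


def ensemble_fake_probability(p_a: Sequence[float], p_b: Sequence[float]) -> List[float]:
    """Average two models' per-frame 'fake' probabilities (soft voting, an assumed rule)."""
    if len(p_a) != len(p_b):
        raise ValueError("both models must score the same frames")
    return [(a + b) / 2.0 for a, b in zip(p_a, p_b)]


def predict(probs: Sequence[float], threshold: float = 0.5) -> List[int]:
    """Label a frame as fake (1) when its probability reaches the threshold."""
    return [1 if p >= threshold else 0 for p in probs]


def f1_score(y_true: Sequence[int], y_pred: Sequence[int]) -> float:
    """F1 for the positive class (1 = deepfake): harmonic mean of precision and recall."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Hypothetical toy data: six frames, 1 = deepfake, 0 = real.
labels = [1, 1, 1, 0, 0, 0]
model_a = [0.9, 0.4, 0.8, 0.3, 0.6, 0.1]  # made-up scores for a first model
model_b = [0.7, 0.8, 0.3, 0.2, 0.3, 0.4]  # made-up scores for a second model

f1_a = f1_score(labels, predict(model_a))
f1_b = f1_score(labels, predict(model_b))
f1_ensemble = f1_score(labels, predict(ensemble_fake_probability(model_a, model_b)))
```

On this toy data the two models make different mistakes, so the averaged scores classify every frame correctly. That is only an illustration: as the abstract notes, some pairs (Mamba Vision with VisionLSTM) scored lower than a single model.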