Photo privacy detection based on text classification and face clustering

Kopeykina L., Savchenko A.

Ch. 39. P. 171–176.

Photo privacy detection is becoming an acute task because of the wide spread of mobile devices whose photos are published on social networks. Since a photo may contain private or sensitive data, there is an urgent need to detect such photos accurately and to restrict their processing. In this paper we focus on the task of personal data detection in a photo gallery and propose a novel two-stage approach. In the first stage, text in scanned documents is located with the EAST text detector, and the extracted text is recognized using Tesseract and a neural network classifier. In the second stage, face clustering is applied to the remaining photos to identify large groups of people (friends, relatives), whose photos are also treated as personal data and must be processed directly on the mobile device. All other images can be sent to a remote server for processing with higher accuracy. We present experimental results for the text recognition and face clustering methods, using various convolutional networks for facial feature extraction.
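The two-stage routing described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the `Photo` fields, the greedy clustering rule, and the `threshold` and `min_cluster_size` parameters are all assumptions made for illustration, and the outputs of EAST, Tesseract, the text classifier, and the face feature extractor are represented by precomputed placeholder values.

```python
from dataclasses import dataclass, field
from math import dist


@dataclass
class Photo:
    name: str
    # Assumed stage-1 result: the text classifier flagged recognized text as personal data.
    has_sensitive_text: bool = False
    # Assumed stage-2 input: one feature vector per detected face.
    face_embeddings: list = field(default_factory=list)


def cluster_faces(embeddings, threshold):
    """Greedy clustering (an illustrative stand-in, not the paper's method):
    a face joins the first cluster whose first member is closer than
    `threshold`; otherwise it starts a new cluster."""
    representatives = []  # first embedding of each cluster
    labels = []
    for emb in embeddings:
        for label, rep in enumerate(representatives):
            if dist(emb, rep) < threshold:
                labels.append(label)
                break
        else:
            representatives.append(emb)
            labels.append(len(representatives) - 1)
    return labels


def route(photos, threshold=0.5, min_cluster_size=3):
    """Return {photo name: "device" or "server"} following the two stages."""
    decisions = {}
    remaining = []
    # Stage 1: photos with sensitive document text stay on the device.
    for photo in photos:
        if photo.has_sensitive_text:
            decisions[photo.name] = "device"
        else:
            remaining.append(photo)
    # Stage 2: cluster all faces found in the remaining photos.
    faces = [(p.name, emb) for p in remaining for emb in p.face_embeddings]
    labels = cluster_faces([emb for _, emb in faces], threshold)
    photos_per_cluster = {}
    for (name, _), label in zip(faces, labels):
        photos_per_cluster.setdefault(label, set()).add(name)
    # A person seen in many photos is assumed to be a friend or relative.
    private = set()
    for names in photos_per_cluster.values():
        if len(names) >= min_cluster_size:
            private |= names
    for photo in remaining:
        decisions[photo.name] = "device" if photo.name in private else "server"
    return decisions
```

In this sketch a photo is kept on the device if it either contains sensitive text or shows a face that recurs in at least `min_cluster_size` photos; everything else is marked for server-side processing. Both cut-off values are hypothetical and would need tuning on real data.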

Language: English

Keywords: text classification, data privacy, facial clustering, text detection, text detection in images

Publication based on the results of:

Efficient multimedia data recognition methods for analyzing the preferences of mobile device users (2019)

In book

Proceedings of the VI International conference Information Technology and Nanotechnology. Session Image Processing and Earth Remote Sensing (ITNT-IPERS)

Vol. 2665: Information Technology and Nanotechnology. Image Processing and Earth Remote Sensing 2020. Samara: CEUR Workshop Proceedings, 2020.

Determinants of Consent to Personal Data Surveillance: Experimental Evidence from Russia

Sizov A., Rodionova M., Sedashov E. et al. / NRU Higher School of Economics. Series PS "Political Science". 2026. No. 1.

The rapid development of surveillance technologies is one of the most socially important consequences of the digital age. This paper investigates the factors determining consent to surveillance of various types of personal data and contributes to rapidly growing research on citizens' perceptions of surveillance practices. Relying on a comprehensive survey experiment, we study the effects of ...

Added: May 15, 2026

Discriminative Lemmatization of Abbreviations in the LLM Era

Глазкова А. В., Смаль И. В., Lyashevskaya O. et al., Доклады Российской академии наук. Математика, информатика, процессы управления (formerly Доклады Академии Наук. Математика) 2025 Vol. 527 P. 146–155

This paper presents a study on the effectiveness of discriminative methods for abbreviation lemmatization in Russian texts. Unlike generative approaches, discriminative models select the optimal lemma from a fixed set of candidates, eliminating the risk of generating grammatically incorrect word forms. For the first time in Russian language processing, we conduct a comprehensive analysis of ...

Added: March 10, 2026

Transformer-based approaches for lemmatizing abbreviations in Russian texts

Glazkova A., Lyashevskaya O., Morozov D. et al., Journal of Mathematical Sciences 2025 Vol. 546 P. 32–47

This paper addresses the task of lemmatizing abbreviations in the Russian language. Abbreviation lemmatization is particularly challenging, as it involves not only transforming a word into its normal form but also correctly expanding the abbreviation. We explore two approaches to this task, both leveraging large pretrained language models. The first approach is generative, where the ...

Added: March 10, 2026

Code of Ethics for Artificial Intelligence in Medicine and Healthcare

Абрамова А. В., Белоусова Е. Н., Ватюков С. Е. et al., Проблемы стандартизации в здравоохранении 2025 No. 5-6 P. 3–14

The improvement of artificial intelligence (AI) technologies and their rapid integration into the socially and economically significant medical industry create broad prospects for ensuring accessibility and quality of medical care, while at the same time creating new challenges related to the safety and ethical risks of using innovative solutions. This creates the need to develop ...

Added: December 7, 2025

Stalactite: toolbox for fast prototyping of vertical federated learning systems

Zakharova A., Alexandrov D., Khodorchenko M. et al., in: RecSys '24: Proceedings of the 18th ACM Conference on Recommender Systems.: Association for Computing Machinery (ACM), 2024. P. 1187–1190.

Machine learning (ML) models trained on datasets owned by different organizations and physically located in remote databases offer benefits in many real-world use cases. State regulations or business requirements often prevent data transfer to a central location, making it difficult to utilize standard machine learning algorithms. Federated Learning (FL) is a technique that enables models ...

Added: November 24, 2024

The "Ethics" Index of Artificial Intelligence Systems in Medicine: From Theory to Practice

Ugleva A. V., Shilova V. A., Карпова Е. А., Этическая мысль 2024 Vol. 24 No. 1 P. 144–159

The article presents the methodology developed at HSE University: the Index of Ethics of Artificial Intelligence Systems. The task of developing this Index was to assess real and possible ethical risks arising at all stages of the life cycle of AI systems. The system itself does not possess any “ethics”, while socially acceptable, morally permissible, and necessary may be ...

Added: July 15, 2024

Emotion Analysis of VKontakte Posts: Classifier or Regressor

Kolmogorova A., Калинин А. А., in: Компьютерная лингвистика и интеллектуальные технологии: по материалам международной конференции «Диалог 2022». Issue 21.: Изд-во РГГУ, 2022. P. 311–322.

The article summarizes the results of two machine learning tasks: classification of Russian-language social network posts by dominant emotion, and regression on the same data. The experiments are conducted on a dataset collected from the VKontakte social network and consisting of 3879 posts ...

Added: March 18, 2024

Machine learning approach for scientific and technical expertise

A. V. Belov, E. A. Egorova, Bulletin D. Serikbayev East Kazakhstan Technical University 2023 No. 4 P. 92–102

When conducting scientific and technical expertise, it is necessary to analyze the texts of reports on scientific research work. The analysis is carried out in order to determine whether the research being conducted belongs to the class of scientific research and development work in the field of IT. This article discusses the tasks of binary ...

Added: March 9, 2024

Classification of Short Scientific Texts

I. K. Kusakin, Fedorets O. V., A. Y. Romanov, Scientific and Technical Information Processing 2023 Vol. 50 No. 3 P. 176–183

This paper discusses modern approaches to natural language processing and the application of machine learning models to the task of classifying short scientific texts in Russian. This study is devoted to the analysis of methods for vectorization of textual information, selection of a model for scientific paper classification, and training of linguistic model BERT ...

Added: November 4, 2023

Secure Codes With Accessibility for Distributed Storage

Holzbaur L., Kruglik S., Frolov A. et al., IEEE Transactions on Information Forensics and Security 2021 Vol. 16 P. 5326–5337

A distributed storage system must support efficient access to stored data while ensuring recovery of temporarily unavailable nodes. Another important aspect of a distributed storage system is security. In this paper, we bring these features together and investigate the problem of efficient access to stored data in presence of a passive eavesdropper with access to ...

Added: September 9, 2023

The Scope of the Personal Data Concept in Russia

Зюбанов К. А., Legal Issues in the Digital Age 2023 Vol. 4 No. 1 P. 53–76

Personal data as an institution is gaining increasing attention on the part of both public authorities, business structures and private individuals as subjects of personal data. Meanwhile, an efficient and successful usage of the tools provided by this institution directly depends on whether the scope of the personal data concept can be unambiguously defined. The paper describes ...

Added: May 8, 2023

Near-Zero-Shot Suggestion Mining with a Little Help from WordNet

Alekseev A., Tutubalina E., Kwon S. et al., in: Analysis of Images, Social Networks and Texts. 10th International Conference, AIST 2021, Tbilisi, Georgia, December 16–18, 2021, Revised Selected Papers.: Cham: Springer, 2022. P. 23–36.

In this work we explore the constructive side of online reviews: advice, tips, requests, and suggestions that users provide about goods, venues and other items of interest. To reduce training costs and annotation efforts needed to build a classifier for a specific label set, we present and evaluate several entailment-based zero-shot approaches to suggestion classification ...

Added: April 10, 2023

Selection of Pseudo-Annotated Data for Adverse Drug Reaction Classification Across Drug Groups

Alimova I., Tutubalina E., in: Analysis of Images, Social Networks and Texts. 10th International Conference, AIST 2021, Tbilisi, Georgia, December 16–18, 2021, Revised Selected Papers.: Cham: Springer, 2022. P. 37–44.

Automatic monitoring of adverse drug events (ADEs) or reactions (ADRs) is currently receiving significant attention from the biomedical community. In recent years, user-generated data on social media has become a valuable resource for this task. Neural models have achieved impressive performance on automatic text classification for ADR detection. Yet, training and evaluation of these methods ...

Added: April 10, 2023

Using BERT to Classify Short Scientific Texts in Russian

Кусакин И. К., Цурупа А. М., Алмакаев А. В. et al., in: НТИ-2022. Научная информация в современном мире: глобальные вызовы и национальные приоритеты : материалы 10-ой научной конференции с международным участием, посвященной 70-летию ВИНИТИ РАН, Москва, 25–26 октября 2022 года.: Moscow: ВИНИТИ РАН, 2022. P. 103–109.

This work is devoted to the study of approaches for training BERT-based classifiers of scientific articles to implement the application with the adoption of the best models for use in the infrastructure of the VINITI RAS. For this purpose, the BERT linguistic model was trained on a specialized corpus of scientific texts for subsequent use ...

Added: January 31, 2023

A Study of Machine Learning Methods for Classifying Scientific Texts in Russian

Кусакин И. К., Федорец О. В., Romanov A., Научно-техническая информация. Серия 2: Информационные процессы и системы 2022 Vol. 12 P. 6–9

This paper discusses modern approaches to natural language processing and the application of artificial intelligence technologies to the task of classifying scientific texts in Russian. The report contains an analysis of implementations of text vectorization methods and a description of experiments with training various classifier models, from classical machine learning algorithms to neural network transformer architectures. ...

Added: January 31, 2023

Pulse of the Nation: Observable Subjective Well-Being in Russia Inferred from Social Network Odnoklassniki

Sergey Smetanin, Mathematics 2022 Vol. 10 No. 16 Article 2947

Policymakers and researchers worldwide are interested in measuring the subjective well-being (SWB) of populations. In recent years, new approaches to measuring SWB have begun to appear, using digital traces as the main source of information, and show potential to overcome the shortcomings of traditional survey-based methods. In this paper, we propose the formal model for ...

Added: August 15, 2022

Using a Homogeneous Semantic Network to Classify the Results of Genetic Analysis

Kharlamov A. A., Kulikov A., in: Neuroinformatics and Semantic Representations: Theory and Applications.: Cambridge Scholars Publishing, 2020. P. 219–231.

The paper shows how a mechanism for comparing the semantic networks of texts can be used in the task of diagnosing diseases with signal networks. Identifying the degree of overlap between the semantic networks of texts makes it possible to judge the degree of their semantic similarity. A homogeneous semantic network, as a set of nodes connected by arcs, has numerical characteristics: the frequencies of occurrence of words, and of word pairs, in the text, which are renormalized using ...

Added: December 7, 2021

TextAnalyst Technology for Automatic Semantic Analysis of Text

Kharlamov A. A., in: Neuroinformatics and Semantic Representations: Theory and Applications.: Cambridge Scholars Publishing, 2020. P. 156–167.

Based on ideas about information processing in the human brain [1], the TextAnalyst technology for automatic semantic text processing has been implemented. It identifies the key concepts of a text and their interrelations, and supports text summarization and semantic comparison (classification) of texts. Products using the functionality of this technology have been implemented: a personal product, TextAnalyst, and a library of COM modules, TextAnalyst SDK. ...

Added: December 7, 2021

Share of Toxic Comments among Different Topics: The Case of Russian Social Networks

Smetanin S., Komarov M. M., in: IEEE 23rd Conference on Business Informatics (CBI).: IEEE Computer Society, 2021. P. 65–70.

With the widespread use of online social networks, it is becoming more and more difficult to monitor and analyse all the user-generated content. Toxic speech in online conversations should be treated as a matter with serious social gravity, since it may result in both negative impacts on mental health and violent actions in the physical ...

Added: September 14, 2021