CLEVR-BT-DB: a benchmark dataset to evaluate the reasoning abilities of deep neural models in visual question answering problems

Latipov I.; Andrey Borevskiy; A. Kertesz-Farkas

doi:10.1117/12.3027602

Ch. 1316909.

Deep learning-based machine reasoning and visual question answering (VQA) models achieve near-human performance on their respective datasets; however, their performance drops dramatically under domain shift, suggesting that these models fail to generalize to the level of human-like reasoning. In this paper we present a new CLEVR-like dataset of image-question pairs for evaluating the visual reasoning capability of deep models. The objects in the images are arranged so that the first half of the question is ambiguous and multiple answers seem correct up to that point; the second half of the question then resolves the ambiguity, so that the whole VQA task becomes unambiguous and a unique answer can be reported. Deep models therefore need to handle ambiguity in their neurons during the reasoning process. They can do so either by traversing a graph (or tree) in the search space with backtracking, or by refining a candidate set of possibly correct answers, iteratively eliminating incorrect ones as the reasoning proceeds. We call this dataset CLEVR with Back-Tracking Database (CLEVR-BT-DB). It consists of 2,500 images and 10,000 questions in the same format as the standard CLEVR and is available at https://huggingface.co/datasets/Aborevsky01/CLEVR-BT-DB. The code to generate additional data is available at https://github.com/AFigaro/CLEVR_BT_DB. We tested MDETR, a recent deep model for VQA from Meta Research: it achieves an accuracy of 99.7% on the standard CLEVR dataset but only 28.01% on our CLEVR-BT-DB dataset.
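The candidate-set refinement described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' code and does not use the CLEVR-BT-DB format: the toy scene, the attribute names, and the `refine` helper are all hypothetical, chosen only to show how the first half of a question can leave several candidates alive while the second half reduces them to one.

```python
from typing import Any, Callable, Dict, List, Tuple

Obj = Dict[str, Any]

# Hypothetical toy scene (NOT the CLEVR-BT-DB schema): two cubes to the left of one sphere.
SCENE: List[Obj] = [
    {"id": 0, "shape": "cube", "color": "red", "material": "metal", "x": 1},
    {"id": 1, "shape": "cube", "color": "blue", "material": "rubber", "x": 2},
    {"id": 2, "shape": "sphere", "color": "green", "material": "metal", "x": 5},
]


def refine(
    candidates: List[Obj], predicates: List[Callable[[Obj], bool]]
) -> Tuple[List[Obj], List[int]]:
    """Apply predicates in order, dropping objects that fail each one.

    Returns the surviving candidates and the candidate count after every step.
    """
    history = [len(candidates)]
    for keep in predicates:
        candidates = [obj for obj in candidates if keep(obj)]
        history.append(len(candidates))
    return candidates, history


# Toy question: "What color is the cube left of the sphere ... that is made of rubber?"
sphere_x = min(obj["x"] for obj in SCENE if obj["shape"] == "sphere")
first_half = [
    lambda obj: obj["shape"] == "cube",
    lambda obj: obj["x"] < sphere_x,
]
second_half = [lambda obj: obj["material"] == "rubber"]

# After the first half, more than one candidate remains (the question is still ambiguous).
partial, _ = refine(SCENE, first_half)

# The second half eliminates the remaining wrong candidate.
final, history = refine(SCENE, first_half + second_half)
answer = final[0]["color"] if len(final) == 1 else None

print(len(partial), history, answer)
```

In this sketch a model that commits to a single object after the first half has to backtrack if it picked the wrong cube, whereas keeping the whole candidate set avoids that at the cost of tracking several hypotheses at once.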

Keywords: visual question answering, machine reasoning

Publication based on the results of:

Robust and accurate analysis of the data modalities in mass spectrometry (2024)

In book

Proceedings Volume 13169. Fifth International Conference On Computer Vision And Computational Intelligence (CVCI 2024) 29-31 January 2024, Bangkok, Thailand

SPIE, 2024.

RuCLEVR: A Russian Diagnostic Dataset for Compositional Language and Elementary Visual Reasoning

Biryukova K., Chelnokova D., Erkenova J. et al., Communications in Computer and Information Science 2024 Vol. 2364 CCIS P. 109–121

Added: February 25, 2026

Analyzing the Robustness of Vision & Language Models

Shirnin A., Andreev N., Potapova S. et al., IEEE/ACM Transactions on Speech and Language Processing 2024 Vol. 32 P. 2751–2763

We present an approach to evaluate the robustness of pre-trained vision and language (V&L) models to noise in input data. Given a source image/text, we perturb it using standard computer vision (CV) / natural language processing (NLP) techniques and feed it to a V&L model. To track performance changes, we explore the problem of visual ...

Added: July 19, 2024

Error Analysis for Visual Question Answering

Podtikhov A., Shaban M., Kovalev A. et al., in: Advances in Neural Computation, Machine Learning, and Cognitive Research IV. Selected Papers from the XXII International Conference on Neuroinformatics 2020. Studies in Computational Intelligence. Vol. 925. Springer, 2021. P. 283–292.

In recent years, the task of visual question answering (VQA), at the intersection of computer vision and natural language processing, has been gaining interest in the scientific community. Even though modern systems achieve good results on standard datasets, these results are far from what is achieved in computer vision or natural language processing separately, for example, ...

Added: October 30, 2020