Error Analysis for Visual Question Answering
In recent years, the task of visual question answering (VQA), at the intersection of computer vision and natural language processing, has been gaining interest in the scientific community. Even though modern systems achieve good results on standard datasets, these results are far from what is achieved in computer vision or natural language processing separately, for example, in image classification or machine translation. One reason for this gap is the difficulty of modelling the interaction between modalities, which is partially addressed by the attention mechanism, as in the models used in this paper. Another reason lies in the statement of the problem itself: in addition to the problems inherited from CV and NLP, there are problems caused by the variety of scenes depicted in images and the questions that can be asked about them. In this paper, we analyze the errors of state-of-the-art approaches and separate them into several classes: text recognition errors, answer structure errors, entity counting errors, answer type errors, and answer ambiguity. Text recognition errors occur when answering questions like “what is written in ..?” and are associated with the representation of the image. Errors in the answer structure are associated with the reduction of VQA to a classification task. Entity counting is a known weakness of current models. A typical answer type error occurs when the model responds to a “Yes/No” question with an answer of a different type. Answer ambiguity errors occur when the model produces an answer that is correct in meaning but does not match the wording of the ground truth. Addressing these types of errors will lead to the overall improvement of VQA systems.
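To make the answer-structure error class concrete: a minimal sketch (our illustration, with toy names and scores, not the paper's actual model) of why casting VQA as classification over a fixed answer vocabulary constrains the output. The model can only emit an answer seen at training time, so a structurally different but correct answer such as "two dogs" is unreachable if only "2" is in the vocabulary.

```python
# Toy fixed answer vocabulary, as used when VQA is reduced to
# classification: every prediction must be one of these strings.
ANSWER_VOCAB = ["yes", "no", "2", "red", "dog"]

def predict(scores):
    """Pick the highest-scoring answer from the fixed vocabulary."""
    best = max(range(len(scores)), key=lambda i: scores[i])
    return ANSWER_VOCAB[best]

# Hypothetical fused image+question scores (in a real system these would
# come from an attention-based multimodal encoder).
scores = [0.1, 0.2, 3.5, 0.3, 0.4]
print(predict(scores))  # the model can answer "2", never "two dogs"
```

The sketch shows why such systems fail on answers whose structure (free-form phrases, multi-word counts) falls outside the closed answer set, independently of how well the visual and textual features are modelled.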