Master Thesis MSTR-2022-117

Bibliography: Müller, Ann-Sophia: Investigating multi-modal human-like attention integration into VQA models.
University of Stuttgart, Faculty of Computer Science, Electrical Engineering, and Information Technology, Master Thesis No. 117 (2022).
86 pages, English.

Neural attention has proven helpful in many natural language processing and computer vision tasks. One task that combines both fields is Visual Question Answering (VQA). Allowing models to focus on specific aspects of the image or text through neural attention has also improved results for VQA. Investigations of neural attention visualizations for transformer-based VQA models have shown that neural attention differs from human attention. This led to the first advances in incorporating human attention as guidance during model training, which has shown promising results for uni-modal integration. A multi-modal integration has only recently been proposed. Despite improvements through multi-modal human-like attention integration, critics point out that the position of neural attention weights does not directly correspond to the input, which makes the effectiveness of directly integrating human-like attention questionable. Therefore, in this thesis, the multi-modal human-like attention integration approach is 1) investigated by analyzing the effects of integrating human attention into different layers, 2) extended by a framework that tries to re-map the raw neural weights to the input before integrating human-like attention, and 3) replaced by supervision based on a proposed multi-objective loss function that optimizes for correct answer prediction and guides the model to generate attention maps that, when re-mapped to the input, are more human-like. Ablation of the proposed loss function leads to slightly better performance than MULAN on the VQAv2 val split. The thesis furthermore provides deep insights into the internal attention representations of the model.

Department(s): University of Stuttgart, Institute of Visualisation and Interactive Systems, Visualisation and Interactive Systems
Supervisor(s): Bulling, Prof. Andreas; Sood, Ekta
Entry date: April 8, 2024