Master Thesis MSTR-2024-06

Bibliography
Ma, Yingpeng: Analysing human vs. neural attention in VQA.
University of Stuttgart, Faculty of Computer Science, Electrical Engineering, and Information Technology, Master Thesis No. 6 (2024).
55 pages, English.
Abstract

Visual Question Answering (VQA) has drawn substantial interest in both academic and industrial research in recent years. Driven by Vision Transformers (ViT) and the vision-text co-attention mechanism, VQA models have shown notable performance improvements. Yet the black-box nature of neural attention makes it hard to understand how these models work and to establish their trustworthiness. Drawing on prior work, this thesis seeks to demystify these mechanisms. Specifically, we aim to 1) extract the neural attention weights of VQA models, 2) remap the weights to machine attention maps, 3) compare machine attention with human gaze heatmaps, and 4) compute related metrics to provide deeper insight into the attention patterns. First, we attempt to reproduce the MCAN model implementation and machine attention extraction on the VQA-MHUG dataset within the MULAN framework. A comparison with the official implementations verifies the correctness of the re-implementation. Then, using the toolkit of the MULAN framework, the 1D attention weights are remapped to 2D neural attention maps. Next, these attention maps are compared to the human gaze heatmaps of VQA-MHUG using explainable AI (XAI) metrics. Following the same pipeline, another experiment is conducted on the AiR-D dataset, reporting the Area Under the ROC Curve (AUC), Spearman's rank correlation coefficient (rho), and Jensen-Shannon Divergence (JSD) to compare neural attention with human gaze heatmaps. Finally, the differences between the official and reproduced implementations are discussed, alongside insights on the interpretability of neural attention in VQA models.
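
The remapping and comparison steps named in the abstract can be illustrated with a minimal sketch. It is not the MULAN toolkit or the thesis code: the function names are hypothetical, the remapping assumes the 1D weights correspond to a row-major grid of image cells, and only Spearman's rho and the Jensen-Shannon Divergence are shown (AUC is omitted for brevity).

```python
import math

def remap_to_grid(weights, rows, cols):
    """Reshape flat attention weights into a rows x cols grid (row-major, assumed layout)."""
    if len(weights) != rows * cols:
        raise ValueError("number of weights must equal rows * cols")
    return [weights[r * cols:(r + 1) * cols] for r in range(rows)]

def normalize(grid):
    """Flatten a 2D map and scale it so its values sum to 1."""
    flat = [v for row in grid for v in row]
    total = sum(flat)
    if total <= 0:
        raise ValueError("map must have positive total mass")
    return [v / total for v in flat]

def jensen_shannon_divergence(p, q):
    """JSD between two distributions, base-2 logarithm, so the value lies in [0, 1]."""
    m = [(a + b) / 2 for a, b in zip(p, q)]

    def kl(x, y):
        # Terms with x_i == 0 contribute nothing to the KL sum.
        return sum(a * math.log2(a / b) for a, b in zip(x, y) if a > 0)

    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

def average_ranks(values):
    """Ranks starting at 1; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks

def spearman_rho(x, y):
    """Spearman's rho: Pearson correlation of the ranks (undefined for constant input)."""
    rx, ry = average_ranks(x), average_ranks(y)
    n = len(rx)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sx = math.sqrt(sum((a - mx) ** 2 for a in rx))
    sy = math.sqrt(sum((b - my) ** 2 for b in ry))
    return cov / (sx * sy)

# Toy example: a 2x2 machine attention map and a human gaze heatmap of the same size.
machine = normalize(remap_to_grid([0.1, 0.2, 0.3, 0.4], 2, 2))
human = normalize([[4.0, 3.0], [2.0, 1.0]])
print(spearman_rho(machine, human), jensen_shannon_divergence(machine, human))
```

In practice the human heatmap and the machine map would first have to be brought to the same resolution; that step depends on the model's feature layout and is not shown here.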

Department(s): University of Stuttgart, Institute of Visualisation and Interactive Systems, Visualisation and Interactive Systems
Supervisor(s): Bulling, Prof. Andreas; Wang, Yao; Hindennach, Susanne
Entry date: May 21, 2024