Master Thesis MSTR-2021-88

Bibliography: Mukherjee, Adrika: Integrating common sense knowledge in visual question answering systems.
University of Stuttgart, Faculty of Computer Science, Electrical Engineering, and Information Technology, Master Thesis No. 88 (2021).
74 pages, English.

Visual Question Answering (VQA) is a long-standing problem at the intersection of Natural Language Processing and Computer Vision. The task is to build a system that answers a natural language question about an image. This thesis explores knowledge-based VQA systems. Knowledge-based VQA differs from vanilla VQA in that it requires external knowledge, in addition to the visual attributes of the given image, to produce the correct answer. Knowledge-based VQA architectures such as ConceptBert [4] commonly use entire knowledge bases, such as ConceptNet [1], ATOMIC [2], or WebChild [3], for training. Narasimhan et al. [5] and Ziaeefard et al. [6] proposed different knowledge retrieval and filtering methods that can be used to augment knowledge, but these methods are neither error-free nor generalizable. These approaches have a drawback: a large number of irrelevant and unwanted facts generates noise, and the model cannot determine which fact is required to answer the question. The common sense knowledge-based dataset FVQA [7] has only one triple mapped to each question, and this knowledge triple is essential to answer that question. However, a more general system is needed that can handle questions requiring more than one fact. Moreover, the Knowledge Base (KB) may not contain any fact that is sufficient to answer the question. A more fundamental understanding of the different objects around us is required. To this end, a new architecture is proposed that uses the automatic knowledge graph construction method formulated in COMET [8] to generate graph structures, or sets of fact triples, relevant to a given image and question. This set of acquired triples is used to answer the given question about an image. Additionally, this thesis comprehensively studies existing approaches for knowledge-based VQA architectures and analyzes their shortcomings.
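As a rough illustration of the "set of fact triples" idea, the following Python sketch collects (subject, relation, object) triples for concepts taken from an image and a question. It is a minimal sketch under stated assumptions, not the thesis implementation: the names `Triple`, `build_fact_set`, and `toy_generator` are hypothetical, and the generator callable is only a placeholder for a model such as COMET [8], whose actual interface is not described in this record. The relation names follow ConceptNet [1] conventions.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, List


@dataclass(frozen=True)
class Triple:
    """A single knowledge fact: (subject, relation, object)."""
    subject: str
    relation: str
    obj: str


# Hypothetical interface: any callable that maps (subject, relation) to
# candidate objects. It stands in for a generative model such as COMET.
Generator = Callable[[str, str], Iterable[str]]


def build_fact_set(concepts: Iterable[str],
                   relations: Iterable[str],
                   generate: Generator) -> List[Triple]:
    """Query the generator for every (concept, relation) pair and return
    the resulting triples without duplicates, in first-seen order."""
    relations = list(relations)  # iterated once per concept
    seen = set()
    facts: List[Triple] = []
    for concept in concepts:
        for relation in relations:
            for obj in generate(concept, relation):
                triple = Triple(concept, relation, obj)
                if triple not in seen:
                    seen.add(triple)
                    facts.append(triple)
    return facts


def toy_generator(subject: str, relation: str) -> List[str]:
    """Toy lookup table used only to make the example runnable."""
    table = {
        ("umbrella", "UsedFor"): ["keeping dry"],
        ("umbrella", "AtLocation"): ["closet"],
        ("rain", "Causes"): ["wet ground"],
    }
    return table.get((subject, relation), [])


# Concepts could come from detected objects and question words (assumption).
facts = build_fact_set(
    ["umbrella", "rain", "umbrella"],
    ["UsedFor", "AtLocation", "Causes"],
    toy_generator,
)
for fact in facts:
    print(fact)
```

Because the facts are generated per question rather than looked up in a fixed KB, a question can be paired with several triples at once, which is the generalization over the one-triple-per-question setting of FVQA [7] that the abstract argues for.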
Finally, a novel pipeline-based architecture is proposed that integrates common sense knowledge into VQA systems using an automatic knowledge graph construction mechanism. COMET [8], trained on ConceptNet [1], is adopted as the source of general knowledge. The performance of the model is assessed on the challenging FVQA [7] dataset. This dataset was built so that each image-question pair is associated with a particular fact from a KB, and that fact is essential to answer the question. Hence, the FVQA dataset gives a fair estimate of the model's generalisability and its grasp of common sense, since COMET is not guaranteed to generate the exact triple required to answer the question. A set of experiments was conducted using various knowledge graph embedding techniques, different modality combinations (image, knowledge, and question), and combinations of different modules in the pipeline. Two different attention-based models for final training were compared. Finally, the best combination is compared with other state-of-the-art (SOTA) approaches. The results are comparable to those of existing methods, even though an actual KB, or filtered triples from an actual KB, was not always used for training. An accuracy of 61.92% was achieved when the stacked-attention-based model [9] was trained with image, question, and knowledge features. In conclusion, a detailed analysis of the model and possibilities for future extensions are reported.

References
[1] Hugo Liu and Push Singh. ConceptNet - a practical commonsense reasoning tool-kit. BT Technology Journal, 22(4):211–226, 2004.
[2] Maarten Sap, Ronan LeBras, Emily Allaway, Chandra Bhagavatula, Nicholas Lourie, Hannah Rashkin, Brendan Roof, Noah A. Smith, and Yejin Choi. ATOMIC: An atlas of machine commonsense for if-then reasoning. AAAI, 2019.
[3] N. Tandon, G. de Melo, F. Suchanek, and G. Weikum. WebChild: Harvesting and organizing commonsense knowledge from the web. International Conference on Web Search and Data Mining. ACM, 2014.
[4] François Gardères, Maryam Ziaeefard, Baptiste Abeloos, and Freddy Lecue. ConceptBert: Concept-aware representation for visual question answering. Pages 489–498, November 2020.
[5] Medhini Narasimhan and Alexander G. Schwing. Straight to the facts: Learning knowledge base retrieval for factual visual question answering. 2018.
[6] Maryam Ziaeefard and Freddy Lecue. Towards knowledge-augmented visual question answering. Pages 1863–1873, December 2020.
[7] Peng Wang, Qi Wu, Chunhua Shen, Anton van den Hengel, and Anthony Dick. FVQA: Fact-based visual question answering. 2017.
[8] Antoine Bosselut, Hannah Rashkin, Maarten Sap, Chaitanya Malaviya, Asli Çelikyilmaz, and Yejin Choi. COMET: Commonsense transformers for automatic knowledge graph construction. 2019.
[9] Zichao Yang, Xiaodong He, Jianfeng Gao, Li Deng, and Alex Smola. Stacked attention networks for image question answering. 2016.

Department(s): University of Stuttgart, Institute for Natural Language Processing
Supervisor(s): Vu, Prof. Ngoc Thang; Tilli, Pascal
Entry date: April 26, 2022