Master Thesis MSTR-2022-116

Bibliography	von Hochmeister, Manuel: Attention-based Reasoning over Multimodal Embeddings in Video-grounded Dialogue. University of Stuttgart, Faculty of Computer Science, Electrical Engineering, and Information Technology, Master Thesis No. 116 (2022). 49 pages, English.
Abstract	Over the past few years, interacting with assistance systems through natural language has become a regular part of many people's lives. While purely text-based use cases, such as the speech assistants in modern phones, have made huge advances, implementing video dialogue systems remains a very challenging task. Such models could be very useful in the context of autonomous cars or service robots, but they suffer from a variety of problems. These include a lack of understanding of the visual input as well as insufficient long-term reasoning capabilities. Another problem is that such models often struggle when given a question that requires temporal or spatial localization in the video data. This thesis evaluates a novel video dialogue system architecture that is based on transformer layers and includes a new kind of state tracking component. The state tracker is the key contribution of this thesis; it uses the attention weights of the last transformer layer to identify the visual inputs that were relevant for answering previous questions. Those visual inputs, together with the previous questions and answers, are then used to create state vectors that represent the dialogue history. To evaluate the new model, the DVD dataset is used, since it is designed with current model weaknesses in mind and incorporates many different references across the individual dialogue turns that the system needs to resolve. The results of the experiment show that the new model architecture outperforms the baseline from the DVD dataset and also the baseline from a generative dataset, and that the state tracker component contributes significantly to the overall model performance on both. Performance gains are achieved across all provided categories, including the ability to resolve different short-term dependencies.
Department(s)	University of Stuttgart, Institute of Visualisation and Interactive Systems, Visualisation and Interactive Systems
Supervisor(s)	Bulling, Prof. Andreas; Abdessaied, Adnen
Entry date	April 8, 2024
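
The state tracking idea summarized in the abstract can be illustrated with a minimal sketch. It is not the thesis implementation: the function names, the averaging over heads and query positions, the top-k selection, and the mean pooling are all assumptions made for illustration, since the abstract only states that last-layer attention weights identify relevant visual inputs and that these are combined with previous questions and answers into state vectors.

```python
from typing import List, Sequence, Tuple

Vector = List[float]


def mean_vector(vectors: Sequence[Sequence[float]]) -> Vector:
    """Element-wise mean of equally sized vectors."""
    count = len(vectors)
    return [sum(column) / count for column in zip(*vectors)]


def visual_relevance(attention: Sequence[Sequence[Sequence[float]]],
                     visual_positions: Sequence[int]) -> List[float]:
    """Average attention each visual token receives (assumed scoring rule).

    attention[h][q][k] is the weight that head h of the last layer assigns
    from query position q to key position k.
    """
    scores = []
    for k in visual_positions:
        received = [row[k] for head in attention for row in head]
        scores.append(sum(received) / len(received))
    return scores


def build_state_vector(attention: Sequence[Sequence[Sequence[float]]],
                       visual_positions: Sequence[int],
                       visual_embeddings: Sequence[Sequence[float]],
                       question_embedding: Sequence[float],
                       answer_embedding: Sequence[float],
                       top_k: int = 2) -> Tuple[Vector, List[int]]:
    """Build one dialogue-turn state vector (hypothetical pooling scheme).

    Returns the state vector and the indices (into visual_embeddings)
    of the visual inputs that were selected as most relevant.
    """
    scores = visual_relevance(attention, visual_positions)
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    selected = sorted(ranked[:top_k])
    pooled_visual = mean_vector([visual_embeddings[i] for i in selected])
    state = mean_vector([pooled_visual, question_embedding, answer_embedding])
    return state, selected


# Toy turn: one head, two query positions, key position 0 is text,
# key positions 1 to 3 are visual tokens.
attention = [[[0.1, 0.6, 0.1, 0.2],
              [0.2, 0.5, 0.0, 0.3]]]
state, selected = build_state_vector(
    attention,
    visual_positions=[1, 2, 3],
    visual_embeddings=[[1.0, 0.0], [0.0, 1.0], [3.0, 2.0]],
    question_embedding=[1.0, 1.0],
    answer_embedding=[0.0, 4.0],
)
print(selected, state)
```

In this toy turn the first and third visual tokens receive the most attention, so only their embeddings enter the state vector. A real system would presumably use learned projections rather than plain averaging.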
