Master Thesis MSTR-2024-23

Bibliography: Thewes, Jan-Philipp: Multimodal LLM for Theory of Mind modeling in collaborative tasks.
University of Stuttgart, Faculty of Computer Science, Electrical Engineering, and Information Technology, Master Thesis No. 23 (2024).
72 pages, English.
Abstract

The ability to infer the beliefs, desires, and intentions of others, known as Theory of Mind (ToM), is crucial for effective collaboration. In this work, we explore this ability in the context of task-oriented human-machine collaboration, focusing on Multimodal Large Language Models (MM-LLMs). While previous works relied on fixed question-answer pairs or explicit ToM modeling, we investigate the implicit ToM capabilities of MM-LLMs within the multimodal research environment Mindcraft. We propose a model architecture that integrates video, text, and knowledge graphs to create a more realistic and flexible collaborative interface. Our findings show that MM-LLMs not only outperform specialized baseline models in ToM tasks but also match human performance in some scenarios. Furthermore, our model accurately predicts both its own and its partner's missing knowledge in collaborative situations, demonstrating its potential for common-ground reasoning. However, our experiments could not confirm the importance of multimodality for ToM tasks, suggesting that task-specific video sampling and encoding might be crucial for successful multimodal reasoning. Overall, this work reinforces the potential of MM-LLMs to enable more intuitive and efficient human-machine collaboration while surpassing previous baselines in ToM task performance within multimodal environments.

Department(s): University of Stuttgart, Institute for Visualisation and Interactive Systems
Supervisor(s): Bulling, Prof. Andreas; Bortoletto, Matteo
Entry date: August 8, 2024