Bibliography | Thewes, Jan-Philipp: Multimodal LLM for Theory of Mind modeling in collaborative tasks. University of Stuttgart, Faculty of Computer Science, Electrical Engineering, and Information Technology, Master Thesis No. 23 (2024). 72 pages, English.
|
Abstract | The ability to infer the beliefs, desires, and intentions of others, known as Theory of Mind (ToM), is crucial for effective collaboration. In this work, we explore this ability in the context of task-oriented human-machine collaboration, with a focus on Multimodal Large Language Models (MM-LLMs). While previous work relied on fixed question-answer pairs or explicit ToM modeling, we investigate the implicit ToM capabilities of MM-LLMs within the multimodal research environment Mindcraft. We propose a model architecture that integrates video, text, and knowledge graphs to create a more realistic and flexible collaborative interface. Our findings show that MM-LLMs not only outperform specialized baseline models on ToM tasks but also reach human-level performance in some scenarios. Furthermore, our model accurately predicts both its own and its partner's missing knowledge in collaborative situations, demonstrating its potential for common-ground reasoning. However, the importance of multimodality for ToM tasks could not be confirmed in our experiments, suggesting that task-specific video sampling and encoding might be crucial for successful multimodal reasoning. Overall, this work reinforces the potential of MM-LLMs to enable more intuitive and efficient human-machine collaboration while surpassing previous baselines in ToM task performance within multimodal environments.
|
Full text and other links | Full text
|
Department(s) | University of Stuttgart, Institute of Visualisation and Interactive Systems
|
Supervisor(s) | Bulling, Prof. Andreas; Bortoletto, Matteo |
Entry date | August 8, 2024 |