Master's Thesis MSTR-2024-23

Bibliographic data
Thewes, Jan-Philipp: Multimodal LLM for Theory of Mind modeling in collaborative tasks.
University of Stuttgart, Faculty of Computer Science, Electrical Engineering and Information Technology, Master's Thesis No. 23 (2024).
72 pages, English.
Abstract

The ability to infer the beliefs, desires, and intentions of others, known as Theory of Mind (ToM), is crucial for effective collaboration. In this work, we explore this ability in the context of task-oriented human-machine collaboration with a focus on Multimodal Large Language Models (MM-LLMs). While previous works relied on fixed question-answer pairs or explicit ToM modeling, we investigate the implicit ToM capabilities of MM-LLMs within the multimodal research environment Mindcraft. We propose a model architecture that integrates video, text and knowledge graphs to create a more realistic and flexible collaborative interface. Our findings show that MM-LLMs not only outperform specialized baseline models in ToM tasks but also achieve human performance in some scenarios. Furthermore, our model accurately predicts its own and the partner's missing knowledge in collaborative situations, demonstrating its potential for common-ground reasoning. However, the importance of multimodality for ToM tasks could not be confirmed in our experiments, suggesting that task-specific video sampling and encoding might be crucial for successful multimodal reasoning. Overall, this work reinforces the potential of MM-LLMs to enable more intuitive and efficient human-machine collaborations while surpassing previous baselines in ToM task performance within multimodal environments.

Full text and other links
Full text
Department(s): University of Stuttgart, Institute for Visualisation and Interactive Systems, Visualisation and Interactive Systems
Supervisor(s): Bulling, Prof. Andreas; Bortoletto, Matteo
Entry date: August 8, 2024