Master Thesis MSTR-2025-67

Bibliography: Zhao, Jiahao: Situated interactive guidance and assistance in extended reality using multimodal large language models.
University of Stuttgart, Faculty of Computer Science, Electrical Engineering, and Information Technology, Master Thesis No. 67 (2025).
54 pages, English.
Abstract

This thesis addresses the need for "providing actionable guidance for process tasks in real-world environments" by proposing an interactive XR system that combines a multimodal large language model with Apple Vision Pro. The system employs a three-stage architecture with cloud-end decoupling: Vision Pro handles acquisition and spatial projection, a Python bridge server manages sessions and state machines, and the LLM backend uses an open-source vision model for object localization and generates voice/text instructions. We also compare local inference with AWS cloud-based inference: local deployment offers lower latency, while cloud-based deployment is more scalable but incurs greater EBS throughput and initialization overhead. Finally, a study with eight participants shows that AR-based guidance achieved higher usability (SUS) scores and lower overall NASA-TLX workload. Completion time trended faster, although the difference was not significant. Participants also perceived AR guidance as more accurate. The thesis discusses trade-offs between local GPU and cloud-based deployment, current technical bottlenecks, and potential future research directions.

Department(s): University of Stuttgart, Institute of Visualisation and Interactive Systems, Visualisation and Interactive Systems
Supervisor(s): Schmalstieg, Prof. Dieter; Pabst, Michael
Entry date: December 19, 2025