| Bibliography | Yu, Qiansu: Audio-Visual-Semantic Gaze Target Prediction. University of Stuttgart, Faculty of Computer Science, Electrical Engineering, and Information Technology, Master Thesis No. 95 (2025). 61 pages, English.
|
| Abstract | Gaze behavior provides essential cues for understanding social interaction and conversational dynamics. Although recent gaze target prediction methods achieve strong performance using visual information alone, they remain limited in real multi-party dialogue, where audio signals and spoken semantics also influence attention. Existing audio-visual approaches often rely on simplified assumptions about conversational behavior and generally overlook the semantic content of speech, leaving the role of verbal information insufficiently explored. This thesis addresses these limitations by introducing a tri-modal audio–visual–semantic gaze prediction framework. We extract vocal characteristics, transcribe speech into utterance-level semantics, and derive multilingual text embeddings to capture conversational meaning, while visual scene context is encoded using pre-trained visual backbones. All modalities are time-aligned and fused through cross-modal attention to generate pixel-level gaze heatmaps. To evaluate the contribution of each information source, we conduct ablation studies comparing our full model against visual-only baselines and audio-visual frameworks. The results reveal how vocal cues and conversational semantics influence gaze behavior and demonstrate the potential of multimodal modeling for understanding attention in dynamic social interactions.
|
| Department(s) | University of Stuttgart, Institute of Visualisation and Interactive Systems, Visualisation and Interactive Systems
|
| Supervisor(s) | Bulling, Prof. Andreas; Qiu, Huajian |
| Entry date | March 16, 2026 |