Master Thesis MSTR-2023-19

Bibliography: Nie, Shangrui: Video Based Crossmodal Representation Learner for Emotion Recognition.
University of Stuttgart, Faculty of Computer Science, Electrical Engineering, and Information Technology, Master Thesis No. 19 (2023).
49 pages, English.
Abstract

Emotion recognition is a critical aspect of human-computer interaction, with numerous applications in fields such as psychology, social robotics, and affective computing. Recent advances in deep learning have significantly improved the performance of emotion recognition models, particularly when integrating multimodal data sources such as audio and video. Despite these advances, the potential of gaze information as an auxiliary modality has remained relatively underexplored, leaving room for further innovation. There is also growing interest in pre-training feature extractors to improve emotion recognition models, and gaze encoders in particular have been understudied in this regard. This thesis presents a novel approach to emotion recognition that incorporates gaze as an additional modality in a multimodal architecture whose audio, video, and gaze feature extractors are pre-trained on VoxCeleb1 [NCZ17], focusing specifically on the Gaze-enhanced L3-Net and AVE-Net architectures.

Beyond exploring gaze as an auxiliary modality and pre-training the feature extractors, it is important to investigate how emotion recognition models perform under single-modality conditions. Real-world applications often struggle to obtain multimodal data due to limited resources, privacy concerns, or environmental constraints. By evaluating the proposed models in single-modality settings, we aim to provide a comprehensive understanding of their applicability and robustness in diverse scenarios, underlining the importance of developing high-performing single-modality models alongside multimodal approaches.

Our results provide compelling evidence that leveraging pre-trained feature extractors with gaze-enhanced visual and audio embeddings leads to substantial performance gains on the OMG Emotion dataset [BCL+18]. The pre-trained Gaze-enhanced L3-Net outperforms both the original L3-Net and the AVE-Net, achieving F1 micro scores of 49.28 and 45.94 for the video and audio channels, respectively. It also surpasses the model-level fusion and early fusion techniques of Abdou et al. [ASMB22] and achieves state-of-the-art performance on the OMG Emotion dataset [BCL+18] in both the video and audio channels. Notably, our architecture alone does not outperform that of Abdou et al. [ASMB22]; the observed improvements can be largely attributed to the pre-training process. Overall, this study highlights the significant potential of incorporating pre-training and gaze information in emotion recognition tasks, paving the way for more accurate and robust models in real-world applications, and we hope it motivates further research into approaches that capitalize on the unique advantages of gaze information and pre-training techniques.
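To make the described setup more concrete, the following is a minimal sketch (in PyTorch) of how pre-trained audio, video, and gaze encoders could be fused at the model level for emotion classification. All class names, dimensions, and the concatenation-based fusion shown here are illustrative assumptions, not the actual Gaze-enhanced L3-Net or AVE-Net variants from the thesis, which are defined in the full text.

    import torch
    import torch.nn as nn

    class MultimodalEmotionClassifier(nn.Module):
        """Hypothetical sketch: fuse pre-trained audio, video, and gaze
        embeddings for emotion classification (names and sizes are
        illustrative, not taken from the thesis)."""

        def __init__(self, audio_encoder, video_encoder, gaze_encoder,
                     embed_dim=512, num_emotions=7):
            super().__init__()
            # Encoders are assumed to be pre-trained (e.g. on VoxCeleb1) and
            # to map each modality to an embedding of size embed_dim.
            self.audio_encoder = audio_encoder
            self.video_encoder = video_encoder
            self.gaze_encoder = gaze_encoder
            # Simple model-level fusion: concatenate the three embeddings,
            # then classify into discrete emotion categories.
            self.classifier = nn.Sequential(
                nn.Linear(3 * embed_dim, embed_dim),
                nn.ReLU(),
                nn.Linear(embed_dim, num_emotions),
            )

        def forward(self, audio, video, gaze):
            # For single-modality evaluation, the unused inputs could be
            # zeroed out or the corresponding branch used on its own.
            z = torch.cat([
                self.audio_encoder(audio),
                self.video_encoder(video),
                self.gaze_encoder(gaze),
            ], dim=-1)
            return self.classifier(z)

Under these assumptions, the reported per-channel F1 micro scores could be computed with sklearn.metrics.f1_score(y_true, y_pred, average="micro") over the predicted emotion labels of the respective channel.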

Department(s): University of Stuttgart, Institute of Visualisation and Interactive Systems, Visualisation and Interactive Systems
Supervisor(s): Bulling, Prof. Andreas; Sood, Ekta; Strohm, Florian
Entry date: September 19, 2023