| Kurzfassung | Recognizing actions in egocentric videos is an increasingly important topic due to the continuing rise of wearable augmented reality devices. Most current methods focus on single-modality approaches, such as applying vision models to RGB images. However, these approaches often lack relational information, which can be critical for understanding action scenes. More recent approaches incorporate multi-modal data, such as audio or gaze, but typically leverage all modalities to predict the action classes directly. This work takes a different approach: instead of predicting actions as a whole, we split the task into sub-tasks by separately predicting verbs and nouns. Our method selectively employs modalities in the contexts where they are most effective. To this end, we propose a hierarchical multi-modal action recognition model that combines diverse visual modalities, including hand-object interactions, gaze data, scene semantics, motion dynamics, and RGB images. The model incorporates transformer-based graph and vision models to effectively integrate visual and relational information. This design allows the model to capture the distinct contribution of each modality to identifying the correct action. While the proposed model did not achieve state-of-the-art accuracy, experimental results demonstrate its effectiveness in integrating multi-modal information hierarchically and its potential for improving action recognition. This research highlights the promise of graph-based architectures in multi-modal learning and lays a foundation for more holistic modality integration and more efficient action recognition systems.
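
To illustrate the verb/noun decomposition described above, the following minimal sketch shows a hierarchical prediction head that classifies verbs and nouns separately from fused multi-modal features. All module names, feature dimensions, class counts, and the simple mean-fusion step are illustrative assumptions, not the thesis implementation, which instead relies on transformer-based graph and vision models.

```python
# Minimal sketch (illustrative only): separate verb and noun heads on top of
# fused multi-modal features. Dimensions and class counts are placeholders.
import torch
import torch.nn as nn


class HierarchicalVerbNounHead(nn.Module):
    def __init__(self, dims: dict, hidden: int = 256,
                 num_verbs: int = 97, num_nouns: int = 300):
        super().__init__()
        # One projection per modality (e.g. "rgb", "gaze", "hands", "flow").
        self.proj = nn.ModuleDict({m: nn.Linear(d, hidden) for m, d in dims.items()})
        # Separate classifiers for the two sub-tasks (verbs and nouns).
        self.verb_head = nn.Linear(hidden, num_verbs)
        self.noun_head = nn.Linear(hidden, num_nouns)

    def forward(self, feats: dict):
        # Simple mean fusion of the projected modality features; the actual
        # model integrates modalities with graph/vision transformers instead.
        fused = torch.stack([self.proj[m](x) for m, x in feats.items()]).mean(0)
        return self.verb_head(fused), self.noun_head(fused)


if __name__ == "__main__":
    head = HierarchicalVerbNounHead({"rgb": 768, "gaze": 64, "hands": 128})
    feats = {"rgb": torch.randn(2, 768), "gaze": torch.randn(2, 64),
             "hands": torch.randn(2, 128)}
    verb_logits, noun_logits = head(feats)
    print(verb_logits.shape, noun_logits.shape)  # (2, 97) (2, 300)
```

Predicting verbs and nouns with separate heads lets each sub-task weight the modalities that matter most for it, which is the intuition behind the hierarchical design summarized above.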
|