Master Thesis MSTR-2025-65

Bibliography
Baba, Malek: Resource-Efficient Deep Learning for Real-Time Recognition of Worker Activities During Industrial Assembly.
University of Stuttgart, Faculty of Computer Science, Electrical Engineering, and Information Technology, Master Thesis No. 65 (2025).
68 pages, English.
Abstract

Advances in deep learning have enabled significant progress in human activity recognition (HAR), particularly in industrial assembly environments where understanding fine-grained human-object interactions can enhance productivity, safety, and human-robot collaboration. Despite their success, state-of-the-art video recognition models remain computationally intensive, posing major barriers to deployment in resource-constrained settings such as embedded systems or edge devices. This thesis addresses the challenge of designing and deploying resource-efficient deep learning models capable of real-time activity recognition during industrial assembly tasks, using the MECCANO dataset as the primary benchmark.
The MECCANO dataset is a multimodal egocentric video benchmark that captures realistic worker interactions with tools and components across complex assembly procedures. It presents 61 fine-grained action classes and supports multiple tasks including action recognition, anticipation, and human-object interaction detection. Given its long-tailed class distribution and high intra-class variability, MECCANO serves as a challenging and realistic dataset for evaluating activity recognition models in constrained environments.
To tackle the problem of computational inefficiency, we begin by benchmarking several lightweight 3D convolutional neural networks, including I3D, 3D-MobileNet, X3D-L, and X3D-M. Based on an analysis of accuracy, model size, FLOPs, and inference speed, X3D-M is selected as the optimal baseline model due to its balance between performance and resource requirements. To further compress the model, we implement structured pruning techniques using L1-norm-based filter selection. We evaluate local, global, and isomorphic structured pruning strategies on the X3D-M architecture. Among them, isomorphic pruning emerges as the most effective, offering significant reductions in model size and computational cost while retaining structural consistency.
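The difference between local and global L1-norm filter selection can be illustrated with a minimal sketch. It is not the thesis implementation: the function names, the dictionary-of-filters representation, and the toy weights are assumptions made for illustration only, and isomorphic pruning (which additionally groups structurally identical sub-networks before ranking) is not covered here.

```python
from typing import Dict, List

Filter = List[float]  # flattened weights of one convolutional filter (toy representation)


def l1_norm(filt: Filter) -> float:
    """Importance score of a filter: the sum of absolute weights."""
    return sum(abs(w) for w in filt)


def prune_local(layers: Dict[str, List[Filter]], ratio: float) -> Dict[str, List[int]]:
    """Local strategy: drop the lowest-scoring `ratio` of filters in every layer.

    Returns the indices of the filters that are kept, per layer.
    """
    kept = {}
    for name, filters in layers.items():
        n_remove = int(len(filters) * ratio)
        order = sorted(range(len(filters)), key=lambda i: l1_norm(filters[i]))
        kept[name] = sorted(order[n_remove:])
    return kept


def prune_global(layers: Dict[str, List[Filter]], ratio: float) -> Dict[str, List[int]]:
    """Global strategy: rank all filters of all layers together, drop the lowest `ratio`.

    Note: without extra safeguards this can remove very different numbers of
    filters from different layers (here: possibly all filters of one layer).
    """
    scored = [
        (l1_norm(f), name, i)
        for name, filters in layers.items()
        for i, f in enumerate(filters)
    ]
    scored.sort(key=lambda t: t[0])
    n_remove = int(len(scored) * ratio)
    removed = {(name, i) for _, name, i in scored[:n_remove]}
    return {
        name: [i for i in range(len(filters)) if (name, i) not in removed]
        for name, filters in layers.items()
    }


# Hypothetical toy network with two layers of four filters each.
toy_layers = {
    "conv1": [[0.1, -0.1], [1.0, 2.0], [0.5, 0.5], [3.0, -3.0]],
    "conv2": [[0.01, 0.01], [0.03, 0.0], [5.0, 5.0], [4.0, 4.0]],
}
local_kept = prune_local(toy_layers, 0.25)
global_kept = prune_global(toy_layers, 0.25)
```

In this toy case the local strategy removes one filter from each layer, whereas the global strategy removes both low-norm filters from `conv2` and leaves `conv1` untouched, which shows why the two strategies can produce differently shaped networks at the same pruning ratio.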
We extend the pruning experiments by varying the pruning ratio to study the trade-offs between compression and accuracy. These experiments demonstrate that the pruned model retains competitive performance even with over 40% parameter reduction. To mitigate the accuracy degradation caused by pruning, we integrate knowledge distillation (KD) as a fine-tuning step. Using the high-capacity SlowFast model as the teacher, we train the pruned X3D-M model to match the softened output distributions of the teacher, alongside standard supervised learning objectives. Experimental results confirm that KD significantly improves the performance of pruned models, enabling the final Pruned + KD X3D-M model to surpass the original uncompressed baseline, achieving 39.46% Top-1 accuracy and 68.72% Top-5 accuracy on the MECCANO test set.
To validate the practical deployment of our approach, we conduct real-time inference experiments on a constrained CPU-only device (Intel® Core™ i5-1035G1). The optimized model processes a 2-minute test video in just 36.2 seconds, more than twice as fast as the original SlowFast baseline, demonstrating its feasibility for real-world industrial applications.
Finally, we present a unified optimization pipeline that combines training, structured pruning, iterative benchmarking, and knowledge distillation into a single deployable framework. This framework allows for the systematic exploration of accuracy–efficiency trade-offs and supports scalable model deployment on edge platforms.
In conclusion, this thesis contributes a comprehensive methodology for resource-efficient video-based activity recognition in industrial settings. By leveraging lightweight models, structured pruning, and knowledge distillation, we demonstrate that it is possible to maintain high recognition performance while significantly reducing the computational burden.
These findings open new opportunities for real-time, on-device human activity monitoring in smart manufacturing systems, and provide a foundation for future research on efficient deep learning for complex real-world environments.
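The knowledge distillation objective mentioned in the abstract (matching the teacher's softened outputs alongside a supervised loss) can be sketched as follows. The abstract does not state the exact formulation, so this assumes the common temperature-scaled KL-divergence form; the function names, the temperature of 4.0, and the weighting `alpha` are illustrative assumptions, not values reported in the thesis.

```python
import math
from typing import List


def softmax(logits: List[float], temperature: float = 1.0) -> List[float]:
    """Numerically stable softmax; higher temperatures give softer distributions."""
    scaled = [z / temperature for z in logits]
    peak = max(scaled)
    exps = [math.exp(z - peak) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def kd_loss(
    student_logits: List[float],
    teacher_logits: List[float],
    label: int,
    temperature: float = 4.0,  # assumed value, not taken from the thesis
    alpha: float = 0.5,        # assumed weighting between the two terms
) -> float:
    """Weighted sum of a distillation term and a supervised cross-entropy term."""
    p_teacher = softmax(teacher_logits, temperature)
    p_student = softmax(student_logits, temperature)
    # KL(teacher || student) on the softened distributions.
    kl = sum(
        pt * math.log(pt / ps)
        for pt, ps in zip(p_teacher, p_student)
        if pt > 0.0
    )
    # Standard cross-entropy of the student against the ground-truth label.
    ce = -math.log(softmax(student_logits)[label])
    # The temperature**2 factor keeps the distillation term's gradient scale
    # comparable across temperatures in this formulation.
    return alpha * temperature ** 2 * kl + (1.0 - alpha) * ce


# Hypothetical logits for a 3-class toy problem.
example_loss = kd_loss([2.0, 0.5, -1.0], [3.0, 0.0, -2.0], label=0)
```

With `alpha=1.0` the loss reduces to the pure distillation term (zero when student and teacher logits coincide), and with `alpha=0.0` it reduces to ordinary cross-entropy, so `alpha` controls how strongly the pruned student follows the teacher.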

Department(s): University of Stuttgart, Institute of Visualisation and Interactive Systems, Visualisation and Interactive Systems
Supervisor(s): Roitberg, Jun.-Prof. Alina; Bruhn, Prof. Andrés; Thiyakesan Ponbagavathi, Thinesh
Entry date: December 19, 2025