Master's thesis MSTR-2025-19

Bibliographic data
Majumdar, Souptik Kumar: Enhancing Computational Efficiency in Vision Tasks using Mixture of Depths and Experts (MODE) models.
Universität Stuttgart, Fakultät Informatik, Elektrotechnik und Informationstechnik, Master's thesis No. 19 (2025).
87 pages, English.
Abstract

Advancements in deep learning have enabled Vision Transformers (ViTs) to achieve state-of-the-art performance on a range of tasks, but at the cost of significantly increased computational demands. Recent studies have shown that not all tokens processed by a transformer are equally important. In this thesis, we propose a novel Mixture-of-Depths (MoD) framework that dynamically allocates computation by processing only a subset of tokens in each transformer block. Unlike conventional models that treat all tokens uniformly, MoD leverages token-specific routing to skip unnecessary computation, thereby reducing FLOPs without sacrificing accuracy. A key innovation in this work is an attention-based routing mechanism, termed Attention Mixture-of-Depths (A-MoD). Instead of relying on additional trainable router networks, A-MoD harnesses the attention maps already computed by preceding layers to estimate token importance. By aggregating these attention scores, the method selectively routes the most semantically informative tokens through the computationally expensive operations, resulting in faster convergence and higher accuracy. Furthermore, recognizing that fixed per-layer capacities may be suboptimal, we integrate a Reinforcement Learning (RL) based capacity search that optimizes each MoD layer's capacity: a learned policy predicts the fraction of tokens to process per layer, with rewards that balance validation accuracy, computational cost, and training stability. This dynamic capacity optimization allows the model to allocate resources adaptively while keeping overall computation within a predefined target. Empirical evaluations on large-scale benchmarks such as ImageNet-1k, as well as transfer-learning experiments on smaller, fine-grained datasets, demonstrate that the integrated approach not only accelerates training convergence but also improves accuracy and efficiency relative to conventional routing mechanisms and fixed-capacity baselines. These findings highlight the potential of the proposed MoD framework, combined with A-MoD and RL-based capacity optimization, for deploying high-capacity Vision Transformers in resource-constrained environments such as embedded systems and mobile devices.
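
The following PyTorch-style sketch is a minimal illustration of the attention-based routing idea described above, not the thesis's implementation: per-token importance is estimated by averaging the preceding layer's attention map over heads and queries, and only the top-capacity fraction of tokens passes through the expensive block, while the rest skips it on the residual path. The function names, the aggregation choice, and the scatter-based merge are assumptions made for illustration.

    import torch

    def amod_route(x, attn, capacity):
        # x:        (batch, tokens, dim) token embeddings entering a block
        # attn:     (batch, heads, tokens, tokens) attention map from the
        #           preceding layer (assumed available, as in A-MoD)
        # capacity: fraction of tokens this block processes (0 < capacity <= 1)
        b, n, d = x.shape
        k = max(1, int(capacity * n))
        # Importance of token j: the attention it receives as a key,
        # averaged over heads and queries (one plausible aggregation).
        scores = attn.mean(dim=1).mean(dim=1)          # (batch, tokens)
        idx = scores.topk(k, dim=-1).indices           # (batch, k)
        selected = torch.gather(x, 1, idx.unsqueeze(-1).expand(b, k, d))
        return selected, idx

    def mod_block(x, attn, block, capacity):
        # Route only the selected tokens through the expensive block;
        # all other tokens keep their input values (residual skip).
        selected, idx = amod_route(x, attn, capacity)
        processed = block(selected)                    # (batch, k, dim)
        out = x.clone()
        out.scatter_(1, idx.unsqueeze(-1).expand_as(processed), processed)
        return out

Because the scores reuse attention maps the transformer computes anyway, no extra router parameters are needed, which is the property the abstract credits for A-MoD's improved convergence over trainable routers.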
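A similarly hedged sketch of the RL-based capacity search: a policy proposes per-layer capacity fractions, the model is evaluated, and the reward trades validation accuracy against a compute budget and a stability term, as stated in the abstract. The functional form and the weights lam and mu below are illustrative assumptions, not the thesis's exact formulation.

    def capacity_reward(val_acc, flops, target_flops, prev_val_acc,
                        lam=1.0, mu=0.1):
        # val_acc:      validation accuracy under the proposed capacities
        # flops:        resulting compute cost; target_flops: the budget
        # prev_val_acc: accuracy of the previous iterate (stability term)
        # lam, mu:      assumed trade-off weights
        over_budget = max(0.0, flops / target_flops - 1.0)  # penalize overshoot only
        instability = abs(val_acc - prev_val_acc)           # discourage oscillation
        return val_acc - lam * over_budget - mu * instability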

Department(s): Universität Stuttgart, Institut für Künstliche Intelligenz, Maschinelles Lernen in den Simulationswissenschaften
Supervisors: Niepert, Prof. Mathias; Staab, Prof. Steffen; Schott, Dr. Lukas
Submission date: August 13, 2025