Master Thesis MSTR-2025-75

Bibliography
Rehm, Tristan: Multi-Type Anomaly Detection and Segmentation in Zero-Shot Learning for Industrial Inspection.
University of Stuttgart, Faculty of Computer Science, Electrical Engineering, and Information Technology, Master Thesis No. 75 (2025).
83 pages, English.
Abstract

The use of deep learning methods in industrial anomaly detection already enables reliable binary classification and segmentation of defective regions. In the zero-shot setting, most existing models reach their limits, as they are unable to distinguish between specific defect types. However, this ability is crucial for targeted root-cause analysis and the resulting effective process optimization. Moreover, many existing approaches are based on the CLIP model, whose limited input resolution and model size can restrict performance on fine-grained visual tasks. Larger, instruction-following Vision Language Models (VLMs) therefore represent a promising foundation to overcome these limitations and further advance industrial anomaly detection.

This work addresses the identified research gap by introducing LLaVA-ADS (LLaVA for Anomaly Detection and Segmentation), a novel framework that extends the large-scale, instruction-following VLM LLaVA for industrial inspection tasks. For this purpose, the model's vocabulary is expanded with two dedicated segmentation tokens, SEG_DEFECT and SEG_NORMAL, which serve as semantic representations of anomalous and normal image regions, respectively. Their latent embeddings are compared with the visual features to generate precise anomaly maps. This approach enables not only the textual classification of defects but also their accurate spatial localization.
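The comparison between the segmentation-token embeddings and the visual features can be illustrated with a minimal sketch. The abstract does not state which similarity measure or normalization LLaVA-ADS uses, so everything below is an assumption for illustration: cosine similarity between each patch feature and the two token embeddings, followed by a two-way softmax whose defect probability serves as the anomaly score. The names `anomaly_map`, `grid_width`, and `temperature`, as well as the toy vectors, are hypothetical and not taken from the thesis.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm > 0 else 0.0


def anomaly_map(patch_features, seg_defect, seg_normal, grid_width, temperature=0.07):
    """Score each patch by comparing it with the two token embeddings.

    Assumption: cosine similarity plus a two-way softmax; the thesis may use
    a different comparison. Returns a list of rows with `grid_width` scores each.
    """
    scores = []
    for feat in patch_features:
        s_def = cosine(feat, seg_defect) / temperature
        s_norm = cosine(feat, seg_normal) / temperature
        # Subtract the maximum before exponentiating for numerical stability.
        m = max(s_def, s_norm)
        e_def = math.exp(s_def - m)
        e_norm = math.exp(s_norm - m)
        scores.append(e_def / (e_def + e_norm))
    # Reshape the flat score list into rows of the patch grid.
    return [scores[i:i + grid_width] for i in range(0, len(scores), grid_width)]


# Toy 2x2 patch grid with 2-dimensional features (hypothetical values).
seg_defect = [1.0, 0.0]   # stand-in for the SEG_DEFECT embedding
seg_normal = [0.0, 1.0]   # stand-in for the SEG_NORMAL embedding
patches = [[0.9, 0.1], [0.1, 0.9], [0.5, 0.5], [0.0, 1.0]]
amap = anomaly_map(patches, seg_defect, seg_normal, grid_width=2)
```

In this toy setup, a patch aligned with the defect embedding receives a score above 0.5, a patch aligned with the normal embedding receives a score below 0.5, and a patch equally similar to both receives exactly 0.5. In a real pipeline the patch-level map would typically be upsampled to the input resolution before thresholding.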

On established industrial benchmarks, LLaVA-ADS achieves competitive and, in some cases, leading results in binary anomaly detection and segmentation. The model demonstrates fundamental capabilities in distinguishing between defect types. However, these are not yet sufficiently robust for practical deployment, particularly in cases involving multiple simultaneous defects.

The evaluation focused on a single VLM architecture, which may limit the transferability of the results to other models. Furthermore, defect classification in LLaVA-ADS relies on textual descriptions, whose quality can affect the robustness of the model. This dependency was not investigated in the scope of this thesis.

Future work should explore strategies for improving defect classification to better enable the differentiated identification of multiple defects within a single image. For instance, this could be achieved through the use of more diverse and comprehensive datasets.

Overall, this thesis demonstrates that large, instruction-following VLMs can be successfully adapted for specialized industrial inspection tasks, revealing substantial potential for future industrial applications.

Department(s): University of Stuttgart, Institute of Artificial Intelligence, Analytic Computing
Supervisor(s): Staab, Prof. Steffen; Niepert, Prof. Mathias; Zhou, Hongkuan
Entry date: December 19, 2025