| Bibliography | Munshi, Ankita: Harnessing the Power of Large Language Models for Data Quality Applications: Challenges and Solutions. University of Stuttgart, Faculty of Computer Science, Electrical Engineering, and Information Technology, Master Thesis No. 60 (2025). 99 pages, English.
|
| Abstract | The traditional approach of regulatory compliance for data quality-related aspects is manual and time-intensive. Initially, it requires the domain experts to go through the regulatory documents and manually identify the sections that are relevant for data quality. Then, the relevant sections are used to derive the data quality rules to comply with the regulations. This filtering and rules generation process is time-consuming. Moreover, given the dynamic nature of regulations, it is often a repetitive task as continued compliance must be ensured. Currently, there is an absence of a tool that can automate it to make the process efficient. In this master’s thesis, we propose a solution to this problem. We leverage the advancement in the field of Artificial Intelligence (AI) in the form of Large Language Models (LLMs) for the data quality domain. We use LLMs to create a tool capable of generating data quality rules. We first conduct an exhaustive review of LLMs with other augmentation techniques such as Retrieval Augmented Generation (RAG). Following this, we extract the requirements for the development of such a tool. Then we map these requirements to their technical value. In this thesis, we present our contributions to the development of the data quality rules generation tool. We discuss the various components of the tool. These include components for extracting relevant text, identifying data quality dimensions, and finally generating the data quality rules. For this, first, we explore and implement a RAG-based tool for the task. However, after initial tool evaluation, we found some limitations of the tool in performing the data quality rules generation task. Therefore, we make some modifications and instead create a LangGraph-based tool. This tool has multiple nodes (acting as modules performing specific tasks).
We design each module to perform a specific task, such as the identification of data quality dimensions and the generation of data quality rules. Then, we use LangGraph to orchestrate the nodes we create and generate the final response. Finally, we experiment with the tool using different LLMs and evaluate their performance. We consider the end-to-end latency and a manual evaluation of the final responses generated. In this thesis, we systematically design and implement a solution that helps automate the data quality rules generation process. We leverage the capabilities of LLMs to reduce the time and effort and deliver a cost-effective solution. In addition, this solution contributes to research and advancements by the integration of AI in the field of data quality management and regulatory compliance.
|
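The node-based design described in the abstract (extract relevant text, identify data quality dimensions, generate rules) can be illustrated with a minimal sketch. This is not the thesis's implementation and does not use the LangGraph API: the nodes are plain Python functions run in sequence over a shared state dictionary, and simple keyword matching stands in for the LLM calls. All names here (`DIMENSION_KEYWORDS`, `run_pipeline`, the node functions, the example dimensions) are hypothetical and chosen only for illustration.

```python
from typing import Any, Callable, Dict, List

State = Dict[str, Any]

# Assumption: a keyword map stands in for an LLM-based classifier.
DIMENSION_KEYWORDS: Dict[str, List[str]] = {
    "completeness": ["must be provided", "mandatory", "missing"],
    "timeliness": ["within", "no later than", "deadline"],
    "accuracy": ["accurate", "correct", "verified"],
}


def extract_relevant_text(state: State) -> State:
    """Node 1: keep only sentences that mention any data quality keyword."""
    sentences = [s.strip() for s in state["document"].split(".") if s.strip()]
    keywords = [k for kws in DIMENSION_KEYWORDS.values() for k in kws]
    relevant = [s for s in sentences if any(k in s.lower() for k in keywords)]
    return {**state, "relevant": relevant}


def identify_dimensions(state: State) -> State:
    """Node 2: tag each relevant sentence with matching dimensions."""
    tagged = []
    for sentence in state["relevant"]:
        dims = [
            dim
            for dim, kws in DIMENSION_KEYWORDS.items()
            if any(k in sentence.lower() for k in kws)
        ]
        tagged.append((sentence, dims))
    return {**state, "tagged": tagged}


def generate_rules(state: State) -> State:
    """Node 3: emit one placeholder rule per (sentence, dimension) pair."""
    rules = [
        f"[{dim}] Check: {sentence}"
        for sentence, dims in state["tagged"]
        for dim in dims
    ]
    return {**state, "rules": rules}


def run_pipeline(document: str, nodes: List[Callable[[State], State]]) -> State:
    """Run the nodes in order, threading the shared state through each one."""
    state: State = {"document": document}
    for node in nodes:
        state = node(state)
    return state


NODES = [extract_relevant_text, identify_dimensions, generate_rules]
```

Calling `run_pipeline(text, NODES)` on a regulatory excerpt returns a state whose `"rules"` entry lists the generated placeholder rules. In a graph-based orchestrator, each function would be registered as a node and the ordering expressed as edges; the sequential loop here is only the simplest equivalent.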