Master Thesis MSTR-2023-31

Bibliography: Hauser, Marius: Machine Learning frameworks in open-source software: an exploratory study on code and project smells.
University of Stuttgart, Faculty of Computer Science, Electrical Engineering, and Information Technology, Master Thesis No. 31 (2023).
55 pages, English.
Abstract

Machine Learning (ML) has gained an increasing amount of interest in recent years. The widespread use of ML systems raises questions regarding technical debt and the use of good software engineering practices. ML systems are complex systems that face additional challenges compared to traditional software systems. These challenges arise, among others, in the areas of dependency management and data versioning, and can lead to an increased susceptibility to technical debt. To address these challenges, this study investigates how the choice of an ML library is associated with the types and frequency of code and project smells. The study additionally determines the distribution of application areas in open-source projects that use ML libraries. Repository mining is performed, followed by a large-scale analysis of code and project smells using SonarQube and mllint. SonarQube is used to find code smells in Python source code, and project smells are tracked using a score calculated by mllint. A lower mllint score corresponds to more project smells. All mined repositories are categorised into domain categories using an automated classifier and into exploratively identified topics using Latent Dirichlet Allocation. This study analyses 6,840 open-source software repositories which use the ML libraries "TensorFlow", "Scikit-learn", "Transformers", "Keras", "PyTorch", and "Keras & TensorFlow". Violations of naming conventions, commented-out code sections, and high cognitive complexity of source code sections are the most prominent code smells found among all repositories. Several correlations between projects using ML libraries and the frequency of code and project smells have been found. Statistical analysis revealed that using TensorFlow as the ML library is associated with ~12.6% more code smells per 1,000 LoC than using Transformers.
Using Scikit-learn is correlated with a ~25% higher mllint score compared to TensorFlow and a ~44% higher mllint score than using PyTorch. Regarding project smells, this study revealed that using Transformers is correlated with a ~13% higher average mllint score than using TensorFlow. Application areas of ML libraries are for the most part in line with the areas advertised by their publishers. This is due, among other reasons, to the general applicability of these libraries as stated by their publishers. Based on this study, future work can investigate the causal relationships behind the higher frequency of code and project smells associated with some ML libraries compared to others. Furthermore, an investigation of practitioners, maintainers, and developers working with ML libraries could reveal differences in the types of users and help formulate best practices to reduce code and project smells.

Department(s): University of Stuttgart, Institute of Software Technology, Empirical Software Engineering
Supervisor(s): Wagner, Prof. Stefan; Bogner, Dr. Justus
Entry date: September 19, 2023
   Publ. Computer Science