Bibliography | Burk, Felix: A dynamic analysis-based linter for Python. University of Stuttgart, Faculty of Computer Science, Electrical Engineering, and Information Technology, Master Thesis No. 34 (2023). 70 pages, English.
|
Abstract | The rising popularity of the Python language yields a wide range of programs spanning a large assortment of domains. Common domains include data science, machine learning, web development, and database and networking applications. To implement such applications properly, developers rely on a range of analysis tools that assist them with correctness, performance, and code quality. Such tools have been proposed both in academia and in industry. However, general-purpose analyses of Python programs often rely on static analysis, which is inherently limited when analysing dynamic languages such as Python. This work presents an approach to analyse Python programs during their execution. As a result, our analyses are able to infer additional information, such as the exact control flow and the precise value and type of each allocated object. We directly address the dynamic nature of the Python language by applying dynamic analysis concepts to realise a general-purpose linter. We present a total of 22 rules to analyse general Python programs, which address correctness, performance, and code quality. Moreover, we supplement those rules with 10 additional machine-learning-specific rules, which analyse the usage of popular machine learning libraries within Python’s ecosystem, such as scikit-learn and TensorFlow. Additionally, this work includes a novel approach to track and annotate objects observed during the execution, facilitating the future development of dynamic analysis tools. We evaluate our prototype by analysing submissions to the Kaggle platform, which include a wide range of programs related to data science and machine learning. Our findings include at least 4 violations with high risk regarding correctness. Additionally, we evaluate the general-purpose rules on 11 real-world GitHub repositories from various domains. Our results include 8 medium-risk rule violations and 1 rule violation with high risk regarding correctness. 
Most rules imposed by our prototype report few false positives. Furthermore, the overhead is at most a 7x increase, which we regard as acceptable for practical use. Overall, the results show that dynamic analysis is promising for both general-purpose and machine learning applications within Python’s ecosystem.
|