Bachelor Thesis BCLR-2020-64

Bibliography: Gröninger, Lars: Building an Extensible Dataset of Code Reviews.
University of Stuttgart, Faculty of Computer Science, Electrical Engineering, and Information Technology, Bachelor Thesis No. 64 (2020).
59 pages, English.
Abstract

Code review is an integral part of modern software development. Performing code review means that every code change, before being committed to a shared repository, is reviewed by at least one developer other than the author of the change. Code review brings many benefits, such as defect finding, overall improvement of code quality, and knowledge transfer. Nowadays, code review is mostly tool-assisted and asynchronous. Nonetheless, most of the tasks involved are still done manually, which makes code review time-consuming. The process can therefore be streamlined further by automating some of those tasks. One possible direction, automating the provision of feedback for given source code changes, is called the comment prediction task. Another direction, called the code update prediction task, is to predict code updates based on comments from reviewers. However, solving these two tasks requires a large, low-noise dataset of code reviews that contains data about source code changes and comments. In this thesis, we develop a tool for gathering code review data from real-world open-source projects. Using this tool, we perform large-scale data gathering and collect code review data from eight different professionally developed projects. As a result, we obtain a dataset containing over 200,000 code reviews, including every code change performed during the code reviews and every comment written in them. To the best of our knowledge, this is the largest and most diverse code review dataset to date. In addition, the dataset is extensible, meaning that more data can be collected easily. Our dataset could, among other things, be used to address the previously mentioned prediction tasks by serving as training data for machine learning models, thus making the work of this thesis the first step towards improving the current process of performing code reviews.

Department(s): University of Stuttgart, Institute of Software Technology, Software Lab - Program Analysis
Supervisor(s): Pradel, Prof. Michael; Habib, Andrew
Entry date: January 18, 2021