Bibliography | Knappich, Valentin: Tests4J benchmark: execution-based evaluation of context-aware language models for test case generation. University of Stuttgart, Faculty of Computer Science, Electrical Engineering, and Information Technology, Master Thesis No. 35 (2023). 74 pages, English.
|
Abstract | Testing is a critical part of the software engineering process. The cost of writing and maintaining test suites has motivated the research community to develop approaches that automatically generate test cases for a given software system. EvoSuite is one of the most established tools for Java and has been shown to achieve high coverage. However, its test cases lack readability, motivating the application of language models (LMs) of code to the task. Evaluating such neural test case generators based on their execution requires substantial effort to set up evaluation projects and to obtain metrics such as coverage from them. Consequently, most prior work on Java test case generation has either evaluated models on a small number of selected methods under test (MUTs) or used Defects4J as an evaluation benchmark. However, small benchmarks suffer from high variance, and many projects in Defects4J have been used to pre-train LMs of code. To address these shortcomings, we introduce Tests4J, a novel benchmark for neural and non-neural test case generators. Tests4J contains 12k test cases from 60 Java projects, of which 41 are used for training and 19 for evaluation. For every project, it includes the complete repository, enabling execution-based evaluation and open-ended experimentation with project-specific context information. With a single command, Tests4J allows researchers to obtain execution-based metrics such as coverage as well as intrinsic metrics such as loss, BLEU, and CrystalBLEU. Using Tests4J, we train and evaluate several test case generation models based on PolyCoder with 400M parameters. We compare EvoSuite to our best neural model and find that their individual test cases achieve similar coverage. However, EvoSuite generates 3 times as many test cases, covering about 3 times as many lines in total. We furthermore find that EvoSuite fails to generate any test cases for 4 out of 11 projects in the test set. This highlights a fundamental advantage of LMs: they do not need to integrate with the project and thus do not suffer from dependency conflicts. Next, we evaluate prefix tuning as a training method and find a significant gap to full fine-tuning. We further investigate the importance of project-specific context information and create simplified representations of the focal class and the test class. We find that adding this context information increases the achieved coverage by more than 4x, and that the focal class and test class contexts are highly complementary. Motivated by this finding, as well as by the hard token limit and quadratic complexity of transformers, we propose Context-Aware Prompt Tuning (CAPT). In CAPT, context information is first compressed into embeddings and then injected into the LM as soft tokens, similar to prefix tuning. We find that the method does not yield significant improvements over the baseline, but we outline directions for future research. Lastly, we find that loss is not an ideal indicator of coverage and that there is high variance in coverage across projects, and thus advocate for large-scale execution-based evaluations.
|
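The abstract describes CAPT only at a high level: project-specific context is compressed into a fixed number of embeddings and injected into a frozen LM as soft tokens, analogous to prefix tuning. Below is a minimal illustrative sketch of that idea, not the thesis's actual implementation; the checkpoint id, the small pooling-based context encoder, and the number of soft tokens are assumptions made for the example.

import torch
import torch.nn as nn
from transformers import AutoModelForCausalLM


class CAPTModel(nn.Module):
    """Sketch of Context-Aware Prompt Tuning: context -> soft tokens -> frozen LM."""

    def __init__(self, lm, num_soft_tokens=16):
        super().__init__()
        self.lm = lm
        for p in self.lm.parameters():  # freeze the base LM, as in prefix tuning
            p.requires_grad = False
        hidden = lm.get_input_embeddings().embedding_dim
        # Tiny context encoder (assumed architecture): embed context tokens and
        # pool them into `num_soft_tokens` soft prompt vectors.
        self.context_embed = nn.Embedding(lm.config.vocab_size, hidden)
        self.pool = nn.AdaptiveAvgPool1d(num_soft_tokens)
        self.proj = nn.Linear(hidden, hidden)

    def forward(self, context_ids, input_ids, labels=None):
        ctx = self.context_embed(context_ids)                   # (B, Lc, H)
        soft = self.pool(ctx.transpose(1, 2)).transpose(1, 2)   # (B, K, H)
        soft = self.proj(soft)
        tok = self.lm.get_input_embeddings()(input_ids)         # (B, Lt, H)
        inputs_embeds = torch.cat([soft, tok], dim=1)
        if labels is not None:
            # Mask the soft-token positions so they do not contribute to the loss.
            pad = torch.full(soft.shape[:2], -100,
                             dtype=labels.dtype, device=labels.device)
            labels = torch.cat([pad, labels], dim=1)
        return self.lm(inputs_embeds=inputs_embeds, labels=labels)


# Example wiring (checkpoint id is an assumption, not confirmed by the thesis):
# lm = AutoModelForCausalLM.from_pretrained("NinedayWang/PolyCoder-0.4B")
# model = CAPTModel(lm, num_soft_tokens=16)

Only the context encoder and projection are trained here; the base LM stays frozen, which mirrors the parameter-efficient setup that prefix tuning uses and that the abstract contrasts with full fine-tuning.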