Bibliography | Balbach, Daniel: A framework for optimizing spark configurations. University of Stuttgart, Faculty of Computer Science, Electrical Engineering, and Information Technology, Master Thesis No. 49 (2022). 90 pages, English.
|
Abstract | The rising importance of data in modern life, industry, and society has created great interest in processing it. Data-driven approaches are now ubiquitous. The need to process ever larger amounts of data has led to the development of complex, distributed, and scalable processing frameworks. One such framework is Apache Spark. It offers a rich set of functionalities, including classic SQL analytics, machine learning, graph processing, and many more. However, this broad range of functionalities can lead to problems. One of them is that, because different Spark applications have different resource requirements, the standard configuration of the Spark cluster may not be well suited to a given application. A suboptimal configuration can lead to higher execution times or lower cluster throughput. Higher execution times can in turn lead to higher costs in environments where the execution time is directly coupled to the billed costs, such as a cloud environment. Besides the financial aspect, a better-configured Spark application may also make better use of the provided resources and reduce the execution time, thus increasing throughput. This work addresses this problem by designing and implementing a framework for optimizing the Spark configuration of a given Spark application. The optimization framework is then applied in a case study on two exemplary use cases, using a Spark cluster in a Databricks environment of a private cloud, to demonstrate its practicability. The results show that the framework can optimize Spark configurations in general while requiring only minimal effort from the user. However, outperforming the standard Spark configuration in the exemplary use cases proves to be challenging, especially because of the runtime variance observed in the cloud environment. Distinguishing statistical variance from real improvement is difficult.
|