Master Thesis MSTR-2017-75

Bibliography
Szilagyi, Alexander: Analysis and representation of dataflows and performance influences in NoSQL systems.
University of Stuttgart, Faculty of Computer Science, Electrical Engineering, and Information Technology, Master Thesis No. 75 (2017).
97 pages, English.
Abstract

Digitalization is spreading and is gaining more and more attention in industry. Trends like Industry 4.0 result in an enormous growth of data. This abundance of data opens up new opportunities that could lead to a market advantage over competitors. To deal with this huge pool of information, also referred to as “Big Data”, companies face new challenges. One challenge lies in the fact that ordinary relational database solutions are unable to cope with the enormous amount of data, so a system must be provided that is able to store and process the data in an acceptable time. For this reason, a well-known company in the automotive sector established a distributed NoSQL system based on Hadoop. However, the processes and interactions behind such a Hadoop system are more complex than those of a relational database solution, so wrong configurations lead to inefficient operation. This thesis therefore addresses the questions of how the interactions between the applications are realized and how possible performance deficits can be prevented. This work also presents a new approach to dataflow modeling, because an ordinary dataflow meta description does not have enough syntax elements to model the entire NoSQL system in detail. Using different use cases, the dataflows present the interactions within the Hadoop system, which provide first assessments of performance dependencies. Characteristics such as file format, compression codec, and execution engine were identified and then combined into different configuration sets. With these sets, randomized tests were executed to measure their influence on the execution time. The results show that a combination of the ORC file format with Zlib compression leads to the best performance with regard to execution time and disk utilization of the system implemented at the company. But the results also show that configurations of the system can be very dependent on the actual use case. Besides the performance measurements, the thesis provides additional insights. For example, it was observed that an increase in file size leads to a growth of the variance. Overall, the study conducted in this work leads to the assumption that performance tests should be investigated repeatedly.

Department(s): University of Stuttgart, Institute of Software Technology, Software Engineering
Supervisor(s): Wagner, Prof. Stefan; Graziotin, Dr. Daniel; Heß, Christian
Entry date: May 29, 2019