Bibliography | Sarangi, Sunayana: Optimizing the efficiency of data-intensive Data Mashups using Map-Reduce. University of Stuttgart, Faculty of Computer Science, Electrical Engineering, and Information Technology, Master Thesis No. 89 (2017). 63 pages, English.
|
Abstract | In order to derive knowledge and information from data through data processing, data integration and data analysis, a variety of Data Mashup tools have been developed in the past. Data Mashups are pipelines that process and integrate data based on different interconnected operators that realize data operations such as filter, join, extraction, alteration or integration. The overall goal is to integrate data from different sources into a single data set. Most of these Mashup tools offer a graphical modeling platform that enables users to model the data sources, data operations and the data flow, thereby creating a so-called Mashup Plan. This enables non-IT experts to perform data operations without having to deal with their technical details. Furthermore, by allowing easy re-modeling and re-execution of the Mashup Plan, it supports an iterative and explorative trial-and-error integration that provides real-time insights into the data. Existing Data Mashup tools are efficient when executing small data sets; however, they do not emphasize the run-time efficiency of the data operations. This work is motivated by the limitations of current Data Mashup approaches with regard to data-intensive operations. The run-time of a data operation varies considerably with the size of the input data. Hence, in scenarios where one data operation expects inputs from multiple Data Mashup pipelines that are executed in parallel, a data-intensive operation in one of the pipelines becomes a bottleneck and delays the entire process. The efficiency of such scenarios can be greatly improved by executing the data-intensive operations in a distributed manner. This master's thesis addresses this issue through an efficiency optimization of pipeline operators based on Map-Reduce. The Map-Reduce approach enables distributed processing of data to improve the run-time. Map-Reduce is divided into two main steps: (i) the Map step divides a data set into multiple smaller data sets, on which the data operations can be applied in parallel, and (ii) the Reduce step aggregates the results into one data set. The goal of this thesis is to enable dynamic decision making when selecting suitable implementations for the data operations. This mechanism should be able to decide dynamically which pipeline operators should be processed in a distributed manner, for example using a Map-Reduce implementation, and which operators should be processed by existing technologies, such as in-memory processing by Web Services. This decision is important because Map-Reduce itself can introduce significant overhead when processing small data sets. Once it is decided that an operation should be processed using Map-Reduce, corresponding Map-Reduce jobs are invoked to process the data. This dynamic decision making can be achieved through WS-Policies. Web Services use policies to declare, in a consistent and standardized manner, what they are capable of supporting and which constraints and requirements they impose on their potential requestors. By comparing the capabilities of the Web Service with the requirements of the service requestor, it can be decided whether the implementation is suitable for executing the data operation.
|
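The two Map-Reduce steps summarized in the abstract can be illustrated by the following minimal sketch, assuming a simple in-process simulation of the idea: the input is split into partitions that are processed in parallel (Map), and the partial results are aggregated into one result (Reduce). All class and method names here are hypothetical and do not refer to the thesis implementation.

```java
import java.util.List;
import java.util.function.Function;
import java.util.stream.Collectors;

// Minimal sketch of the Map-Reduce idea: partitions are processed in parallel,
// then the partial results are merged into a single data set.
public class MapReduceSketch {

    // Map step: apply the data operation to every record of every partition in parallel.
    // Reduce step: concatenate the partial results into one list.
    static <T, R> List<R> mapReduce(List<List<T>> partitions, Function<T, R> operation) {
        return partitions.parallelStream()
                .map(p -> p.stream().map(operation).collect(Collectors.toList()))
                .flatMap(List::stream)
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        List<List<String>> partitions = List.of(
                List.of("alice", "bob"),
                List.of("carol", "dave"));
        // Example data operation: a simple per-record transformation.
        System.out.println(mapReduce(partitions, String::toUpperCase));
    }
}
```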
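The dynamic selection between a distributed and an in-memory implementation could, for instance, be sketched as a size-based dispatch, reflecting the observation that Map-Reduce adds noticeable overhead for small data sets. The threshold value and all names below are illustrative assumptions and do not represent the WS-Policy-based mechanism developed in the thesis.

```java
// Hypothetical sketch of the dynamic decision: operators with large inputs are
// dispatched to a distributed (Map-Reduce) implementation, small inputs stay
// with an in-memory Web Service implementation.
public class ImplementationSelector {

    interface OperatorImplementation {
        void execute(long inputSizeBytes);
    }

    // Assumed threshold of 512 MB; in practice this would follow from the
    // capabilities and requirements exchanged via policies.
    static final long DISTRIBUTION_THRESHOLD_BYTES = 512L * 1024 * 1024;

    static OperatorImplementation select(long inputSizeBytes) {
        if (inputSizeBytes > DISTRIBUTION_THRESHOLD_BYTES) {
            return size -> System.out.println("Submitting Map-Reduce job for " + size + " bytes");
        }
        return size -> System.out.println("Running in-memory Web Service for " + size + " bytes");
    }

    public static void main(String[] args) {
        select(10_000).execute(10_000);                                   // small input -> in-memory
        select(2L * 1024 * 1024 * 1024).execute(2L * 1024 * 1024 * 1024); // large input -> Map-Reduce
    }
}
```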