Diplomarbeit DIP-2014-01

Bibliographic data
Merg, Michael: Performance Analysis of a Content Management Workload Using Column-oriented Database Technologies.
Universität Stuttgart, Fakultät Informatik, Elektrotechnik und Informationstechnik, Diplomarbeit Nr. 1 (2014).
97 pages, English.
CR classification: C.4 (Performance of Systems)
D.2.8 (Software Engineering Metrics)
H.3.4 (Information Storage and Retrieval Systems and Software)
Abstract

Understanding and managing vast amounts of data is a growing need for today’s enterprises. According to IDC’s Digital Universe study, the global amount of data will grow by a factor of 300 between 2005 and 2020, from 130 exabytes to 40 zettabytes [GR12]. In 2010 the number of companies storing a petabyte or more of data was still relatively small; according to forecasts by EMC, it will grow to over 100,000 organizations by the end of the decade [KG11]. The majority of enterprise data is stored in a loosely structured or unstructured form, for example as files in file systems, content repositories, and groupware applications. Only part of an enterprise’s data is in a well-defined, structured form, for example in a database system that follows a well-defined schema. Managing these large amounts of (unstructured) data is a major challenge, particularly when regulatory retention requirements and legal hold orders must be followed while, at the same time, data is not stored unnecessarily or past its useful lifetime. To gain control of its unstructured data, an enterprise has to create an information inventory: collect the metadata of its data objects and possibly even process their content so that the data can be categorized and managed. The content of the information inventory then enables decisions on how particular parts of the enterprise’s data have to be managed: some data may have to be moved to an enterprise content management repository for safekeeping, other data can be managed in place, and redundant, outdated, and trivial data may simply be deleted. One way to create an information inventory that enables the management of enterprise data is to build a data warehouse that hosts information about the actual unstructured data objects.
Data warehouses can be implemented using database management systems (DBMSs) tuned to handle large amounts of metadata that can then be queried and analyzed. In contrast to traditional DBMSs, where the short response times of transactions (usually a few milliseconds) are an essential performance indicator, the response times of on-line analytical processing (OLAP) actions performed on data warehouse databases can range from seconds to hours or even days. Another quite different characteristic is the distribution of read and write operations: in on-line transactional processing (OLTP) systems the two are well balanced, whereas OLAP systems are read-mostly. This thesis shows how an OLAP-based system that enables enterprises to manage their unstructured data can be improved through the use of column storage and compression in the underlying database management system. In particular, the investigation transferred a key workload from the IBM StoredIQ information appliance to IBM DB2 Universal Database 10.5 with “BLU Acceleration”. Besides performance improvements on the migrated workload, the thesis also demonstrates that complex and inflexible data warehousing query acceleration technologies, such as the up-front aggregation of data in so-called analytics cubes, may no longer be necessary, as complex queries can be run directly against a large data warehouse when columnar storage and compression technologies are used. This enables a much simplified architecture that can also address changing query and analytics requirements far more flexibly.
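The intuition behind the columnar approach described above can be illustrated with a minimal sketch (not taken from the thesis; the table, attribute names, and run-length encoding are illustrative assumptions, not the actual StoredIQ schema or DB2 BLU internals): storing each attribute as its own contiguous array lets low-cardinality columns compress well and lets an analytical aggregate scan only the columns it needs.

```python
# Illustrative sketch: column-oriented layout and simple run-length
# compression for an analytical (read-mostly) workload. The table of
# document metadata below is hypothetical.
from itertools import groupby

# Row-oriented storage: one tuple per data object.
rows = [
    ("a.doc", "docx", 120), ("b.doc", "docx", 90),
    ("c.pdf", "pdf", 300), ("d.pdf", "pdf", 310),
    ("e.pdf", "pdf", 305), ("f.txt", "txt", 10),
]

# Column-oriented layout: one contiguous array per attribute.
names, types, sizes = (list(col) for col in zip(*rows))

def run_length_encode(column):
    """Compress a column as (value, run_length) pairs; sorted or
    low-cardinality columns (like the file type) compress especially well."""
    return [(value, len(list(group))) for value, group in groupby(column)]

encoded_types = run_length_encode(types)  # [("docx", 2), ("pdf", 3), ("txt", 1)]

# An OLAP-style aggregate touches only the columns it needs,
# with no row reassembly and, here, no decompression for the count.
total_size = sum(sizes)
pdf_count = sum(n for value, n in encoded_types if value == "pdf")
```

Real column stores such as DB2 with BLU Acceleration combine this idea with dictionary encoding, SIMD scans over compressed data, and data skipping, but the layout principle is the same.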

Department(s): Universität Stuttgart, Institut für Parallele und Verteilte Systeme, Anwendersoftware
Supervisors: Mitschang, Prof. Bernhard; Lorch, Dr. Markus; Schmidt, Sebastian
Submission date: 31 July 2018