Master Thesis MSTR-2016-88

Bibliography
Walter, Johannes: Design and implementation of a fault simulation layer for the combination technique on HPC systems.
University of Stuttgart, Faculty of Computer Science, Electrical Engineering, and Information Technology, Master Thesis No. 88 (2016).
95 pages, English.
Abstract

In today's supercomputers, computing power is achieved by running a large number of processors in parallel. As the number of simultaneously used processors grows, so does the probability of hardware faults and the process failures they cause. MPI is a popular standard for exchanging messages in networks. Current MPI versions are not fault-tolerant: when a fault occurs, the whole MPI network is terminated. ULFM (User Level Failure Mitigation), a proposed fault-tolerant extension of MPI, does not yet have a stable implementation and is not available on supercomputers. This master's thesis introduces and implements the concept of a fault simulator that acts as an intermediate layer between MPI and the application. The fault simulator is intended to simulate process crashes and the behavior of ULFM without terminating the underlying MPI network.

Department(s): University of Stuttgart, Institute of Parallel and Distributed Systems, Simulation Software Engineering
Supervisor(s): Pflüger, Jun.-Prof. Dirk; Heene, Mario
Entry date: June 19, 2019