Master Thesis MSTR-2020-86

BibliographyBreyer, Marcel: Distributed k-nearest neighbors using Locality Sensitive Hashing and SYCL.
University of Stuttgart, Faculty of Computer Science, Electrical Engineering, and Information Technology, Master Thesis No. 86 (2020).
111 pages, english.

More and more data is made available nowadays. This has the consequence that automatic or semi-automatic methods must be used to process these large data sets. One example of such a method is the classification. The nearest neighbor classifier, for example, assigns a class to a data point x based on its neighborhood. The naive approach to calculate nearest neighbors compares the data point x to all other data points in the data set. However, calculating the nearest neighbors for each data point using the naive approach has a quadratic complexity. This becomes more infeasible, the larger the data set grows. Therefore, approximate algorithms become more attractive. Such an algorithm is the Locality-Sensitive Hashing (LSH) algorithm, which uses hash tables together with locality-sensitive hash functions to reduce the data points that must be examined to calculate the nearest neighbors. In the course of this work sycl_lsh library has been developed. This library implements the LSH algorithm with two different locality-sensitive hash functions, random projections, and entropy-based hash functions. The implementation uses C++17 together with SYCL, which is an abstraction layer for OpenCL that allows targeting different hardware with a single source code. To support large data sets, the implementation utilizes multiple GPUs using MPI to enable the usage of both shared and distributed memory systems. The results include specific tests for all important runtime parameters showing their influence on the runtime, recall, and error ratio. Knowing the behavior of the LSH algorithm concerning the different parameters is essential to be able to tune the algorithm to achieve the desired results while meeting the runtime requirements. These tests have been conducted for both hash function types, which are further compared to each other. Besides, the obtained results show that the used approach can easily scale on multiple GPUs using both locality-sensitive hash function types, achieving a parallel speedup of up to 7.0 when utilizing eight GPUs. Furthermore, it is shown that the sycl_lsh library can be used with three different SYCL implementations, ComputeCpp, hipSYCL, and oneAPI, to target different hardware architectures without any significant performance differences.

Full text and
other links
Department(s)University of Stuttgart, Institute of Parallel and Distributed Systems, Scientific Computing
Superviser(s)Pflüger, Prof. Dirk; Daiß, Gregor
Entry dateAugust 16, 2021
   Publ. Computer Science