Master Thesis MSTR-2022-53

Bibliography: Sangolli, Suhas Devendrakeerti: Deep learning in stream entity resolution.
University of Stuttgart, Faculty of Computer Science, Electrical Engineering, and Information Technology, Master Thesis No. 53 (2022).
82 pages, English.
Abstract

Entity Resolution (ER) determines which virtual representations of entities map to the same real-world entity. Most current ER research in big-data scenarios focuses on the problems of volume and variety. However, with increasing digitization, data is generated not only in bulk but also continuously, so velocity is an issue that the ER domain must address as well. Another major issue in deep learning-based ER is data labelling. It is hard to find pre-labelled data to train a model, and it becomes even more difficult when new data is streamed continuously. In this thesis, we aim to address all of these issues by developing a deep learning-based classification function that ingests continuously streaming entity pairs and classifies them as match or non-match. The end-to-end system has two main layers: one for training and one for prediction. In the training layer, we use a pre-trained language model (DistilBERT) as a base and train it iteratively as newer entity pairs arrive. The labelled data needed to train the model are obtained through active learning. The prediction layer uses the most recently trained model to classify the streaming entity pairs as match or non-match. The training and prediction layers run in parallel and independently of each other. We evaluate the proposed system on several benchmark datasets that vary in size, skewness and domain of origin. As evaluation metrics, we use F1 score, losses, time and iterations. Our iterative model performs similarly to non-iterative models, achieving a match-class F1 score of 0.97 on benchmark datasets.
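As a rough illustration of two ideas mentioned in the abstract, the following Python sketch shows how entity pairs might be selected for labelling and how a match-class F1 score can be computed. It is only a sketch under stated assumptions: the abstract does not say which active-learning strategy the thesis uses, so uncertainty sampling is assumed here, and the token-overlap scorer `jaccard_prob` is a hypothetical stand-in for the fine-tuned DistilBERT classifier. All function names are illustrative.

```python
from typing import Callable, List, Sequence, Tuple

Pair = Tuple[str, str]  # two textual descriptions of possibly the same entity


def jaccard_prob(pair: Pair) -> float:
    """Hypothetical stand-in for the model: token overlap as a match probability."""
    a, b = (set(s.lower().split()) for s in pair)
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def select_for_labelling(
    pairs: Sequence[Pair],
    match_prob: Callable[[Pair], float],
    budget: int,
) -> List[Pair]:
    """Assumed strategy (uncertainty sampling): pick the pairs whose
    predicted match probability lies closest to 0.5."""
    ranked = sorted(pairs, key=lambda p: abs(match_prob(p) - 0.5))
    return ranked[:budget]


def match_f1(y_true: Sequence[int], y_pred: Sequence[int]) -> float:
    """F1 score of the match class (label 1)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Made-up example pairs, not taken from any benchmark dataset.
stream = [
    ("apple iphone 12", "apple iphone 12"),
    ("apple iphone 12", "samsung galaxy"),
    ("apple iphone 12", "iphone 12 pro"),
]
to_label = select_for_labelling(stream, jaccard_prob, budget=1)
```

In a streaming setting such as the one described, a selection step of this kind would be repeated on each incoming batch, with the newly labelled pairs used for the next training iteration while the prediction layer keeps serving the latest available model.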

Department(s): University of Stuttgart, Institute of Parallel and Distributed Systems, Data Engineering
Supervisor(s): Herschel, Prof. Melanie; Gazzarri, Leonardo
Entry date: October 28, 2022