Master's thesis MSTR-2022-53

Bibliographic data
Sangolli, Suhas Devendrakeerti: Deep learning in stream entity resolution.
University of Stuttgart, Faculty of Computer Science, Electrical Engineering and Information Technology, Master's thesis No. 53 (2022).
82 pages, English.
Abstract

Entity Resolution (ER) determines which virtual representations of entities map to the same real-world entity. Most current ER research in big-data scenarios focuses on the problems of volume and variety. However, with increasing digitization, data is generated not only in bulk but also continuously, so velocity must also be addressed in the ER domain. Another major issue in deep learning-based ER is data labelling: pre-labelled training data is hard to find, and labelling becomes even more difficult when new data is streamed continuously. In this thesis, we aim to address these issues by developing a deep learning-based classification function that ingests continuously streamed entity pairs and classifies them as match or non-match. The end-to-end system has two main layers: one for training and one for prediction. In the training layer, we use a pre-trained language model (DistilBERT) as a base and train it iteratively as new entity pairs arrive. The labelled data needed for training is obtained through active learning. The prediction layer uses the most recently trained model to classify the streaming entity pairs as match or non-match. The training and prediction layers run in parallel and independently of each other. We evaluate the proposed system on several benchmark datasets that vary in size, skewness, and domain of origin. As evaluation metrics, we use F1 score, loss, time, and number of iterations. Our iterative model performs similarly to non-iterative models, achieving a match-class F1 score of 0.97 on benchmark datasets.
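
The two-layer design described in the abstract can be illustrated with a minimal sketch. It is not the thesis implementation. DistilBERT is swapped for a toy token-overlap (Jaccard) score with a learned decision threshold so that the example runs with the standard library only. The active learning strategy shown (uncertainty sampling) is an assumption, since the abstract does not name one. All names (`ModelStore`, `training_layer`, `predict`, `LABELS`) and the sample pairs are hypothetical.

```python
import threading
from typing import Callable, Dict, List, Optional, Tuple

Pair = Tuple[str, str]


def similarity(pair: Pair) -> float:
    """Token-level Jaccard similarity: a toy stand-in for a DistilBERT match score."""
    left, right = set(pair[0].lower().split()), set(pair[1].lower().split())
    union = left | right
    return len(left & right) / len(union) if union else 0.0


class ModelStore:
    """Holds the latest 'model' (here only a decision threshold) behind a lock."""

    def __init__(self, threshold: float) -> None:
        self._lock = threading.Lock()
        self._threshold = threshold

    def publish(self, threshold: float) -> None:
        with self._lock:
            self._threshold = threshold

    def latest(self) -> float:
        with self._lock:
            return self._threshold


def select_uncertain(pairs: List[Pair], threshold: float, k: int) -> List[Pair]:
    """Assumed strategy: pick the k pairs closest to the decision boundary."""
    return sorted(pairs, key=lambda p: abs(similarity(p) - threshold))[:k]


def fit_threshold(labelled: List[Tuple[Pair, bool]]) -> Optional[float]:
    """Toy 'training': midpoint between lowest match and highest non-match score."""
    matches = [similarity(p) for p, is_match in labelled if is_match]
    non_matches = [similarity(p) for p, is_match in labelled if not is_match]
    if not matches or not non_matches:
        return None  # need both classes before a boundary can be placed
    return (min(matches) + max(non_matches)) / 2


def training_layer(store: ModelStore, batches: List[List[Pair]],
                   oracle: Callable[[Pair], bool], k: int = 2) -> None:
    """Iteratively label the most uncertain pairs and publish a retrained model."""
    labelled: List[Tuple[Pair, bool]] = []
    for batch in batches:
        for pair in select_uncertain(batch, store.latest(), k):
            labelled.append((pair, oracle(pair)))  # oracle = human annotator
        new_threshold = fit_threshold(labelled)
        if new_threshold is not None:
            store.publish(new_threshold)


def predict(store: ModelStore, pair: Pair) -> str:
    """Prediction layer: classify with whichever model is currently the latest."""
    return "match" if similarity(pair) >= store.latest() else "non-match"


# Hypothetical ground truth, standing in for the human labeller.
LABELS: Dict[Pair, bool] = {
    ("apple iphone 12", "apple iphone 12"): True,
    ("apple iphone 12", "iphone 12 apple inc"): True,
    ("apple iphone 12", "samsung galaxy s21"): False,
    ("dell xps 13", "dell xps 15"): False,
}
BATCH = list(LABELS)

store = ModelStore(threshold=0.4)
trainer = threading.Thread(target=training_layer,
                           args=(store, [BATCH], LABELS.__getitem__))
trainer.start()
early = predict(store, BATCH[0])  # prediction does not wait for training
trainer.join()
print(store.latest(), predict(store, ("dell xps 13", "dell xps 15")))
```

In this toy run, only the two pairs nearest the initial boundary are labelled. That is enough to move the threshold from 0.4 to 0.625, so the near-duplicate Dell pair is no longer misclassified as a match. A real system along the lines of the abstract would replace `similarity` and `fit_threshold` with fine-tuning and inference of the language model, while the shared, lock-protected "latest model" slot would still be one plausible way to keep the two layers independent.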

Full text and other links
Full text
Department(s): University of Stuttgart, Institute for Parallel and Distributed Systems, Data Engineering
Supervisor(s): Herschel, Prof. Melanie; Gazzarri, Leonardo
Entry date: October 28, 2022