Master Thesis MSTR-2020-72

Bibliography	Elzayat, Kareem: Incremental Schema-agnostic Blocking for Entity Resolution of Web Data. University of Stuttgart, Faculty of Computer Science, Electrical Engineering, and Information Technology, Master Thesis No. 72 (2020). 57 pages, English.
Abstract	Entity resolution is the process of identifying data items that differ in their content but represent the same real-world object. When it comes to Web data, data items are mostly unstructured and arrive in huge streams. The inconsistency in the structure of the data, the high speed at which data is generated, and the huge number of data items present on the Web impose serious challenges on entity resolution. These challenges, if not dealt with properly, greatly hinder the scalability and effectiveness of entity resolution when applied to Web data. Overcoming them calls for an incremental workflow that utilizes schema-agnostic blocking methods capable of handling huge streams of unstructured data. The importance of blocking lies in minimizing the number of pairwise comparisons among the data items to achieve a scalable and effective workflow. Additionally, utilizing incremental and schema-agnostic methods forms the basis of efficient handling of streams of structure-inconsistent data. We propose an incremental entity resolution workflow that uses the analyzer, a component whose core lies in implementing incremental schema-agnostic blocking methods. We have developed a novel method based on representativity that incrementally utilizes the found matches to pick a representative and minimize the number of comparisons. Moreover, we accompany this method with incremental versions of two state-of-the-art methods, both of which help to further reduce the number of comparisons. Extensive experiments on five real datasets have shown that our novel method alone is inadequate for scalability and that it does not result in a significant enhancement compared to a naive approach. On the other hand, accompanying it with the other two methods leads to a much more scalable, efficient, and effective workflow.
Department(s)	University of Stuttgart, Institute of Parallel and Distributed Systems, Data Engineering
Supervisor(s)	Herschel, Prof. Melanie; Gazzarri, Leonardo
Entry date	April 22, 2021
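
The idea of incremental schema-agnostic blocking named in the abstract can be illustrated with a minimal sketch. It uses plain token blocking, a common schema-agnostic technique, and is not the thesis's representativity-based method or its analyzer component. The class name `IncrementalTokenBlocker`, its methods, and the sample records are hypothetical and chosen only for illustration.

```python
import re
from collections import defaultdict


class IncrementalTokenBlocker:
    """Hypothetical sketch of incremental token blocking (not the thesis's method)."""

    def __init__(self):
        # Inverted index: token -> ids of all items seen so far that contain it.
        self.blocks = defaultdict(set)

    @staticmethod
    def tokens(item):
        # Schema-agnostic: attribute names are ignored, only values are tokenized.
        text = " ".join(str(value) for value in item.values())
        return set(re.findall(r"[a-z0-9]+", text.lower()))

    def add(self, item_id, item):
        """Index a newly arrived item and return the ids it should be compared with."""
        item_tokens = self.tokens(item)
        candidates = set()
        # Collect earlier items sharing at least one token with the new item.
        for token in item_tokens:
            candidates |= self.blocks[token]
        # Only then add the new item, so it is never its own candidate.
        for token in item_tokens:
            self.blocks[token].add(item_id)
        return candidates


blocker = IncrementalTokenBlocker()
first = blocker.add(1, {"name": "Apple iPhone 12"})
second = blocker.add(2, {"title": "iphone 12 by apple", "price": 799})
third = blocker.add(3, {"label": "Samsung Galaxy"})
```

Each arriving item is compared only with earlier items that share a token, regardless of which attribute the token came from, so the third record above triggers no comparisons at all.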

Publ. Computer Science