| Bibliography | Pfeffer, Moritz: Leveraging Large Language Models for Entity Matching. University of Stuttgart, Faculty of Computer Science, Electrical Engineering, and Information Technology, Master Thesis No. 132 (2024). 95 pages, English.
|
| Abstract | Entity matching can be viewed as a classification problem that aims to identify matches between profiles that refer to the same entity but differ in how they represent it. It is at the heart of various applications, from disambiguating individuals in census data to comparing products online. Yet, existing solutions are often not effective on semi-structured data. Recently, the emergence of large language models (LLMs) has led to the development of LLM classifiers for entity matching. We expand on this by proposing composite matchers that efficiently and effectively integrate the LLM classifier. Using our composite matchers, we also tackle the problem of effectiveness on semi-structured data.
To this end, we investigate schema-agnostic entity matching in the batch setting using the gpt-3.5-turbo-instruct LLM by OpenAI. Our methodology can be applied using any LLM that provides token probabilities. We construct an LLM classifier that classifies profile pairs as matches or non-matches and provides confidence scores based on LLM token probabilities. This is a departure from previous studies that focused solely on sampled responses without confidence information. Our LLM matcher is similar to existing solutions in that it uses the LLM classifier but disregards the confidences, making it a standard binary classifier. It exhibits only mediocre classification performance, as measured by F1 scores, on many well-established benchmark datasets.
In our novel approach, we go beyond the LLM matcher and develop composite matchers that efficiently and effectively integrate the LLM classifier. These composite matchers include the discarding matcher, the selective matcher, and the discarding selective matcher. The discarding matcher efficiently integrates the LLM classifier by placing a fast similarity-based discarder in front of it. The discarder achieves significant cost and time savings, while only slightly decreasing classification performance, as measured by the F1 score. The selective matcher uses the confidences returned by the LLM classifier to guide manual labeling. It outperforms the state-of-the-art deep learning approach DITTO in a setting of limited label availability. Finally, the discarding selective matcher combines efficiency with human labeling for effectiveness. We find that even when human labeling is required for high quality, a significant number of pairs can be discarded without negatively affecting classification performance. We compare the matchers along the dimensions of time, cost, classification performance, and discarding error, and from this comparison we recommend composite matchers that offer desirable tradeoffs between efficiency and effectiveness.
Beyond the narrow focus on structured, relational data, we broaden our evaluation to encompass diverse datasets, including the semi-structured DBpedia dataset and datasets corrupted by synthetic data errors. We find that the classification performance of the LLM matcher on DBpedia is better than on many structured datasets. Additionally, we find that the recommended composite matchers perform as intended on the DBpedia dataset and accomplish their respective objectives of efficiency or effectiveness. The fast and cheap matcher demonstrates significant cost and time savings of roughly 80% without compromising classification performance. The two high-quality matchers achieve strong classification performance, with an F1 score of 0.96 on DBpedia. When dealing with synthetic data errors, we observe that a mixed schema with embedded values, which may be present in dirty semi-structured datasets, poses a significant challenge for the LLM and, therefore, the LLM matcher. Additionally, we uncover instances of LLM brittleness, such as sensitivity to attribute order and slight variations in prompt design. Overall, our insights contribute to the development of strategies for the efficient and effective integration of LLMs for entity matching across diverse datasets.
|
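The composite matchers described in the abstract can be illustrated with a minimal Python sketch. It is not the thesis's implementation: the function names, the token-level Jaccard similarity used as the discarder, the threshold values, and the `llm_logprobs` callable (a stand-in for whatever returns the log-probabilities of the "match" and "non-match" answer tokens) are all assumptions made for illustration only.

```python
import math
from typing import Callable, Tuple


def match_confidence(logprob_yes: float, logprob_no: float) -> float:
    """Turn the log-probabilities of the two answer tokens into P(match)."""
    p_yes, p_no = math.exp(logprob_yes), math.exp(logprob_no)
    return p_yes / (p_yes + p_no)


def jaccard(a: str, b: str) -> float:
    """Token-level Jaccard similarity (an assumed, cheap discarder signal)."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    if not ta and not tb:
        return 0.0
    return len(ta & tb) / len(ta | tb)


def composite_decision(
    left: str,
    right: str,
    llm_logprobs: Callable[[str, str], Tuple[float, float]],
    discard_below: float = 0.1,                        # hypothetical threshold
    uncertain_band: Tuple[float, float] = (0.3, 0.7),  # hypothetical band
) -> str:
    """Return "match", "non-match", or "ask-human" for one profile pair."""
    # Discarding step: clearly dissimilar pairs never reach the LLM.
    if jaccard(left, right) < discard_below:
        return "non-match"
    # LLM classifier step: confidence derived from token probabilities.
    conf = match_confidence(*llm_logprobs(left, right))
    # Selective step: low-confidence pairs are routed to manual labeling.
    lo, hi = uncertain_band
    if lo <= conf <= hi:
        return "ask-human"
    return "match" if conf > hi else "non-match"
```

In this sketch, dropping the discarding step gives something like a selective matcher, and shrinking the uncertain band to nothing gives something like a discarding matcher; how the thesis actually chooses its similarity measure and selection rule is not stated in the abstract.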