Article in Proceedings INPROC-2016-29

Bibliography: Kiefer, Cornelia: Assessing the Quality of Unstructured Data: An Initial Overview.
In: Krestel, Ralf (ed.); Mottin, Davide (ed.); Müller, Emmanuel (ed.): Proceedings of the LWDA 2016 Proceedings (LWDA).
University of Stuttgart, Faculty of Computer Science, Electrical Engineering, and Information Technology.
pp. 62-73, English.
Aachen: CEUR Workshop Proceedings, September 2016.
ISSN: 1613-0073.
Article in Proceedings (Conference Paper).
CR-Schema: A.1 (General Literature, Introductory and Survey)
I.2.7 (Natural Language Processing)
Keywords: quality of unstructured data, quality of text data, data, quality dimensions, data quality assessment, data quality metrics

In contrast to structured data, unstructured data such as texts, speech, videos and pictures do not come with a data model that enables a computer to use them directly. Nowadays, computers can interpret the knowledge encoded in unstructured data using methods from text analytics, image recognition and speech recognition. Therefore, unstructured data are increasingly used in decision-making processes. But although decisions are commonly based on unstructured data, data quality assessment methods for unstructured data are lacking. We consider data analysis pipelines built upon two types of data consumers: human consumers, who usually come at the end of the pipeline, and non-human (machine) consumers (e.g., natural language processing modules such as part-of-speech taggers and named entity recognizers), which mainly work at intermediate stages. We define data quality of unstructured data via (1) the similarity of the input data to the data expected by these consumers of unstructured data and via (2) the similarity of the input data to the data representing the real world. We deduce data quality dimensions from the elements in analytic pipelines for unstructured data and characterize them. Finally, we propose automatically measurable indicators for assessing the quality of unstructured text data and give hints towards an implementation.
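
As an illustration of the first definition in the abstract (similarity of the input data to the data a consumer expects), the following minimal Python sketch computes the share of tokens that a consumer would recognize. It is not taken from the paper: the function names `tokenize` and `known_token_ratio`, the tiny vocabulary and the choice of a simple known-token ratio as the indicator are all assumptions made for this example.

```python
import re


def tokenize(text):
    """Lowercase the text and extract runs of ASCII letters as tokens."""
    return re.findall(r"[a-z]+", text.lower())


def known_token_ratio(text, expected_vocabulary):
    """Return the share of tokens found in the vocabulary the consumer expects.

    Hypothetical indicator: 1.0 means every token is known to the consumer,
    0.0 means none is (or the text contains no tokens at all).
    """
    tokens = tokenize(text)
    if not tokens:
        return 0.0
    known = sum(1 for token in tokens if token in expected_vocabulary)
    return known / len(tokens)


# Assumed toy vocabulary standing in for what a tagger was trained on.
vocabulary = {"the", "data", "quality", "is", "high"}

clean_score = known_token_ratio("The data quality is high", vocabulary)
noisy_score = known_token_ratio("Teh dta qualty is hgih", vocabulary)
```

Under these assumptions, the misspelled input scores lower than the clean one, which is the kind of signal an automatically measurable indicator for a machine consumer might provide.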

Full text and other links:
Link to the paper
Link to the proceedings
Department(s): University of Stuttgart, Institute of Parallel and Distributed Systems, Applications of Parallel and Distributed Systems
Entry date: September 19, 2016