Article in Proceedings INPROC-2016-29

Bibliographic data
Kiefer, Cornelia: Assessing the Quality of Unstructured Data: An Initial Overview.
In: Krestel, Ralf (Ed.); Mottin, Davide (Ed.); Müller, Emmanuel (Ed.): Proceedings of the LWDA 2016 Proceedings (LWDA).
Universität Stuttgart, Fakultät Informatik, Elektrotechnik und Informationstechnik.
pp. 62-73, English.
Aachen: CEUR Workshop Proceedings, September 2016.
ISSN: 1613-0073.
Article in proceedings (conference paper).
CR classification: A.1 (General Literature, Introductory and Survey)
I.2.7 (Natural Language Processing)
Keywords: quality of unstructured data, quality of text data, data, quality dimensions, data quality assessment, data quality metrics
Abstract

In contrast to structured data, unstructured data such as texts, speech, videos and pictures do not come with a data model that enables a computer to use them directly. Nowadays, computers can interpret the knowledge encoded in unstructured data using methods from text analytics, image recognition and speech recognition. Therefore, unstructured data are increasingly used in decision-making processes. But although decisions are commonly based on unstructured data, data quality assessment methods for unstructured data are lacking. We consider data analysis pipelines built upon two types of data consumers: human consumers, who usually come at the end of the pipeline, and non-human (machine) consumers (e.g., natural language processing modules such as part-of-speech taggers and named entity recognizers), which mainly work at intermediate stages. We define data quality of unstructured data via (1) the similarity of the input data to the data expected by these consumers of unstructured data and via (2) the similarity of the input data to the data representing the real world. We deduce data quality dimensions from the elements in analytic pipelines for unstructured data and characterize them. Finally, we propose automatically measurable indicators for assessing the quality of unstructured text data and give hints towards an implementation.
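
As an illustration of definition (1), the similarity of input text to the data a consumer expects can be approximated by a simple vocabulary overlap. The sketch below is an assumption made for this record, not an indicator taken from the paper: the function names `tokenize` and `vocabulary_overlap`, the sample texts, and the choice of token overlap as the similarity measure are all hypothetical.

```python
import re


def tokenize(text: str) -> list:
    """Lowercase the text and return its alphabetic word tokens."""
    return re.findall(r"[a-z]+", text.lower())


def vocabulary_overlap(input_text: str, expected_text: str) -> float:
    """Share of input tokens that also occur in the expected (reference) text.

    Hypothetical indicator: returns a value in [0, 1], where 1.0 means every
    input token is known to the consumer's reference vocabulary.
    """
    expected_vocab = set(tokenize(expected_text))
    tokens = tokenize(input_text)
    if not tokens:
        # No tokens to compare; treat empty input as having no overlap.
        return 0.0
    known = sum(1 for token in tokens if token in expected_vocab)
    return known / len(tokens)


# Illustrative data only: a reference text standing in for what a machine
# consumer (e.g., a tagger) was trained on, plus a clean and a noisy input.
reference = "the quick brown fox jumps over the lazy dog"
clean = "the lazy dog jumps"
noisy = "teh lzy dog jmps"

print(vocabulary_overlap(clean, reference))  # 1.0
print(vocabulary_overlap(noisy, reference))  # 0.25
```

A low score suggests that the input deviates from what the consumer expects, for example because of spelling errors. A practical indicator would likely need a larger reference corpus and a more robust tokenizer than this sketch uses.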

Full text and other links
Link to the paper
Link to the proceedings
Contact: cornelia.kiefer@gsame.uni-stuttgart.de
Department(s): Universität Stuttgart, Institut für Parallele und Verteilte Systeme, Anwendersoftware
Entry date: 19 September 2016