Diplomarbeit DIP-3568

Bibliograph.
Daten
Thiele, Gregor: Graphical Error Mining For Linguistic Annotated Corpora.
Universität Stuttgart, Fakultät Informatik, Elektrotechnik und Informationstechnik, Diplomarbeit Nr. 3568 (2013).
73 Seiten, englisch.
CR-Klassif.D.1.5 (Object-oriented Programming)
H.5.2 (Information Interfaces and Presentation User Interfaces)
J.5 (Arts and Humanities)
Kurzfassung

Abstract

Corpora contain linguistically annotated data. Producing these annotations is a complex process that easily leads to inconsistencies within the annotation. Since corpora are used to evaluate automatic language processing systems the evaluation may suffer when there are too many errors within the data.

This thesis focuses on finding erroneous annotations within corpora. To detect sequence annotation errors within part-of-speech tags we implemented the algorithm introduced by Dickinson and Meurers (2003). Additionally for structured annotations we choose the approach shown in Boyd et al.(2008) that targets inconsistency within dependency structures.

We designed and built a graphical user interface (GUI) that is easy to handle and user-friendly. Implementing state-of-the-art algorithms for error detection with an user-friendly interface increase the operation domain because the algorithms can be used by a wider audience without deeper knowledge of computers. It provides even non-expert users with the capability to find inconsistent pos tags and dependency structures within a corpus. We evaluate the system using the German TIGER corpus and the English Penn Treebank. For the TIGER corpus we also perform a manual evaluation where we sample 115 6-grams and check manually if these contain errors. We find that 94.96% are erroneous and it is easy to decide the correct tag as a human. For 4.20% we can say that these are errors but determining the correct tag is very to difficult. In total we detect errors with a precision of 99.16%. Only one case (0.84%) is not caused by inconsistency but constitutes genuine ambiguity.

Dickinson, M. and Meurers, W. D. (2003). Detecting errors in part-of-speech annotation. In Proceedings of the 10th Conference of the European Chapter of the Association for Computational Linguistics (EACL-03), pages 107–114, Budapest, Hungary.

Boyd, A., Dickinson, M., and Meurers, D. (2008). On detecting errors in dependency treebanks. Research on Language and Computation, 6(2):113–137. (Cited on pages 3, 11, 12, 15, 21, 53 and 55)

Volltext und
andere Links
PDF (1748351 Bytes)
Abteilung(en)Universität Stuttgart, Institut für Maschinelle Sprachverarbeitung
BetreuerSeeker; Wolfgang
Eingabedatum12. Juni 2014
   Publ. Informatik