Diploma Thesis DIP-3568

Bibliography: Thiele, Gregor: Graphical Error Mining For Linguistic Annotated Corpora.
University of Stuttgart, Faculty of Computer Science, Electrical Engineering, and Information Technology, Diploma Thesis No. 3568 (2013).
73 pages, English.
CR-Schema: D.1.5 (Object-oriented Programming)
H.5.2 (Information Interfaces and Presentation User Interfaces)
J.5 (Arts and Humanities)
Abstract

Corpora contain linguistically annotated data. Producing these annotations is a complex process that easily leads to inconsistencies within the annotation. Since corpora are used to evaluate automatic language processing systems, the evaluation may suffer when the data contain too many errors.

This thesis focuses on finding erroneous annotations within corpora. To detect sequence annotation errors within part-of-speech tags, we implemented the algorithm introduced by Dickinson and Meurers (2003). Additionally, for structured annotations we chose the approach of Boyd et al. (2008), which targets inconsistencies within dependency structures.
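The idea behind detecting inconsistent part-of-speech tags can be illustrated with a minimal sketch. It is not the thesis implementation and simplifies the method of Dickinson and Meurers (2003): it only collects word n-grams that recur in a corpus with different tags at some position. The function name `find_variation_ngrams`, the toy corpus, and the data layout (sentences as lists of `(word, tag)` pairs) are assumptions made for illustration.

```python
from collections import defaultdict


def find_variation_ngrams(sentences, n):
    """Return word n-grams that occur with differing tags (simplified sketch).

    sentences: list of sentences, each a list of (word, tag) pairs.
    Result: dict mapping a word n-gram to {position: sorted list of tags}
    for every position whose tag varies across occurrences.
    """
    # Collect every tag sequence observed for each word n-gram.
    tags_seen = defaultdict(set)
    for sentence in sentences:
        words = tuple(word for word, _ in sentence)
        tags = tuple(tag for _, tag in sentence)
        for start in range(len(sentence) - n + 1):
            tags_seen[words[start:start + n]].add(tags[start:start + n])

    # Keep only n-grams where at least one position shows tag variation.
    variations = {}
    for ngram, tag_sequences in tags_seen.items():
        varying = {}
        for position in range(n):
            tags_here = {seq[position] for seq in tag_sequences}
            if len(tags_here) > 1:
                varying[position] = sorted(tags_here)
        if varying:
            variations[ngram] = varying
    return variations


# Hypothetical toy corpus: "to" is tagged inconsistently in the same context.
toy_corpus = [
    [("the", "DT"), ("plan", "NN"), ("to", "TO"), ("run", "VB")],
    [("the", "DT"), ("plan", "NN"), ("to", "IN"), ("run", "VB")],
    [("a", "DT"), ("plan", "NN")],
]

for ngram, varying in find_variation_ngrams(toy_corpus, 3).items():
    print(" ".join(ngram), varying)
```

The longer the identical context around a word with varying tags, the more likely the variation is an annotation error rather than genuine ambiguity, which is why candidates are usually inspected for larger values of `n`.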

We designed and built a graphical user interface (GUI) that is easy to use. Combining state-of-the-art error detection algorithms with a user-friendly interface broadens their range of application, because the algorithms can be used by a wider audience without deeper knowledge of computers. The interface enables even non-expert users to find inconsistent POS tags and dependency structures within a corpus. We evaluate the system using the German TIGER corpus and the English Penn Treebank. For the TIGER corpus we also perform a manual evaluation in which we sample 115 6-grams and check manually whether they contain errors. We find that 94.96% are erroneous and that a human can easily decide on the correct tag. For a further 4.20% we can say that they are errors, but determining the correct tag is very difficult. In total we detect errors with a precision of 99.16%. Only one case (0.84%) is not caused by inconsistency but constitutes genuine ambiguity.

Dickinson, M. and Meurers, W. D. (2003). Detecting errors in part-of-speech annotation. In Proceedings of the 10th Conference of the European Chapter of the Association for Computational Linguistics (EACL-03), pages 107–114, Budapest, Hungary.

Boyd, A., Dickinson, M., and Meurers, D. (2008). On detecting errors in dependency treebanks. Research on Language and Computation, 6(2):113–137.

Full text and other links: PDF (1748351 Bytes)
Department(s): University of Stuttgart, Institute for Natural Language Processing
Supervisor(s): Seeker, Wolfgang
Entry date: June 12, 2014