Master Thesis MSTR-2022-102

Bibliography: Sihag, Nidhi: Generating TEI-based XML for literary texts.
University of Stuttgart, Faculty of Computer Science, Electrical Engineering, and Information Technology, Master Thesis No. 102 (2022).
74 pages, English.


Generating TEI-based XML files for literary texts is a long-standing problem in Natural Language Processing. The task requires a system that encodes a text with its relevant TEI tags; we address the challenge of enriching plain text with the learned XML elements. We deal with theatre plays (i.e. dramatic texts) and letters, which are encoded in XML. At present, such XML files exist for a few hundred plays and letters, but creating this kind of annotation manually is very labor-intensive, and newly digitized plays are initially available only as (OCR'd) plain text. We therefore build an automatic process: once the XML elements are treated as annotations, predicting them becomes essentially a sequence labeling task.
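To illustrate how XML annotations can be recast as a sequence labeling problem, the following minimal sketch flattens a small TEI-like fragment into (token, BIO-label) pairs. This is an assumption about the preprocessing, not the thesis's actual pipeline; the `speaker` tag and the `sp` wrapper are merely illustrative TEI element names.

```python
import xml.etree.ElementTree as ET

def xml_to_bio(xml_string):
    """Flatten a small TEI-like fragment into (token, BIO-label) pairs.

    Tokens inside an annotated child element get B-/I- labels named
    after the tag; text outside any annotated element gets the O label.
    """
    root = ET.fromstring(xml_string)
    pairs = []
    # Text before the first child element is unannotated.
    for tok in (root.text or "").split():
        pairs.append((tok, "O"))
    for child in root:
        toks = (child.text or "").split()
        for i, tok in enumerate(toks):
            prefix = "B" if i == 0 else "I"
            pairs.append((tok, f"{prefix}-{child.tag}"))
        # Text following the element (its "tail") is again unannotated.
        for tok in (child.tail or "").split():
            pairs.append((tok, "O"))
    return pairs

fragment = "<sp><speaker>HAMLET</speaker>To be or not to be</sp>"
print(xml_to_bio(fragment))
# → [('HAMLET', 'B-speaker'), ('To', 'O'), ('be', 'O'), ('or', 'O'),
#    ('not', 'O'), ('to', 'O'), ('be', 'O')]
```

With pairs in this form, a tagger trained on them can annotate fresh OCR'd plain text, and the predicted labels can be converted back into XML elements.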

This thesis takes its starting point from recent advances in Natural Language Processing built upon the Transformer model. One of the most significant recent developments was the release of a deep bidirectional encoder called BERT, which broke several state-of-the-art results upon its release. BERT utilises Transfer Learning to improve the modelling of language dependencies in texts. While BERT is used for many different Natural Language Processing tasks, this thesis looks at Named Entity Recognition, treated here as a sequence labeling task.

The purpose of this thesis is to investigate whether Bidirectional Encoder Representations from Transformers (BERT) is suitable for the automatic annotation of plain text. To this end, we follow a deep learning approach for the extraction of plain text along with its tags from XML files. We use a neural network architecture based on BERT, a deep language representation model that has significantly increased performance on many natural language processing tasks. We experiment with different BERT models and input formats. The experiments are evaluated on a challenging dataset that contains letters in English and plays in multiple languages.
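One input-format decision any BERT-based tagger must make is how to align word-level labels (e.g. BIO tags derived from the XML elements) with BERT's subword tokens, since WordPiece splits rare words into several pieces. A common convention, sketched below, keeps the label on the first piece and masks continuation pieces out of the loss with the ignore index -100. The tokenizer here is a deliberately trivial stand-in for a real WordPiece vocabulary, used only to make the alignment logic concrete.

```python
IGNORE = -100  # loss-masking index conventionally used for continuation pieces

def toy_wordpiece(word):
    # Stand-in for a real WordPiece tokenizer: chop into 4-character
    # pieces, marking continuations with the "##" convention.
    return [word[:4]] + [f"##{word[i:i + 4]}" for i in range(4, len(word), 4)]

def align_labels(words, labels):
    """Expand word-level labels to subtoken level: the first piece of
    each word keeps the label, continuation pieces are masked out."""
    subtokens, sublabels = [], []
    for word, label in zip(words, labels):
        pieces = toy_wordpiece(word)
        subtokens.extend(pieces)
        sublabels.extend([label] + [IGNORE] * (len(pieces) - 1))
    return subtokens, sublabels

print(align_labels(["HAMLET", "speaks"], ["B-speaker", "O"]))
# → (['HAML', '##ET', 'spea', '##ks'], ['B-speaker', -100, 'O', -100])
```

At prediction time the masked positions are simply skipped, so each word receives exactly one predicted tag regardless of how many pieces it was split into.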

Department(s): University of Stuttgart, Institute for Natural Language Processing
Supervisor(s): Kuhn, Prof. Jonas; Pagel, Janis
Entry date: April 18, 2023