Diploma Thesis DIP-2011-42

Bibliography

Müller, Thomas: The Morphological Component of a Joint Morphological-Distributional Class Language Model.
University of Stuttgart, Faculty of Computer Science, Electrical Engineering, and Information Technology, Diploma Thesis No. 42 (2011).
70 pages, English.
Abstract

Modeling out-of-vocabulary (OOV) words, i.e. words that do not occur in the training corpus but do occur in the natural language processing (NLP) task at hand, is a challenging problem in statistical language modeling. We empirically investigate the relation between word context, word class, and word morphology in English and present a class-based language model that groups rare words of similar morphology together. The model improves the prediction of words after histories containing OOV words. The morphological features are obtained without labeled data, yet they produce a number of syntactically and even semantically related clusters. Overall, our model improves perplexity by 4% over a state-of-the-art Kneser-Ney model, and by 81% on unknown histories. We conclude that using morphological features in English language modeling is worthwhile.

Department(s): University of Stuttgart, Institute of Parallel and Distributed Systems, Applications of Parallel and Distributed Systems
Supervisor(s): Mitschang, Prof. Bernhard; Schütze, Prof. Hinrich
Entry date: May 4, 2020