Diploma Thesis DIP-2011-42

Bibliographic data	Müller, Thomas: The Morphological Component of a Joint Morphological-Distributional Class Language Model. Universität Stuttgart, Fakultät Informatik, Elektrotechnik und Informationstechnik, Diploma Thesis No. 42 (2011). 70 pages, English.
Abstract	Modeling out-of-vocabulary (OOV) words, i.e. words that do not occur in the training corpus but do occur in the natural language processing (NLP) task at hand, is a challenging problem in statistical language modeling. We empirically investigate the relation between word context, word class and word morphology in English and present a class-based language model that groups rare words of similar morphology together. The model improves the prediction of words after histories containing out-of-vocabulary words. The morphological features are obtained without labeled data, yet produce a number of syntactically and even semantically related clusters. Overall, our model improves perplexity by 4% compared to a state-of-the-art Kneser-Ney model, and by 81% on unknown histories. We conclude that using morphological features in English language modeling is worthwhile.
Department(s)	Universität Stuttgart, Institut für Parallele und Verteilte Systeme, Anwendersoftware
Supervisor(s)	Mitschang, Prof. Bernhard; Schütze, Prof. Hinrich
Entry date	May 4, 2020