Diploma thesis DIP-2011-42

Bibliographic data
Müller, Thomas: The Morphological Component of a Joint Morphological-Distributional Class Language Model.
University of Stuttgart, Faculty of Computer Science, Electrical Engineering and Information Technology, Diploma Thesis No. 42 (2011).
70 pages, English.
Abstract

Modeling out-of-vocabulary (OOV) words, i.e. words that do not occur in the training corpus but do occur in the natural language processing (NLP) task at hand, is a challenging problem in statistical language modeling. We empirically investigate the relation between word context, word class and word morphology in English and present a class-based language model that groups rare words of similar morphology together. The model improves the prediction of words that follow histories containing out-of-vocabulary words. The morphological features are obtained without labeled data, yet they produce a number of syntactically and even semantically related clusters. Compared to a state-of-the-art Kneser-Ney model, our model improves perplexity by 4% overall and by 81% on unknown histories. We conclude that using morphological features in English language modeling is worthwhile.

Department(s): University of Stuttgart, Institute for Parallel and Distributed Systems, Application Software
Supervisor(s): Mitschang, Prof. Bernhard; Schütze, Prof. Hinrich
Entry date: May 4, 2020