Bachelor's thesis BCLR-2020-52

Kaiser, Jens: Dimensionality and noise in models of semantic change detection.
Universität Stuttgart, Fakultät Informatik, Elektrotechnik und Informationstechnik, Bachelor's thesis No. 52 (2020).
28 pages, English.

This thesis analyses dimensionality and noise in models of Semantic Change Detection. Words change their meaning over time due to social and technological influences. This change can be recognised and quantified with automated models, which usually consist of three steps: 1) create word embeddings on corpus t1 and corpus t2, 2) align the vector spaces of the embeddings, 3) measure changes between vectors. For a given set of so-called "target words", the models must generate a ranking that orders the words by their measured change of meaning.

Word embeddings are low-dimensional (about 2 to 1000 dimensions) vector representations of words. Word vectors can be generated in different ways and contain information about semantic relationships between words. In this thesis we use Skip-Gram Negative Sampling (SGNS) to create the embeddings. SGNS is a widely used model in Natural Language Processing, especially in the field of Semantic Change Detection. It is based on a neural network that attempts to predict the context of a given word and is trained on a large text corpus. The SGNS hyper-parameters we investigate are dimensionality and the number of training epochs. Dimensionality determines the number of dimensions of the word vectors. The number of training epochs determines how often SGNS iterates over the corpus; iterating several times artificially increases the amount of training data.

Data sets for Semantic Change Detection consist of two or more corpora from different time periods. Since word embeddings are created on all corpora, the vector spaces must be aligned; without alignment they cannot be compared directly. If embeddings are created independently, the columns of the vectors may represent different axes. We investigate three modern models that address this problem in different ways:

1) Vector Initialisation (VI), Kim et al. (2014): SGNS is first trained on one of the corpora. The weights of this model are then stored and used to initialise the weights of a second SGNS model, which is trained on the second corpus. The intuition is that the word vectors are already learned when they are used for initialisation, so further training should only change the vectors of words that have changed their meaning or usage.
2) Orthogonal Procrustes (OP), Hamilton et al. (2016): Two SGNS models are trained independently on the two corpora. Then an orthogonally constrained rotation matrix is calculated, which maps one vector space onto the other as closely as possible.
3) Word Injection (WI), Ferrari et al. (2017): A special symbol is inserted after every target word to mark the time period it comes from (t1 or t2). The two corpora are then merged into one large corpus. The SGNS model trained on this corpus creates two vectors for each target word, one for t1 and one for t2; all remaining words get only one vector each. Because the embeddings are generated by the same model, they are already aligned to each other.

Based on our own observations during the seminar "Lexical Semantic Change Detection" about the behaviour of VI with extremely low dimensionality, and on the works of Dubossarsky et al. (2018) and Yin and Shan (2018), we developed the following four hypotheses:

- The optimal dimensionality is different for all three models.
- VI has a lower optimal dimensionality than OP, and OP has a lower one than WI.
- VI captures more noise than OP, and OP captures more noise than WI, at equal dimensionality.
- The optimal dimensionality for each model is a function of other parameters such as the number of training epochs and corpus size.

Noise is defined as information that describes non-semantic relationships between words. These hypotheses served as a guide for the first part of the experiments.

We used data sets provided by Schlechtweg et al. (2020).
These cover four languages (German, English, Latin and Swedish), with different corpus sizes and time periods. The German and Swedish corpora, for example, are much larger than the English and Latin corpora. For each language there is also a ranking in which the target words are listed according to their true degree of meaning change. This ranking was created manually and is used to determine how well the models can detect semantic change.

In the first experiment we evaluate the three alignment methods at different dimensionalities and additionally measure noise at each dimensionality. With the results of this experiment we try to test the four hypotheses. The first hypothesis could be neither definitively confirmed nor disproved, because VI and WI often had similar optimal dimensionality. The second hypothesis could therefore not be definitively answered either; in addition, the optimal dimensionality of OP was partly above and partly below that of VI and WI. The third hypothesis was refuted under the procedure we used to measure noise in this experiment: OP was often the method with the greatest amount of noise, while VI and WI again showed similar values. The last hypothesis could also not be answered definitively. How the number of training epochs influences the optimal dimensionality of VI is investigated in the subsequent experiments; for OP and WI, no correlation between the two values could be detected. How corpus size influences the optimal dimensionality could not be answered, because the results on the two smaller corpora showed very large variances and no clear optimum of dimensionality could be identified.

An interesting observation of the first experiment is that the results of VI get progressively worse as dimensionality increases. In principle, it should be possible to explain this behaviour by higher levels of noise in high dimensions.
Yet the approach of our noise measurement could not explain the deterioration.

The next experiment investigates the relationship between word frequency and the measured change for the corresponding word. Such a relationship can be regarded as a form of noise. Here, VI shows a correlation between frequency and the change ranking, and this correlation grows with increasing dimensionality. OP and WI show no significant correlation between the two values. With VI, as soon as this correlation exceeds a certain value, the results on the test data begin to deteriorate. We show that this frequency bias is the cause of the bad results.

In the final experiment, we present two approaches that strongly reduce frequency bias and thus improve results at high dimensionality. The first is to increase the number of epochs, and the second is to normalise the vectors used to initialise the second SGNS model. Why these procedures prevent frequency bias could not be identified.
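
The frequency-bias check described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the code used in the thesis: it assumes cosine distance as the change measure and Spearman's rank correlation as the correlation measure (the abstract names neither), and every function name, word and number below is hypothetical.

```python
import math


def cosine_distance(u, v):
    """Return 1 - cos(u, v) for two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)


def average_ranks(values):
    """Rank values from 1 to n; tied values share their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1  # positions i..j are 0-based
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation of two equal-length, non-constant sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    std_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    std_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (std_x * std_y)


def spearman(x, y):
    """Spearman correlation: Pearson correlation of the rank vectors."""
    return pearson(average_ranks(x), average_ranks(y))


def frequency_bias(vectors_t1, vectors_t2, frequencies, targets):
    """Correlate target-word frequency with measured change.

    vectors_t1 and vectors_t2 map words to already aligned vectors
    (assumption: alignment was done beforehand by VI, OP or WI).
    """
    change = [cosine_distance(vectors_t1[w], vectors_t2[w]) for w in targets]
    freqs = [frequencies[w] for w in targets]
    return spearman(freqs, change)


# Hypothetical toy data: the more frequent a word, the larger its change.
toy_targets = ["word_a", "word_b", "word_c"]
toy_t1 = {"word_a": [1.0, 0.0], "word_b": [1.0, 0.0], "word_c": [1.0, 0.0]}
toy_t2 = {"word_a": [1.0, 0.0], "word_b": [1.0, 1.0], "word_c": [0.0, 1.0]}
toy_freq = {"word_a": 10, "word_b": 100, "word_c": 1000}

print(frequency_bias(toy_t1, toy_t2, toy_freq, toy_targets))
```

In the toy data the change scores rise with frequency, so the correlation is close to 1. A value near 0 corresponds to what the abstract reports for OP and WI, and a value that grows with dimensionality corresponds to the frequency bias reported for VI.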

Department(s): Universität Stuttgart, Institut für Maschinelle Sprachverarbeitung
Supervisors: Schulte im Walde, Prof. Sabine; Schlechtweg, Dominik; Papay
Entry date: 18 January 2021