Master Thesis MSTR-2020-43

Bibliography: Schuff, Hendrik: Explainable question answering beyond F1: metrics, models and human evaluation.
University of Stuttgart, Faculty of Computer Science, Electrical Engineering, and Information Technology, Master Thesis No. 43 (2020).
109 pages, English.

Explainable question answering systems not only predict an answer, but also provide an explanation of why it was selected. Current work predominantly focuses on the evaluation and development of new models around established metrics such as F1. We argue that this constitutes a distorted incentive and limits the exploration of explainability methods, as the ultimate measure of performance should not be F1 but the value that the system adds for a human.

In this thesis, we analyze two baseline models trained on the HotpotQA data set, which provides explanations in the form of a selection of supporting facts from Wikipedia articles. We identify two weaknesses: (i) the models predict facts to be irrelevant but still include them in their answer and (ii) the models do not use facts for answering the question although they report them to be relevant. Based on these shortcomings, we propose two methods to quantify how strongly a system's answer is coupled to its explanation, based on (i) how robust the system's answer prediction is against the removal of facts it predicts to be (ir)relevant and (ii) the location of the answer span. In order to address the identified weaknesses, we present (i) a novel neural network architecture that guarantees that no facts which are predicted to be irrelevant are used in the answer prediction, (ii) a post-hoc heuristic that reduces the number of unused facts and (iii) a regularization term that explicitly couples the prediction of answer and explanation.

We show that our methods improve performance on our proposed metrics and assess them within an online study. Even though our methods only reach slight improvements on standard metrics, they all improve various human measures such as decision correctness and certainty, supporting our claim that F1 alone is not suited to evaluate explainability. The regularized model even surpasses the ground-truth condition regarding helpfulness and certainty. We analyze how strongly different metrics are linked to human measures and find that our metrics outperform all evaluated standard metrics, suggesting they provide a valuable addition to automated explainability evaluation.
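The first coupling metric described in the abstract can be illustrated with a minimal sketch: if a model's answer changes when we remove a fact the model itself predicted to be irrelevant, answer and explanation are weakly coupled. The function below is a hypothetical illustration, not the thesis's actual implementation; `predict` stands in for any QA model mapping a question and a list of facts to an answer string, and `relevance` for the model's own per-fact relevance predictions.

```python
def coupling_violations(predict, question, facts, relevance):
    """Fraction of facts predicted irrelevant whose removal nevertheless
    changes the model's answer (a well-coupled model scores 0.0)."""
    baseline = predict(question, facts)
    violations = 0
    checked = 0
    for i, relevant in enumerate(relevance):
        if relevant:
            continue  # only probe facts the model marked irrelevant
        reduced = facts[:i] + facts[i + 1:]  # drop the i-th fact
        checked += 1
        if predict(question, reduced) != baseline:
            violations += 1
    return violations / checked if checked else 0.0
```

With a toy keyword-matching model, removing a truly irrelevant fact leaves the answer unchanged (score 0.0), while a model whose relevance predictions contradict its answering behavior is penalized.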

Department(s): University of Stuttgart, Institute for Natural Language Processing
Supervisor(s): Vu, Prof. Thang; Adel, Dr. Heike
Entry date: March 3, 2021