Kurzfassung

The increasing importance of Information Retrieval (IR) for handling large datasets has exposed significant limitations in traditional keyword-based search systems. Such systems, Confluence among them, often cause user dissatisfaction because they rely on exact matching and lack semantic search. In response, we present a Retrieval-Augmented Generation (RAG) system for Confluence: a chatbot-based solution that ingests Confluence data into a vector database, retrieves semantically relevant content, and uses a Large Language Model (LLM) to generate user-friendly responses. This study aims to evaluate user satisfaction with the existing Confluence search engine, identify the most suitable embedding model and LLM for the proposed RAG solution, establish an effective method for comparing keyword-based and RAG-based search engines, and assess the performance of the newly implemented solution.

To achieve these objectives, we conducted a case study within a German company. We began with semi-structured interviews to gather user experiences and assess the need for alternatives to the existing Confluence search engine. Based on this feedback, we implemented a RAG system, introduced a novel data chunking approach, and ran performance benchmarks of embedding models and LLMs using an innovative scoring system. Additionally, we applied a novel evaluation framework to compare keyword-based search engines with RAG systems. The framework is based on the Click-Through Rate (CTR) of search results and the positions of the equivalent RAG results, which we then evaluated with the Mann-Whitney U test. Lastly, to triangulate our findings, we conducted surveys and unstructured interviews with users of the RAG system.

Our results demonstrate that users are indeed dissatisfied with the Confluence search engine and expect semantic search functionality. We also found that, for our RAG system, the most suitable embedding model is jinaai/jina-embeddings-v2-base-de and the most suitable LLM is Llama 3.1. Statistical tests showed that our RAG system outperforms the Confluence search engine in IR; the survey and unstructured follow-up interviews substantiate this finding. We conclude that semantic search, summarization capabilities, and the provision of sources alongside responses contributed to significantly higher user satisfaction. Despite these improvements, users familiar with keyword-based search engines still expressed a preference for them. We also found that the retrieval-only component performs better than the full RAG pipeline. Our findings further indicate that monolingual embedding models perform better than multilingual ones on German data.

We further conclude that there is no universal evaluation method for comparing chatbot-based and keyword-based search systems; instead, evaluation metrics must be tailored to the specific application and user needs. We therefore highlight the potential of our evaluation framework as a promising approach for comparing chatbots with keyword-based search engines. Further research should assess its generalizability and applicability across different contexts through controlled experiments or additional case studies.
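The position-based comparison described above can be illustrated with a minimal sketch. Assuming two samples are available per query set, the rank of the clicked Confluence result (derived from CTR logs) and the rank at which the RAG system surfaced the equivalent source, the Mann-Whitney U test checks whether the RAG positions are significantly lower, i.e. better. All variable names and values below are hypothetical and are not taken from the study.

```python
# Minimal sketch of the position-based evaluation, assuming hypothetical data.
from scipy.stats import mannwhitneyu

# Rank of the clicked result per Confluence query (hypothetical CTR-log values)
confluence_click_positions = [1, 3, 2, 5, 4, 2, 6, 1, 3, 7]
# Position of the equivalent source returned by the RAG system (hypothetical)
rag_source_positions = [1, 1, 2, 1, 3, 1, 2, 1, 1, 2]

# One-sided Mann-Whitney U test: are RAG positions stochastically lower
# (i.e., the relevant document appears earlier) than Confluence positions?
statistic, p_value = mannwhitneyu(
    rag_source_positions,
    confluence_click_positions,
    alternative="less",
)
print(f"U = {statistic}, p-value = {p_value:.4f}")
```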