Technical Report TR-2010-06

Bibliography: Hoffmann, Benjamin: Comparison of Standard and Zipf-Based Document Retrieval Heuristics.
University of Stuttgart, Faculty of Computer Science, Electrical Engineering, and Information Technology, Technical Report Computer Science No. 2010/06.
17 pages, English.
CR-Schema: H.3.1 (Content Analysis and Indexing)
H.3.3 (Information Search and Retrieval)
H.3.4 (Information Storage and Retrieval Systems and Software)
Keywords: Information Retrieval; Zipf Model
Abstract

Document retrieval is the task of retrieving, from a possibly huge collection of documents, those that are most similar to a given query document. In this paper, we present a new heuristic for inexact top-K retrieval. It is similar to the well-known index elimination heuristic and is based on Zipf's law, a statistical law observable in natural language texts. We compare the two heuristics with regard to retrieval performance and execution time. For this purpose, we use a text collection consisting of scientific articles from various computer science conferences and journals. It turns out that our new approach is not better than index elimination. Interestingly, a combination of both heuristics yields the best results.

Full text and other links: PDF (190297 Bytes)
Department(s): University of Stuttgart, Institute of Formal Methods in Computer Science, Theoretical Computer Science
Entry date: September 15, 2010