Master Thesis MSTR-2008-12

Bibliography: Ortiz, Maria Mera: Correlation Measures for Text Analysis Results.
University of Stuttgart, Faculty of Computer Science, Electrical Engineering, and Information Technology, Master Thesis No. 12 (2008).
112 pages, English.

Harnessing unstructured, textual data is becoming increasingly important for scenarios such as quality early warning, company reputation management, or proactive customer churn detection. Text analysis technologies such as Information Extraction can be used to extract information from emails, call center transcripts, technician comments, or customer forums, which can then be used within business intelligence applications.

In this thesis, an approach is presented and implemented that combines text analysis with association rule mining, in the context of a quality early warning scenario. The text analysis approach is focused on extracting syntactical entities like noun phrases, which requires less manual effort than domain-tailored text analysis. Subsequent association rule mining is used to detect the relevant entities found during text analysis, and relate them to structured information, for example, certain car models. As part of the thesis, different correlation measures are evaluated which gauge the interestingness of the association rules.
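As an illustration of what gauging the interestingness of an association rule can look like, the following minimal Python sketch computes three commonly used measures (support, confidence, and lift) for a rule of the form "noun phrase -> car model". The abstract does not state which correlation measures the thesis evaluates, so the choice of measures here is an assumption, and the complaint records, phrase names, model names, and the `rule_measures` helper are all hypothetical.

```python
# Hypothetical toy data: (car model, noun phrases extracted from the complaint text).
# These records are invented for illustration and are not taken from the thesis.
complaints = [
    ("ModelA", {"brake pedal", "warning light"}),
    ("ModelA", {"brake pedal"}),
    ("ModelA", {"fuel pump"}),
    ("ModelB", {"fuel pump"}),
    ("ModelB", {"warning light"}),
    ("ModelB", {"fuel pump", "warning light"}),
    ("ModelC", {"brake pedal"}),
    ("ModelC", {"air bag"}),
]


def rule_measures(records, phrase, model):
    """Return (support, confidence, lift) for the rule phrase -> model.

    Returns None when the phrase or the model never occurs, because
    confidence and lift are undefined in that case.
    """
    n = len(records)
    n_phrase = sum(1 for _, phrases in records if phrase in phrases)
    n_model = sum(1 for m, _ in records if m == model)
    n_both = sum(1 for m, phrases in records if m == model and phrase in phrases)
    if n == 0 or n_phrase == 0 or n_model == 0:
        return None
    support = n_both / n              # share of complaints with phrase and model
    confidence = n_both / n_phrase    # share of phrase complaints that hit the model
    lift = confidence / (n_model / n) # > 1: phrase and model co-occur more than chance
    return support, confidence, lift


result = rule_measures(complaints, "brake pedal", "ModelA")
if result is not None:
    support, confidence, lift = result
    print(f"support={support:.3f} confidence={confidence:.3f} lift={lift:.3f}")
```

A lift above 1 suggests that the phrase appears with the model more often than independence would predict, which is the kind of signal a business analyst could use to rank candidate rules.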

The quality early warning scenario is based on over 500,000 publicly available vehicle complaints from the US National Highway Traffic Safety Administration. The scenario shows how IBM InfoSphere Warehouse can extract relevant information from complaint descriptions about car models, which allows business analysts to investigate potential causes of problems for certain vehicles. In this example, the Quality Early Warning application is built as a set of reports within IBM Cognos 8 BI server. As the results of both text analysis and subsequent data mining are stored in relational tables, the application could be implemented in any BI tool or as a custom application.

To evaluate the quality of the approach, the thesis contains a benchmark assessment against IBM Content Analyzer, an IBM product that attempts to cover the same usage scenarios as the approach taken in this thesis.

Department(s): University of Stuttgart, Institute of Parallel and Distributed Systems, Applications of Parallel and Distributed Systems
Supervisor(s): Mitschang, Prof. Bernhard; Lang, Alexander
Entry date: April 20, 2023