Bachelor Thesis BCLR-2020-37

Bibliography: Berg, Jan: Improve content extraction in web pages for browser reader modes.
University of Stuttgart, Faculty of Computer Science, Electrical Engineering, and Information Technology, Bachelor Thesis No. 37 (2020).
41 pages, English.
Abstract

Web content extraction is the process of extracting specific information from websites with the help of an algorithm. It is used in a variety of applications. Search engines use it to find the relevant information on a website, which helps them index it. Browser reader modes improve the user experience by showing only the main content of a website and removing noise such as advertisements and navigational elements. The problem with main content extraction is that it has no perfect solution: algorithms try to guess the important content of a website and do not always succeed. The most widely used main content extraction algorithms today analyze the underlying HTML structure of the website using hand-tuned heuristics such as word count and the HTML tags used. They do not consider other aspects such as the position and size of elements. In this work we try to improve the accuracy of the main content extraction algorithms currently in use with the help of visual features such as the position and size of elements. To evaluate the results, we implemented two versions of a main content extraction algorithm as a plugin for the Chromium web browser. The first version uses only heuristics based on features that can be read directly from the HTML source file. The second version additionally takes the styling of the website into account, which requires parsing the HTML and CSS files of the website. In our measurements, the visual-based algorithm achieved a higher accuracy than the HTML-only algorithm (80.1% instead of 73.2%).
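
The contrast between the two versions can be illustrated with a minimal sketch. It is not the thesis implementation: the `Block` fields, the tag weights, the scoring formulas, and the 1280-pixel viewport are all hypothetical values chosen for illustration. The sketch only shows how a geometry-based bonus (horizontal position and width) might change which candidate block is picked as main content.

```python
from dataclasses import dataclass


@dataclass
class Block:
    """A candidate page block; all fields are illustrative assumptions."""
    name: str
    tag: str
    word_count: int
    x: float       # left edge in pixels
    width: float   # rendered width in pixels


# Hypothetical hand-tuned tag weights, not taken from the thesis.
TAG_WEIGHTS = {"article": 2.0, "main": 2.0, "p": 1.0, "div": 0.5,
               "nav": -2.0, "aside": -2.0, "footer": -2.0}


def text_score(block: Block) -> float:
    """Score from HTML-only features: tag weight plus a capped word-count term."""
    return TAG_WEIGHTS.get(block.tag, 0.0) + min(block.word_count / 100, 3.0)


def visual_bonus(block: Block, viewport_width: float) -> float:
    """Bonus from rendered geometry: wide, horizontally centered blocks score higher."""
    width_share = min(block.width / viewport_width, 1.0)
    half = viewport_width / 2
    offset = abs(block.x + block.width / 2 - half)
    centered = 1.0 - min(offset / half, 1.0)
    return width_share + centered


def pick_main(blocks, use_visual: bool, viewport_width: float = 1280.0) -> Block:
    """Return the highest-scoring block, optionally adding the visual bonus."""
    def score(block: Block) -> float:
        total = text_score(block)
        if use_visual:
            total += visual_bonus(block, viewport_width)
        return total
    return max(blocks, key=score)


blocks = [
    Block("sidebar", "div", word_count=300, x=0, width=200),
    Block("article", "div", word_count=250, x=300, width=680),
]
print(pick_main(blocks, use_visual=False).name)
print(pick_main(blocks, use_visual=True).name)
```

In this made-up page, a link-heavy sidebar contains more words than the article, so the HTML-only score prefers it; once width and horizontal centering are added, the article scores higher.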

Full text and other links: Full text
Department(s): University of Stuttgart, Institute of Architecture of Application Systems
Supervisor(s): Aiello, Prof. Marco
Entry date: November 12, 2020