Bachelor's Thesis BCLR-2020-37

Berg, Jan: Improve content extraction in web pages for browser reader modes.
University of Stuttgart, Faculty of Computer Science, Electrical Engineering and Information Technology, Bachelor's Thesis No. 37 (2020).
41 pages, English.

Web content extraction is the process of extracting specific information from websites with the help of an algorithm. It is used in a variety of applications. Search engines use it to find the relevant information on a website for indexing. Browser reader modes improve the user experience by showing only the main content of a website and removing noise such as advertisements and navigational elements. The problem with main content extraction is that there is no perfect solution. Algorithms try to guess the important content of a website and do not always succeed. The most widely used main content extraction algorithms today analyze the underlying HTML structure of the website using hand-tuned heuristics such as word count and the HTML tags used. They do not consider other aspects such as the position and size of elements. In this work we try to improve the accuracy of the main content extraction algorithms currently in use with the help of visual features such as the position and size of elements. To evaluate the results, we implemented two versions of a main content extraction algorithm as a plugin for the Chromium web browser. The first version uses only heuristics based on features that can be read directly from the HTML source file. The second version additionally takes the styling of the website into account, which requires parsing the website's HTML and CSS files. In our measurements, the visual-based algorithm achieved a higher accuracy than the HTML-only algorithm (80.1% instead of 73.2%).
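
The difference between the two versions can be illustrated with a minimal sketch. It is not the algorithm from the thesis: the `Block` type, the tag weights, and the scoring constants are all hypothetical and chosen only to show how adding position and size can change which candidate is selected as main content.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Block:
    """A candidate element. All fields are illustrative assumptions."""
    name: str
    tag: str
    word_count: int
    # Visual features are only known after CSS layout; None in HTML-only mode.
    left: Optional[float] = None
    width: Optional[float] = None
    height: Optional[float] = None


# Hypothetical hand-tuned weights, not taken from the thesis.
TAG_WEIGHTS = {"article": 25.0, "main": 25.0, "p": 5.0,
               "nav": -25.0, "aside": -25.0, "footer": -25.0}


def html_score(block: Block) -> float:
    """Score from features readable directly from the HTML source."""
    return TAG_WEIGHTS.get(block.tag, 0.0) + min(block.word_count, 300) / 10.0


def visual_score(block: Block, page_width: float, page_height: float) -> float:
    """Score from size and horizontal position; 0 if layout is unknown."""
    if block.left is None or block.width is None or block.height is None:
        return 0.0
    area_share = (block.width * block.height) / (page_width * page_height)
    center_x = block.left + block.width / 2
    # 0.0 when horizontally centered, 1.0 at the page edge.
    offset = abs(center_x - page_width / 2) / (page_width / 2)
    return 20.0 * area_share + 10.0 * (1.0 - offset)


def pick_main(blocks: List[Block], page_width: float, page_height: float,
              use_visual: bool) -> Block:
    """Return the highest-scoring candidate block."""
    def total(block: Block) -> float:
        score = html_score(block)
        if use_visual:
            score += visual_score(block, page_width, page_height)
        return score
    return max(blocks, key=total)


# A wordy but narrow sidebar versus a large, centered content column.
sidebar = Block("sidebar", "div", 300, left=1000, width=200, height=3000)
content = Block("content", "div", 250, left=200, width=800, height=2500)

html_only = pick_main([sidebar, content], 1200, 3000, use_visual=False)
with_visual = pick_main([sidebar, content], 1200, 3000, use_visual=True)
print(html_only.name, with_visual.name)
```

In this constructed example the sidebar wins on word count alone, while the centered, larger column wins once the visual features are included. In a browser plugin the layout values would come from the rendered page rather than being supplied by hand.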

Department(s): University of Stuttgart, Institute of Architecture of Application Systems
Supervisor: Aiello, Prof. Marco
Entry date: November 12, 2020