Boilerplate Detection using Shallow Text Features

Paper

Christian Kohlschütter, Peter Fankhauser and Wolfgang Nejdl,
Boilerplate Detection using Shallow Text Features,
WSDM 2010 -- The Third ACM International Conference on Web Search and Data Mining, New York City, NY, USA.

Download PDF

ABSTRACT. In addition to the actual content, Web pages consist of navigational elements, templates, and advertisements. This boilerplate text typically is not related to the main content, may deteriorate search precision and thus needs to be detected properly. In this paper, we analyze a small set of shallow text features for classifying the individual text elements in a Web page. We compare the approach to complex, state-of-the-art techniques and show that competitive accuracy can be achieved, at almost no cost. Moreover, we derive a simple and plausible stochastic model for describing the boilerplate creation process. With the help of our model, we also quantify the impact of boilerplate removal on retrieval performance and show significant improvements over the baseline. Finally, we extend the principled approach by straightforward heuristics, achieving a remarkable accuracy.

WSDM2010 presentation

The slides can be found here (PDF).

L3S-GN1 dataset

The data is available online, for free but only for research purposes. Click here to access the dataset (please follow the instructions at the login prompt).

Code

Please check out Boilerpipe, the boilerplate removal library based upon the paper.

The boilerpipe library provides algorithms to detect and remove the surplus "clutter" (boilerplate, templates) around the main textual content of a website.

The library already provides specific strategies for common tasks (for example: news article extraction) and may also be easily extended for individual problem settings. Extracting content is very fast (milliseconds), needs only the input document (no global or site-level information required) and is usually quite accurate.
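
To give a feel for how a document can be classified block by block from the input document alone, here is a minimal sketch in Java. It is not Boilerpipe's code and not the classifier from the paper. It assumes two shallow features (number of words and link density per text block), and the class name, method names and both thresholds are hypothetical values chosen for illustration only.

```java
import java.util.ArrayList;
import java.util.List;

/** Illustrative sketch only; names and thresholds are assumptions, not Boilerpipe's API. */
public class ShallowFeatureSketch {

    /** One text block with its word count and the number of words inside links. */
    public static final class Block {
        public final String text;
        public final int words;
        public final int linkedWords;

        public Block(String text, int words, int linkedWords) {
            this.text = text;
            this.words = words;
            this.linkedWords = linkedWords;
        }

        /** Fraction of the block's words that sit inside anchor text. */
        public double linkDensity() {
            return words == 0 ? 0.0 : (double) linkedWords / words;
        }
    }

    // Hypothetical thresholds, picked for this example only.
    public static final int MIN_WORDS = 10;
    public static final double MAX_LINK_DENSITY = 0.33;

    /** Counts whitespace-separated tokens. */
    public static int countWords(String s) {
        String t = s.trim();
        return t.isEmpty() ? 0 : t.split("\\s+").length;
    }

    /** Builds a block from its full text and the part of that text inside links. */
    public static Block block(String text, String anchorText) {
        return new Block(text, countWords(text), countWords(anchorText));
    }

    /** A block counts as content if it is long enough and not dominated by links. */
    public static boolean isContent(Block b) {
        return b.words >= MIN_WORDS && b.linkDensity() <= MAX_LINK_DENSITY;
    }

    /** Keeps only the blocks classified as content, one per line. */
    public static String extract(List<Block> blocks) {
        StringBuilder sb = new StringBuilder();
        for (Block b : blocks) {
            if (isContent(b)) {
                sb.append(b.text).append('\n');
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        List<Block> page = new ArrayList<>();
        // Short and fully linked: typical navigation bar.
        page.add(block("Home News Sports Contact", "Home News Sports Contact"));
        // Longer and free of links: typical article text.
        page.add(block("The library detects and removes the surplus clutter "
                + "around the main textual content of a page.", ""));
        System.out.print(extract(page));
    }
}
```

Each block is judged using only the page itself, which is why no global or site-level information is needed. For real use, Boilerpipe's own extractors should be preferred over this sketch.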

Boilerpipe is a Java library written by Christian Kohlschütter. It is released under the Apache License 2.0.

Last modification: 2010-02-03 Christian Kohlschütter