Abstract:
Efficiently detecting near duplicate resources is an important
task when integrating information from various sources and applications.
Once detected, near duplicate resources can be grouped together,
merged, or removed, in order to avoid repetition and redundancy, and
to increase the diversity in the information provided to the user. In this
paper, we introduce an approach for efficient semantic-aware near duplicate
detection, by combining an indexing scheme for similarity search
with the RDF representations of the resources. We provide a probabilistic
analysis for the correctness of the suggested approach, which allows applications
to configure it to satisfy their specific quality requirements.
Our experimental evaluation on the RDF descriptions of real-world news
articles from various news agencies demonstrates the efficiency and effectiveness
of our approach.
To appear in: ESWC 2010