Abstract:
XML clustering finds many applications, ranging
from storage to query processing. However, existing
clustering algorithms focus on static XML collections, whereas
modern information systems frequently deal with streaming XML data
that needs to be processed online. Streaming XML clustering is a
challenging task because of the high computational and
space efficiency requirements implicated for online
approaches. In this paper we propose XStreamCluster,
which addresses the two challenges
using a two-layered optimization. The bottom layer employs Bloom
filters to encode the XML documents,
providing a space-efficient solution to memory usage. The top layer
is based on Locality Sensitive Hashing and contributes to the
computational efficiency. The theoretical analysis shows that the
approximate solution of XStreamCluster generates similarly good
clusters as the exact solution, with high probability. The
experimental results demonstrate that XStreamCluster improves both
memory efficiency and computational time by at least an order of
magnitude without affecting clustering quality, compared to its
variants and a baseline approach.
To appear in: DASFAA 2011, full paper