Abstract:
Bloom filters are extensively used in distributed applications,
especially in distributed databases and distributed information
systems, to reduce network requirements and to increase performance.
In this work, we propose two novel Bloom filter features that are
important for distributed databases and information systems. First,
we present a new approach to encode a Bloom filter such that its
length can be adapted to the cardinality of the set it represents,
with negligible overhead with respect to computation and false
positive probability. The proposed encoding allows for significant
network savings in distributed databases, as it enables the
participating nodes to optimize the length of each Bloom filter
before sending it over the network, for example, when executing
Bloom joins. Second, we show how to estimate the number of
distinct elements in a Bloom filter, for situations where the
represented set is not materialized. These situations frequently
arise in distributed databases, where estimating the cardinality of
the represented sets is necessary for constructing an efficient
query plan. The estimation is highly accurate and comes with tight
probabilistic bounds. For both features we provide a thorough
probabilistic analysis and extensive experimental evaluation which
confirm the effectiveness of our approaches.
@article {springerlink:10.1007/s10619-010-7067-2,