[commits] [tarantool] 03/04: bloom: optimize tuple bloom filter size

Wed Mar 28 00:09:38 MSK 2018

* Vladimir Davydov <vdavydov.dev at gmail.com> [18/03/21 16:36]:
> 

Please add references to the papers you have been using to
construct the hash functions and calculate the theoretical false
positive rate. 

In future, It would be nice to actually have stats about actual false
positive rates so that we know if the papers are BS or not.

Otherwise this part is OK.

>     bloom: optimize tuple bloom filter size
>     
>     When we check if a multi-part key is hashed in a bloom filter, we check
>     all its sub keys as well so the resulting false positive rate will be
>     equal to the product of multiplication of false positive rates of bloom
>     filters created for each sub key.
>     
>     The false positive rate of a bloom filter is given by the formula:
>     
>       f = (1 - exp(-kn/m)) ^ k
>     
>     where m is the number of bits in the bloom filter, k is the number of
>     hash functions, and n is the number of elements hashed in the filter.
>     By varying n, we can estimate the false positive rate of an existing
>     bloom filter when used for a greater number of elements, in other words
>     we can estimate the false positive rate of a bloom filter created for
>     checking sub keys when used for checking full keys.
>     
>     Knowing this, we can adjust the target false positive rate of a bloom
>     filter used for checking keys of a particular length based on false
>     positive rates of bloom filters used for checking its sub keys. This
>     will reduce the number of hash functions required to conform to the
>     configured false positive rate and hence the bloom filter size.
>     
>     Follow-up #3177
> ---
>  src/box/tuple_bloom.c     | 14 +++++++++++++-
>  src/lib/salad/bloom.c     | 13 +++++++++++++
>  src/lib/salad/bloom.h     |  9 +++++++++
>  test/vinyl/bloom.result   | 18 ++++++++++++++++++
>  test/vinyl/bloom.test.lua | 17 +++++++++++++++++
>  5 files changed, 70 insertions(+), 1 deletion(-)
> 
> diff --git a/src/box/tuple_bloom.c b/src/box/tuple_bloom.c
> index 65e9827..6141a9b 100644
> --- a/src/box/tuple_bloom.c
> +++ b/src/box/tuple_bloom.c
> @@ -122,8 +122,20 @@ tuple_bloom_new(struct tuple_bloom_builder *builder, double fpr)
>  	for (uint32_t i = 0; i < part_count; i++) {
>  		struct tuple_hash_array *hash_arr = &builder->parts[i];
>  		uint32_t count = hash_arr->count;
> +		/*
> +		 * When we check if a key is stored in a bloom
> +		 * filter, we check all its sub keys as well,
> +		 * which reduces the resulting false positive
> +		 * rate. Take this into account and adjust fpr
> +		 * accordingly when constructing a bloom filter
> +		 * for keys of a higher rank.
> +		 */
> +		double part_fpr = fpr;
> +		for (uint32_t j = 0; j < i; j++)
> +			part_fpr /= bloom_fpr(&bloom->parts[j], count);
> +		part_fpr = MIN(part_fpr, 0.5);
>  		if (bloom_create(&bloom->parts[i], count,
> -				 fpr, runtime.quota) != 0) {
> +				 part_fpr, runtime.quota) != 0) {
>  			diag_set(OutOfMemory, 0, "bloom_create",
>  				 "tuple bloom part");
>  			for (uint32_t j = 0; j < i; j++)
> diff --git a/src/lib/salad/bloom.c b/src/lib/salad/bloom.c
> index dec9d2a..c38265d 100644
> --- a/src/lib/salad/bloom.c
> +++ b/src/lib/salad/bloom.c
> @@ -73,6 +73,19 @@ bloom_destroy(struct bloom *bloom, struct quota *quota)
>  	free(bloom->table);
>  }
>  
> +double
> +bloom_fpr(const struct bloom *bloom, uint32_t number_of_values)
> +{
> +	/* Number of hash functions. */
> +	uint16_t k = bloom->hash_count;
> +	/* Number of bits. */
> +	uint64_t m = bloom->table_size * sizeof(struct bloom_block) * CHAR_BIT;
> +	/* Number of elements. */
> +	uint32_t n = number_of_values;
> +	/* False positive rate. */
> +	return pow(1 - exp((double) -k * n / m), k);
> +}
> +
>  size_t
>  bloom_store_size(const struct bloom *bloom)
>  {
> diff --git a/src/lib/salad/bloom.h b/src/lib/salad/bloom.h
> index e28ca5f..8efcd8f 100644
> --- a/src/lib/salad/bloom.h
> +++ b/src/lib/salad/bloom.h
> @@ -126,6 +126,15 @@ static bool
>  bloom_possible_has(const struct bloom *bloom, bloom_hash_t hash);
>  
>  /**
> + * Return the expected false positive rate of a bloom filter.
> + * @param bloom - the bloom filter
> + * @param number_of_values - number of values stored in the filter
> + * @return - expected false positive rate
> + */
> +double
> +bloom_fpr(const struct bloom *bloom, uint32_t number_of_values);
> +
> +/**
>   * Calculate size of a buffer that is needed for storing bloom table
>   * @param bloom - the bloom filter to store
>   * @return - Exact size
> diff --git a/test/vinyl/bloom.result b/test/vinyl/bloom.result
> index 79ed06f..841a181 100644
> --- a/test/vinyl/bloom.result
> +++ b/test/vinyl/bloom.result
> @@ -73,6 +73,24 @@ box.snapshot()
>  ---
>  - ok
>  ...
> +--
> +-- There are 1000 unique tuples in the index. The cardinality of the
> +-- first key part is 100, of the first two key parts is 500, of the
> +-- first three key parts is 1000. With the default bloom fpr of 0.05,
> +-- we use 5 hash functions or 5 / ln(2) ~= 7.3 bits per tuple.If we
> +-- allocated a full sized bloom filter per each sub key, we would need
> +-- to allocate at least (100 + 500 + 1000 + 1000) * 7 bits or 2275
> +-- bytes. However, since we adjust the fpr of bloom filters of higher
> +-- ranks (because a full key lookup checks all its sub keys as well),
> +-- we use 5, 4, 3, and 1 hash functions for each sub key respectively.
> +-- This leaves us only (100*5 + 500*4 + 1000*3 + 1000*1) / ln(2) bits
> +-- or 1172 bytes, and after rounding up to the block size (128 byte)
> +-- we have 1280 bytes plus the header overhead.
> +--
> +s.index.pk:info().disk.bloom_size
> +---
> +- 1303
> +...
>  _ = new_reflects()
>  ---
>  ...
> diff --git a/test/vinyl/bloom.test.lua b/test/vinyl/bloom.test.lua
> index 14790da..3f80f20 100644
> --- a/test/vinyl/bloom.test.lua
> +++ b/test/vinyl/bloom.test.lua
> @@ -29,6 +29,23 @@ function new_seeks() local o = seeks seeks = cur_seeks() return seeks - o end
>  
>  for i = 1, 1000 do s:replace{math.ceil(i / 10), math.ceil(i / 2), i, i * 2} end
>  box.snapshot()
> +
> +--
> +-- There are 1000 unique tuples in the index. The cardinality of the
> +-- first key part is 100, of the first two key parts is 500, of the
> +-- first three key parts is 1000. With the default bloom fpr of 0.05,
> +-- we use 5 hash functions or 5 / ln(2) ~= 7.3 bits per tuple.If we
> +-- allocated a full sized bloom filter per each sub key, we would need
> +-- to allocate at least (100 + 500 + 1000 + 1000) * 7 bits or 2275
> +-- bytes. However, since we adjust the fpr of bloom filters of higher
> +-- ranks (because a full key lookup checks all its sub keys as well),
> +-- we use 5, 4, 3, and 1 hash functions for each sub key respectively.
> +-- This leaves us only (100*5 + 500*4 + 1000*3 + 1000*1) / ln(2) bits
> +-- or 1172 bytes, and after rounding up to the block size (128 byte)
> +-- we have 1280 bytes plus the header overhead.
> +--
> +s.index.pk:info().disk.bloom_size
> +
>  _ = new_reflects()
>  _ = new_seeks()
>  
> 
> -- 
> To stop receiving notification emails like this one, please contact
> the administrator of this repository.

-- 
Konstantin Osipov, Moscow, Russia, +7 903 626 22 32
http://tarantool.io - www.twitter.com/kostja_osipov