From mboxrd@z Thu Jan  1 00:00:00 1970
Return-Path: <vdavydov.dev@gmail.com>
From: Vladimir Davydov <vdavydov.dev@gmail.com>
Subject: [PATCH v2 4/5] bloom: optimize tuple bloom filter size
Date: Wed, 28 Mar 2018 22:05:00 +0300
Message-Id: <cb59c0989d0c94167e7bd36e3468ea2d30d838ef.1522262496.git.vdavydov.dev@gmail.com>
In-Reply-To: <cover.1522262496.git.vdavydov.dev@gmail.com>
References: <cover.1522262496.git.vdavydov.dev@gmail.com>
In-Reply-To: <cover.1522262496.git.vdavydov.dev@gmail.com>
References: <cover.1522262496.git.vdavydov.dev@gmail.com>
To: kostja@tarantool.org
Cc: tarantool-patches@freelists.org
List-ID: <tarantool-patches.dev.tarantool.org>

When we check if a multi-part key is hashed in a bloom filter, we check
all its sub keys as well so the resulting false positive rate will be
equal to the product of multiplication of false positive rates of bloom
filters created for each sub key.

The false positive rate of a bloom filter is given by the formula:

  f = (1 - exp(-kn/m)) ^ k

where m is the number of bits in the bloom filter, k is the number of
hash functions, and n is the number of elements hashed in the filter.
By varying n, we can estimate the false positive rate of an existing
bloom filter when used for a greater number of elements, in other words
we can estimate the false positive rate of a bloom filter created for
checking sub keys when used for checking full keys.

Knowing this, we can adjust the target false positive rate of a bloom
filter used for checking keys of a particular length based on false
positive rates of bloom filters used for checking its sub keys. This
will reduce the number of hash functions required to conform to the
configured false positive rate and hence the bloom filter size.

Follow-up #3177
---
 src/box/tuple_bloom.c     | 14 +++++++++++++-
 src/lib/salad/bloom.c     | 13 +++++++++++++
 src/lib/salad/bloom.h     |  9 +++++++++
 test/vinyl/bloom.result   | 18 ++++++++++++++++++
 test/vinyl/bloom.test.lua | 17 +++++++++++++++++
 5 files changed, 70 insertions(+), 1 deletion(-)

diff --git a/src/box/tuple_bloom.c b/src/box/tuple_bloom.c
index 7e30a284..ffad1515 100644
--- a/src/box/tuple_bloom.c
+++ b/src/box/tuple_bloom.c
@@ -132,8 +132,20 @@ tuple_bloom_new(struct tuple_bloom_builder *builder, double fpr)
 	for (uint32_t i = 0; i < part_count; i++) {
 		struct tuple_hash_array *hash_arr = &builder->parts[i];
 		uint32_t count = hash_arr->count;
+		/*
+		 * When we check if a key is stored in a bloom
+		 * filter, we check all its sub keys as well,
+		 * which reduces the resulting false positive
+		 * rate. Take this into account and adjust fpr
+		 * accordingly when constructing a bloom filter
+		 * for keys of a higher rank.
+		 */
+		double part_fpr = fpr;
+		for (uint32_t j = 0; j < i; j++)
+			part_fpr /= bloom_fpr(&bloom->parts[j], count);
+		part_fpr = MIN(part_fpr, 0.5);
 		if (bloom_create(&bloom->parts[i], count,
-				 fpr, runtime.quota) != 0) {
+				 part_fpr, runtime.quota) != 0) {
 			diag_set(OutOfMemory, 0, "bloom_create",
 				 "tuple bloom part");
 			tuple_bloom_delete(bloom);
diff --git a/src/lib/salad/bloom.c b/src/lib/salad/bloom.c
index a33d2177..09683fb8 100644
--- a/src/lib/salad/bloom.c
+++ b/src/lib/salad/bloom.c
@@ -70,6 +70,19 @@ bloom_destroy(struct bloom *bloom, struct quota *quota)
 	free(bloom->table);
 }
 
+double
+bloom_fpr(const struct bloom *bloom, uint32_t number_of_values)
+{
+	/* Number of hash functions. */
+	uint16_t k = bloom->hash_count;
+	/* Number of bits. */
+	uint64_t m = bloom->table_size * sizeof(struct bloom_block) * CHAR_BIT;
+	/* Number of elements. */
+	uint32_t n = number_of_values;
+	/* False positive rate. */
+	return pow(1 - exp((double) -k * n / m), k);
+}
+
 size_t
 bloom_store_size(const struct bloom *bloom)
 {
diff --git a/src/lib/salad/bloom.h b/src/lib/salad/bloom.h
index 1e3221f8..81d3480e 100644
--- a/src/lib/salad/bloom.h
+++ b/src/lib/salad/bloom.h
@@ -126,6 +126,15 @@ static bool
 bloom_maybe_has(const struct bloom *bloom, bloom_hash_t hash);
 
 /**
+ * Return the expected false positive rate of a bloom filter.
+ * @param bloom - the bloom filter
+ * @param number_of_values - number of values stored in the filter
+ * @return - expected false positive rate
+ */
+double
+bloom_fpr(const struct bloom *bloom, uint32_t number_of_values);
+
+/**
  * Calculate size of a buffer that is needed for storing bloom table
  * @param bloom - the bloom filter to store
  * @return - Exact size
diff --git a/test/vinyl/bloom.result b/test/vinyl/bloom.result
index 79ed06f9..841a181d 100644
--- a/test/vinyl/bloom.result
+++ b/test/vinyl/bloom.result
@@ -73,6 +73,24 @@ box.snapshot()
 ---
 - ok
 ...
+--
+-- There are 1000 unique tuples in the index. The cardinality of the
+-- first key part is 100, of the first two key parts is 500, of the
+-- first three key parts is 1000. With the default bloom fpr of 0.05,
+-- we use 5 hash functions or 5 / ln(2) ~= 7.3 bits per tuple.If we
+-- allocated a full sized bloom filter per each sub key, we would need
+-- to allocate at least (100 + 500 + 1000 + 1000) * 7 bits or 2275
+-- bytes. However, since we adjust the fpr of bloom filters of higher
+-- ranks (because a full key lookup checks all its sub keys as well),
+-- we use 5, 4, 3, and 1 hash functions for each sub key respectively.
+-- This leaves us only (100*5 + 500*4 + 1000*3 + 1000*1) / ln(2) bits
+-- or 1172 bytes, and after rounding up to the block size (128 byte)
+-- we have 1280 bytes plus the header overhead.
+--
+s.index.pk:info().disk.bloom_size
+---
+- 1303
+...
 _ = new_reflects()
 ---
 ...
diff --git a/test/vinyl/bloom.test.lua b/test/vinyl/bloom.test.lua
index 14790dab..3f80f205 100644
--- a/test/vinyl/bloom.test.lua
+++ b/test/vinyl/bloom.test.lua
@@ -29,6 +29,23 @@ function new_seeks() local o = seeks seeks = cur_seeks() return seeks - o end
 
 for i = 1, 1000 do s:replace{math.ceil(i / 10), math.ceil(i / 2), i, i * 2} end
 box.snapshot()
+
+--
+-- There are 1000 unique tuples in the index. The cardinality of the
+-- first key part is 100, of the first two key parts is 500, of the
+-- first three key parts is 1000. With the default bloom fpr of 0.05,
+-- we use 5 hash functions or 5 / ln(2) ~= 7.3 bits per tuple.If we
+-- allocated a full sized bloom filter per each sub key, we would need
+-- to allocate at least (100 + 500 + 1000 + 1000) * 7 bits or 2275
+-- bytes. However, since we adjust the fpr of bloom filters of higher
+-- ranks (because a full key lookup checks all its sub keys as well),
+-- we use 5, 4, 3, and 1 hash functions for each sub key respectively.
+-- This leaves us only (100*5 + 500*4 + 1000*3 + 1000*1) / ln(2) bits
+-- or 1172 bytes, and after rounding up to the block size (128 byte)
+-- we have 1280 bytes plus the header overhead.
+--
+s.index.pk:info().disk.bloom_size
+
 _ = new_reflects()
 _ = new_seeks()
 
-- 
2.11.0