From: Vladimir Davydov <vdavydov.dev@gmail.com> To: kostja@tarantool.org Cc: tarantool-patches@freelists.org Subject: [PATCH v2 05/14] bloom: do not use tuple_common_key_parts when constructing tuple bloom Date: Wed, 13 Mar 2019 11:52:51 +0300 [thread overview] Message-ID: <f10f32b2665668ce5dff95344e54fa9096cc8c68.1552464666.git.vdavydov.dev@gmail.com> (raw) In-Reply-To: <cover.1552464666.git.vdavydov.dev@gmail.com> In-Reply-To: <cover.1552464666.git.vdavydov.dev@gmail.com> Tuple bloom filter is an array of bloom filters, each of which reflects lookups by all possible partial keys. To optimize the overall bloom filter size, we need to know how many unique elements there are for each partial key. To achieve that, we require the caller to pass the number of key parts that have been hashed for the given tuple. Here's how it looks in Vinyl: uint32_t hashed_parts = writer->last_stmt == NULL ? 0 : tuple_common_key_parts(stmt, writer->last_stmt, writer->key_def); tuple_bloom_builder_add(writer->bloom, stmt, writer->key_def, hashed_parts); Actually, there's no need in such a requirement as instead we can calculate the hash value for the given tuple, compare it with the hash of the tuple added last time, and add the new hash only if the two values differ. This should be accurate enough while allowing us to get rid of the cumbersome tuple_common_key_parts helper. Note, such a check will only work if tuples are added in the order defined by the key definition, but that already holds - anyway, one wouldn't be able to use tuple_common_key_parts either if it wasn't true. --- src/box/key_def.h | 11 ----------- src/box/tuple_bloom.c | 15 ++++++++------- src/box/tuple_bloom.h | 8 ++++---- src/box/tuple_compare.cc | 33 --------------------------------- src/box/vy_run.c | 10 ++-------- 5 files changed, 14 insertions(+), 63 deletions(-) diff --git a/src/box/key_def.h b/src/box/key_def.h index dd62f6a3..e3a34ce1 100644 --- a/src/box/key_def.h +++ b/src/box/key_def.h @@ -512,17 +512,6 @@ tuple_extract_key_raw(const char *data, const char *data_end, } /** - * Return the length of the longest common prefix of two tuples. - * @param tuple_a first tuple - * @param tuple_b second tuple - * @param key_def key defintion - * @return number of key parts the two tuples have in common - */ -uint32_t -tuple_common_key_parts(const struct tuple *tuple_a, const struct tuple *tuple_b, - struct key_def *key_def); - -/** * Compare keys using the key definition. * @param key_a key parts with MessagePack array header * @param part_count_a the number of parts in the key_a diff --git a/src/box/tuple_bloom.c b/src/box/tuple_bloom.c index cf887c7b..39d218dc 100644 --- a/src/box/tuple_bloom.c +++ b/src/box/tuple_bloom.c @@ -72,8 +72,7 @@ tuple_bloom_builder_delete(struct tuple_bloom_builder *builder) int tuple_bloom_builder_add(struct tuple_bloom_builder *builder, - const struct tuple *tuple, struct key_def *key_def, - uint32_t hashed_parts) + const struct tuple *tuple, struct key_def *key_def) { assert(builder->part_count == key_def->part_count); @@ -82,17 +81,20 @@ tuple_bloom_builder_add(struct tuple_bloom_builder *builder, uint32_t total_size = 0; for (uint32_t i = 0; i < key_def->part_count; i++) { + struct tuple_hash_array *hash_arr = &builder->parts[i]; total_size += tuple_hash_key_part(&h, &carry, tuple, &key_def->parts[i]); - if (i < hashed_parts) { + uint32_t hash = PMurHash32_Result(h, carry, total_size); + if (hash_arr->count > 0 && + hash_arr->values[hash_arr->count - 1] == hash) { /* * This part is already in the bloom, proceed - * to the next one. Note, we can't skip to - * hashed_parts, as we need to compute the hash. + * to the next one. Note, this check only works + * if tuples are added in the order defined by + * the key definition. */ continue; } - struct tuple_hash_array *hash_arr = &builder->parts[i]; if (hash_arr->count >= hash_arr->capacity) { uint32_t capacity = MAX(hash_arr->capacity * 2, 1024U); uint32_t *values = realloc(hash_arr->values, @@ -105,7 +107,6 @@ tuple_bloom_builder_add(struct tuple_bloom_builder *builder, hash_arr->capacity = capacity; hash_arr->values = values; } - uint32_t hash = PMurHash32_Result(h, carry, total_size); hash_arr->values[hash_arr->count++] = hash; } return 0; diff --git a/src/box/tuple_bloom.h b/src/box/tuple_bloom.h index b05fee18..8380801e 100644 --- a/src/box/tuple_bloom.h +++ b/src/box/tuple_bloom.h @@ -111,14 +111,14 @@ tuple_bloom_builder_delete(struct tuple_bloom_builder *builder); * @param builder - bloom filter builder * @param tuple - tuple to add * @param key_def - key definition - * @param hashed_parts - number of key parts that have already - * been added to the builder * @return 0 on success, -1 on OOM + * + * For the bloom size calculation to be accurate, tuples + * must be added in the order defined by the key definition. */ int tuple_bloom_builder_add(struct tuple_bloom_builder *builder, - const struct tuple *tuple, struct key_def *key_def, - uint32_t hashed_parts); + const struct tuple *tuple, struct key_def *key_def); /** * Create a new tuple bloom filter. diff --git a/src/box/tuple_compare.cc b/src/box/tuple_compare.cc index cf4519cc..dfe30b19 100644 --- a/src/box/tuple_compare.cc +++ b/src/box/tuple_compare.cc @@ -425,39 +425,6 @@ tuple_compare_field_with_hint(const char *field_a, enum mp_type a_type, } } -uint32_t -tuple_common_key_parts(const struct tuple *tuple_a, const struct tuple *tuple_b, - struct key_def *key_def) -{ - uint32_t i; - struct tuple_format *tuple_a_format = tuple_format(tuple_a); - struct tuple_format *tuple_b_format = tuple_format(tuple_b); - const char *tuple_a_raw = tuple_data(tuple_a); - const char *tuple_b_raw = tuple_data(tuple_b); - const uint32_t *tuple_a_field_map = tuple_field_map(tuple_a); - const uint32_t *tuple_b_field_map = tuple_field_map(tuple_b); - for (i = 0; i < key_def->part_count; i++) { - struct key_part *part = (struct key_part *)&key_def->parts[i]; - const char *field_a = - tuple_field_raw_by_part(tuple_a_format, tuple_a_raw, - tuple_a_field_map, part); - const char *field_b = - tuple_field_raw_by_part(tuple_b_format, tuple_b_raw, - tuple_b_field_map, part); - enum mp_type a_type = field_a != NULL ? - mp_typeof(*field_a) : MP_NIL; - enum mp_type b_type = field_b != NULL ? - mp_typeof(*field_b) : MP_NIL; - if (a_type == MP_NIL && b_type == MP_NIL) - continue; - if (a_type == MP_NIL || b_type == MP_NIL || - tuple_compare_field_with_hint(field_a, a_type, - field_b, b_type, part->type, part->coll) != 0) - break; - } - return i; -} - template<bool is_nullable, bool has_optional_parts, bool has_json_paths> static inline int tuple_compare_slowpath(const struct tuple *tuple_a, const struct tuple *tuple_b, diff --git a/src/box/vy_run.c b/src/box/vy_run.c index 3c096b0a..e2123fe7 100644 --- a/src/box/vy_run.c +++ b/src/box/vy_run.c @@ -2178,11 +2178,8 @@ static int vy_run_writer_write_to_page(struct vy_run_writer *writer, struct tuple *stmt) { if (writer->bloom != NULL) { - uint32_t hashed_parts = writer->last_stmt == NULL ? 0 : - tuple_common_key_parts(stmt, writer->last_stmt, - writer->key_def); tuple_bloom_builder_add(writer->bloom, stmt, - writer->key_def, hashed_parts); + writer->key_def); } if (writer->last_stmt != NULL) vy_stmt_unref_if_possible(writer->last_stmt); @@ -2409,11 +2406,8 @@ vy_run_rebuild_index(struct vy_run *run, const char *dir, if (tuple == NULL) goto close_err; if (bloom_builder != NULL) { - uint32_t hashed_parts = prev_tuple == NULL ? 0 : - tuple_common_key_parts(prev_tuple, - tuple, key_def); tuple_bloom_builder_add(bloom_builder, tuple, - key_def, hashed_parts); + key_def); } key = tuple_extract_key(tuple, cmp_def, NULL); if (prev_tuple != NULL) -- 2.11.0
next prev parent reply other threads:[~2019-03-13 8:52 UTC|newest] Thread overview: 34+ messages / expand[flat|nested] mbox.gz Atom feed top 2019-03-13 8:52 [PATCH v2 00/14] vinyl: do not fill secondary tuples with nulls Vladimir Davydov 2019-03-13 8:52 ` [PATCH v2 01/14] vinyl: remove optimized comparators Vladimir Davydov 2019-03-13 8:55 ` Konstantin Osipov 2019-03-13 8:52 ` [PATCH v2 02/14] vinyl: introduce statement environment Vladimir Davydov 2019-03-13 8:58 ` Konstantin Osipov 2019-03-13 9:19 ` Vladimir Davydov 2019-03-13 8:52 ` [PATCH v2 03/14] vinyl: rename key stmt construction routine Vladimir Davydov 2019-03-13 8:59 ` Konstantin Osipov 2019-03-13 8:52 ` [PATCH v2 04/14] vinyl: don't use IPROTO_SELECT type for key statements Vladimir Davydov 2019-03-13 9:00 ` Konstantin Osipov 2019-03-13 8:52 ` Vladimir Davydov [this message] 2019-03-13 11:52 ` [PATCH v2 05/14] bloom: do not use tuple_common_key_parts when constructing tuple bloom Konstantin Osipov 2019-03-13 8:52 ` [PATCH v2 06/14] bloom: factor out helper to add tuple hash to bloom builder Vladimir Davydov 2019-03-13 11:52 ` Konstantin Osipov 2019-03-13 8:52 ` [PATCH v2 07/14] vinyl: add helpers to add/check statement with bloom Vladimir Davydov 2019-03-13 11:59 ` Konstantin Osipov 2019-03-13 12:25 ` Vladimir Davydov 2019-03-13 8:52 ` [PATCH v2 08/14] vinyl: do not pass format to vy_apply_upsert Vladimir Davydov 2019-03-13 12:00 ` Konstantin Osipov 2019-03-13 8:52 ` [PATCH v2 09/14] vinyl: clean up write iterator source destruction Vladimir Davydov 2019-03-13 12:05 ` Konstantin Osipov 2019-03-13 8:52 ` [PATCH v2 10/14] vinyl: zap vy_write_iterator->format Vladimir Davydov 2019-03-13 12:06 ` Konstantin Osipov 2019-03-13 8:52 ` [PATCH v2 11/14] vinyl: do not fill secondary tuples with nulls when decoded Vladimir Davydov 2019-03-13 12:25 ` Konstantin Osipov 2019-03-13 12:45 ` Vladimir Davydov 2019-03-13 12:56 ` Konstantin Osipov 2019-03-13 8:52 ` [PATCH v2 12/14] vinyl: zap vy_stmt_new_surrogate_from_key Vladimir Davydov 2019-03-13 12:27 ` Konstantin Osipov 2019-03-13 8:52 ` [PATCH v2 13/14] vinyl: don't use vy_stmt_new_surrogate_delete if not necessary Vladimir Davydov 2019-03-13 12:28 ` Konstantin Osipov 2019-03-13 8:53 ` [PATCH v2 14/14] tuple_format: zap min_tuple_size Vladimir Davydov 2019-03-13 12:28 ` Konstantin Osipov 2019-03-13 15:54 ` [PATCH v2 00/14] vinyl: do not fill secondary tuples with nulls Vladimir Davydov
Reply instructions: You may reply publicly to this message via plain-text email using any one of the following methods: * Save the following mbox file, import it into your mail client, and reply-to-all from there: mbox Avoid top-posting and favor interleaved quoting: https://en.wikipedia.org/wiki/Posting_style#Interleaved_style * Reply using the --to, --cc, and --in-reply-to switches of git-send-email(1): git send-email \ --in-reply-to=f10f32b2665668ce5dff95344e54fa9096cc8c68.1552464666.git.vdavydov.dev@gmail.com \ --to=vdavydov.dev@gmail.com \ --cc=kostja@tarantool.org \ --cc=tarantool-patches@freelists.org \ --subject='Re: [PATCH v2 05/14] bloom: do not use tuple_common_key_parts when constructing tuple bloom' \ /path/to/YOUR_REPLY https://kernel.org/pub/software/scm/git/docs/git-send-email.html * If your mail client supports setting the In-Reply-To header via mailto: links, try the mailto: link
This is a public inbox, see mirroring instructions for how to clone and mirror all data and code used for this inbox