From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Date: Wed, 28 Mar 2018 14:58:55 +0300 From: Vladimir Davydov Subject: Re: [commits] [tarantool] 02/04: vinyl: introduce bloom filters for partial key lookups Message-ID: <20180328115854.hd2iohcrun7h6bui@esperanza> References: <20180327210817.GB11829@atlas> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <20180327210817.GB11829@atlas> To: Konstantin Osipov , tarantool-patches@freelists.org, commits@tarantool.org List-ID: On Wed, Mar 28, 2018 at 12:08:17AM +0300, Konstantin Osipov wrote: > > +int > > +tuple_bloom_builder_add(struct tuple_bloom_builder *builder, > > + const struct tuple *tuple, > > + const struct key_def *key_def, > > + uint32_t hashed_parts) > > +{ > > + assert(builder->part_count == key_def->part_count); > > + > > + uint32_t h = HASH_SEED; > > + uint32_t carry = 0; > > + uint32_t total_size = 0; > > + > > + for (uint32_t i = 0; i < key_def->part_count; i++) { > > + total_size += tuple_hash_key_part(&h, &carry, tuple, > > + &key_def->parts[i]); > > Is there a faster way to iterate over key parts? > tuple_hash_key_part() involves tuple_field(). Are you assuming > tuple_field() always has an offset map available? Yes, as this is a key part. I can, of course, implement the optimization done by tuple_hash_slowpath(), but I don't think it's worth it in this particular case. > > > + if (i < hashed_parts) > > + continue; > > Ugh, looks at first glance that we're invoking a hash function > and iterating over key parts only to find out the total size. > Then I see that you also calculate the hash value and carry as > a side effect. If we can reuse hashed_parts, can't we reuse the > hash of the hashed parts and the total size? I guess we could, but I'm not sure it's worth the complexity it would introduce. After all, this code is only called from a worker thread, no point to over-optimize. > > > + struct tuple_hash_array *hash_arr = &builder->parts[i]; > > + if (hash_arr->count >= hash_arr->capacity) { > > + uint32_t capacity = MAX(hash_arr->capacity * 2, 1024U); > > + uint32_t *values = realloc(hash_arr->values, > > + capacity * sizeof(*values)); > > + if (values == NULL) { > > + diag_set(OutOfMemory, capacity * sizeof(*values), > > + "malloc", "tuple hash array"); > > + return -1; > > + } > > + hash_arr->capacity = capacity; > > + hash_arr->values = values; > > Sounds like we need an array data structure after all... Yes, there are several places where we have to dynamically grow an array of entries. > Although I'd use a linked list of 4k blocks - I purposefully do not allow > introducing arrays into the source code to avoid inefficient > applications like this one. True, in this particular case it isn't a problem to use a list of blocks instead of a plain array. I'll rework. > > > + } > > + uint32_t hash = PMurHash32_Result(h, carry, total_size); > > + hash_arr->values[hash_arr->count++] = hash; > > + } > > + return 0; > > +} > > + > > +struct tuple_bloom * > > +tuple_bloom_new(struct tuple_bloom_builder *builder, double fpr) > > +{ > > + uint32_t part_count = builder->part_count; > > + size_t size = sizeof(struct tuple_bloom) + > > + part_count * sizeof(struct bloom); > > + struct tuple_bloom *bloom = malloc(size); > > + if (bloom == NULL) { > > + diag_set(OutOfMemory, size, "malloc", "tuple bloom"); > > + return NULL; > > + } > > + for (uint32_t i = 0; i < part_count; i++) { > > + struct tuple_hash_array *hash_arr = &builder->parts[i]; > > + uint32_t count = hash_arr->count; > > + if (bloom_create(&bloom->parts[i], count, > > + fpr, runtime.quota) != 0) { > > + diag_set(OutOfMemory, 0, "bloom_create", > > + "tuple bloom part"); > > + for (uint32_t j = 0; j < i; j++) > > + bloom_destroy(&bloom->parts[j], runtime.quota); > > + free(bloom); > > Why can't use use tuple_bloom_delete() here? > Looks like you need to introduce tuple_bloom_init() function for > that. If you insist. > > > + return NULL; > > + } > > + for (uint32_t k = 0; k < count; k++) > > + bloom_add(&bloom->parts[i], hash_arr->values[k]); > > + } > > + bloom->part_count = part_count; > > + return bloom;