Re: [commits] [tarantool] 02/04: vinyl: introduce bloom filters for partial key lookups

Tarantool development patches archive
 help / color / mirror / Atom feed

From: Vladimir Davydov <vdavydov.dev@gmail.com>
To: Konstantin Osipov <kostja@tarantool.org>,
	tarantool-patches@freelists.org, commits@tarantool.org
Subject: Re: [commits] [tarantool] 02/04: vinyl: introduce bloom filters for partial key lookups
Date: Wed, 28 Mar 2018 14:58:55 +0300	[thread overview]
Message-ID: <20180328115854.hd2iohcrun7h6bui@esperanza> (raw)
In-Reply-To: <20180327210817.GB11829@atlas>

On Wed, Mar 28, 2018 at 12:08:17AM +0300, Konstantin Osipov wrote:
> > +int
> > +tuple_bloom_builder_add(struct tuple_bloom_builder *builder,
> > +			const struct tuple *tuple,
> > +			const struct key_def *key_def,
> > +			uint32_t hashed_parts)
> > +{
> > +	assert(builder->part_count == key_def->part_count);
> > +
> > +	uint32_t h = HASH_SEED;
> > +	uint32_t carry = 0;
> > +	uint32_t total_size = 0;
> > +
> > +	for (uint32_t i = 0; i < key_def->part_count; i++) {
> > +		total_size += tuple_hash_key_part(&h, &carry, tuple,
> > +						  &key_def->parts[i]);
> 
> Is there a faster way to iterate over key parts?
> tuple_hash_key_part() involves tuple_field(). Are you assuming
> tuple_field() always has an offset map available?

Yes, as this is a key part. I can, of course, implement the optimization
done by tuple_hash_slowpath(), but I don't think it's worth it in this
particular case.

> 
> > +		if (i < hashed_parts)
> > +			continue;
> 
> Ugh, looks at first glance that we're invoking a hash function
> and iterating over key parts only to find out the total size. 
> Then I see that you also calculate the hash value and carry as
> a side effect. If we can reuse hashed_parts, can't we reuse the
> hash of the hashed parts and the total size?

I guess we could, but I'm not sure it's worth the complexity it would
introduce. After all, this code is only called from a worker thread, no
point to over-optimize.

> 
> > +		struct tuple_hash_array *hash_arr = &builder->parts[i];
> > +		if (hash_arr->count >= hash_arr->capacity) {
> > +			uint32_t capacity = MAX(hash_arr->capacity * 2, 1024U);
> > +			uint32_t *values = realloc(hash_arr->values,
> > +						   capacity * sizeof(*values));
> > +			if (values == NULL) {
> > +				diag_set(OutOfMemory, capacity * sizeof(*values),
> > +					 "malloc", "tuple hash array");
> > +				return -1;
> > +			}
> > +			hash_arr->capacity = capacity;
> > +			hash_arr->values = values;
> 
> Sounds like we need an array data structure after all...

Yes, there are several places where we have to dynamically grow
an array of entries.

> Although I'd use a linked list of 4k blocks - I purposefully do not allow
> introducing arrays into the source code to avoid inefficient
> applications like this one. 

True, in this particular case it isn't a problem to use a list of blocks
instead of a plain array. I'll rework.

> 
> > +		}
> > +		uint32_t hash = PMurHash32_Result(h, carry, total_size);
> > +		hash_arr->values[hash_arr->count++] = hash;
> > +	}
> > +	return 0;
> > +}
> > +
> > +struct tuple_bloom *
> > +tuple_bloom_new(struct tuple_bloom_builder *builder, double fpr)
> > +{
> > +	uint32_t part_count = builder->part_count;
> > +	size_t size = sizeof(struct tuple_bloom) +
> > +		part_count * sizeof(struct bloom);
> > +	struct tuple_bloom *bloom = malloc(size);
> > +	if (bloom == NULL) {
> > +		diag_set(OutOfMemory, size, "malloc", "tuple bloom");
> > +		return NULL;
> > +	}
> > +	for (uint32_t i = 0; i < part_count; i++) {
> > +		struct tuple_hash_array *hash_arr = &builder->parts[i];
> > +		uint32_t count = hash_arr->count;
> > +		if (bloom_create(&bloom->parts[i], count,
> > +				 fpr, runtime.quota) != 0) {
> > +			diag_set(OutOfMemory, 0, "bloom_create",
> > +				 "tuple bloom part");
> > +			for (uint32_t j = 0; j < i; j++)
> > +				bloom_destroy(&bloom->parts[j], runtime.quota);
> > +			free(bloom);
> 
> Why can't use use tuple_bloom_delete() here?
> Looks like you need to introduce tuple_bloom_init() function for
> that.

If you insist.

> 
> > +			return NULL;
> > +		}
> > +		for (uint32_t k = 0; k < count; k++)
> > +			bloom_add(&bloom->parts[i], hash_arr->values[k]);
> > +	}
> > +	bloom->part_count = part_count;
> > +	return bloom;

next prev parent reply	other threads:[~2018-03-28 11:58 UTC|newest]

Thread overview: 6+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2018-03-27 21:08 Konstantin Osipov
     [not found] ` <20180327210938.GC11829@atlas>
2018-03-28 11:16   ` [commits] [tarantool] 03/04: bloom: optimize tuple bloom filter size Vladimir Davydov
2018-03-28 17:03     ` Konstantin Osipov
2018-03-28 17:19       ` Vladimir Davydov
2018-03-28 11:58 ` Vladimir Davydov [this message]
2018-03-28 17:03   ` [commits] [tarantool] 02/04: vinyl: introduce bloom filters for partial key lookups Konstantin Osipov

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20180328115854.hd2iohcrun7h6bui@esperanza \
    --to=vdavydov.dev@gmail.com \
    --cc=commits@tarantool.org \
    --cc=kostja@tarantool.org \
    --cc=tarantool-patches@freelists.org \
    --subject='Re: [commits] [tarantool] 02/04: vinyl: introduce bloom filters for partial key lookups' \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox