[tarantool-patches] [PATCH] vinyl: allow to build secondary index for non-empty space

Vladimir Davydov vdavydov.dev at gmail.com
Thu May 24 10:46:03 MSK 2018


On Mon, May 21, 2018 at 06:18:18PM +0300, Vladislav Shpilevoy wrote:
> > +static int
> > +vy_build_insert_tuple(struct vy_env *env, struct vy_lsm *lsm,
> > +		      const char *space_name, const char *index_name,
> > +		      struct tuple_format *new_format, struct tuple *tuple)
> > +{
> >   	/*
> > -	 * During local recovery we are loading existing indexes
> > -	 * from disk, not building new ones.
> > -	 */
> > -	if (env->status != VINYL_INITIAL_RECOVERY_LOCAL &&
> > -	    env->status != VINYL_FINAL_RECOVERY_LOCAL) {
> > -		if (pk->stat.disk.count.rows != 0 ||
> > -		    pk->stat.memory.count.rows != 0) {
> > -			diag_set(ClientError, ER_UNSUPPORTED, "Vinyl",
> > -				 "building an index for a non-empty space");
> > +	 * Use the last LSN known at this time, because the read
> > +	 * iterator guarantees that this is the newest tuple version.
> > +	 */
> > +	int64_t lsn = env->xm->lsn;
> > +	struct vy_mem *mem = lsm->mem;
> > +
> > +	/* Check the tuple against the new space format. */
> > +	if (tuple_validate(new_format, tuple) != 0)
> > +		return -1;
> > +
> > +	/* Check unique constraint if necessary. */
> > +	if (lsm->check_is_unique &&
> > +	    (!lsm->key_def->is_nullable ||
> > +	     !vy_tuple_key_contains_null(tuple, lsm->key_def))) {

...

> 1. This code is identical to vy_secondary_check_is_unique() introduced by
> me in the review-fixes commit except for one thing - it uses the committed
> read view and does not have a transaction. Maybe you can figure out how to
> reuse vy_secondary_check_is_unique() here.

I'll look into it.
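
FWIW, one way to share the code could be to make the check take an
explicit read view, so that the DML path passes the tx read view and the
build path passes the committed one. A very rough sketch below - the name,
the signature and the vy_lsm_get() lookup are just stand-ins for
illustration, not the actual helper from your commit:

static int
vy_check_is_unique_rv(struct vy_lsm *lsm, const struct vy_read_view **rv,
		      const char *space_name, const char *index_name,
		      struct tuple *tuple)
{
	/*
	 * Same conditions as in the patch above: skip the check if
	 * the index doesn't need it or if the key contains a null
	 * field in a nullable part.
	 */
	if (!lsm->check_is_unique)
		return 0;
	if (lsm->key_def->is_nullable &&
	    vy_tuple_key_contains_null(tuple, lsm->key_def))
		return 0;
	struct tuple *found;
	/* "Look up the key at the given read view"; the real lookup
	 * helper may differ. */
	if (vy_lsm_get(lsm, rv, tuple, &found) != 0)
		return -1;
	if (found != NULL) {
		tuple_unref(found);
		diag_set(ClientError, ER_TUPLE_FOUND,
			 index_name, space_name);
		return -1;
	}
	return 0;
}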

> 
> >   	}
> > +	/* Reallocate the new tuple using the new space format. */
> > +	uint32_t data_len;
> > +	const char *data = tuple_data_range(tuple, &data_len);
> > +	struct tuple *stmt = vy_stmt_new_replace(new_format, data,
> > +						 data + data_len);
> > +	if (stmt == NULL)
> > +		return -1;
> > +
> > +	/* Insert the new tuple into the in-memory index. */
> > +	size_t mem_used_before = lsregion_used(&env->mem_env.allocator);
> > +	const struct tuple *region_stmt = vy_stmt_dup_lsregion(stmt,
> > +				&mem->env->allocator, mem->generation);
> > +	tuple_unref(stmt);
> 
> 2. I do not like this temporary stmt needed only to reuse
> vy_stmt_dup_lsregion(). Here you can create the region tuple with no
> intermediate tuple on malloc.

I did this simply because there's no helper to create an arbitrary
statement on lsregion, only one to duplicate an existing statement. I
agree that reallocating a statement is bad as it might slow down the
rebuild. I'll see what we can do about it.
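
To illustrate, with such a helper (vy_stmt_new_replace_lsregion() doesn't
exist yet - the name and signature are made up; it would format the
statement straight into lsregion memory) the call site could look like:

	uint32_t data_len;
	const char *data = tuple_data_range(tuple, &data_len);
	/* Build a REPLACE statement directly on the lsregion
	 * allocator, no intermediate malloc-ed tuple. */
	const struct tuple *region_stmt =
		vy_stmt_new_replace_lsregion(&mem->env->allocator,
					     mem->generation, new_format,
					     data, data + data_len);
	if (region_stmt == NULL)
		return -1;
	vy_stmt_set_lsn((struct tuple *)region_stmt, lsn);
	if (vy_mem_insert(mem, region_stmt) != 0)
		return -1;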

> 
> 3. This code is very similar to vy_lsm_set - it too calls
> vy_stmt_dup_lsregion() and vy_mem_insert(). Maybe there is a way to reuse
> vy_lsm_set?

Not sure: vy_lsm_set() has some logic that doesn't pertain to this
particular case of mem rebuild, namely reusing region_stmt, upsert
squashing, and checking space format.

> 
> > +	if (region_stmt == NULL)
> > +		return -1;
> > +	vy_stmt_set_lsn((struct tuple *)region_stmt, lsn);
> > +	if (vy_mem_insert(mem, region_stmt) != 0)
> > +		return -1;
> > +
> > +	mem->min_lsn = MIN(mem->min_lsn, lsn);
> > +	mem->max_lsn = MAX(mem->max_lsn, lsn);
> 
> 4. How about moving this into vy_mem_insert? It is the second place where
> min/max_lsn are updated right after vy_mem_insert.

The main use case of vy_mem_insert is inserting prepared statements,
which haven't been assigned an LSN yet, so I don't think we can do that.
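
If the duplication bothers you, the most I'd do is add a thin wrapper for
the "LSN already assigned" case rather than change vy_mem_insert() itself,
along these lines (the name is made up):

/*
 * Insert a statement whose LSN has already been assigned (the index
 * build path), as opposed to prepared statements, whose LSN is only
 * known at commit time.
 */
static int
vy_mem_insert_committed(struct vy_mem *mem, const struct tuple *stmt)
{
	int64_t lsn = vy_stmt_lsn(stmt);
	if (vy_mem_insert(mem, stmt) != 0)
		return -1;
	mem->min_lsn = MIN(mem->min_lsn, lsn);
	mem->max_lsn = MAX(mem->max_lsn, lsn);
	return 0;
}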

> 
> > +	vy_stmt_counter_acct_tuple(&lsm->stat.memory.count, region_stmt);
> > +
> > +	size_t mem_used_after = lsregion_used(&env->mem_env.allocator);
> > +	assert(mem_used_after >= mem_used_before);
> > +	vy_quota_force_use(&env->quota, mem_used_after - mem_used_before);
> > +	vy_quota_wait(&env->quota);
> > +	return 0;
> > +}
> > +
> > +/**
> > + * Recover a single statement that was inserted into the space
> > + * while the newly built index was dumped to disk.
> > + */
> > +static int
> > +vy_build_recover_mem_stmt(struct vy_lsm *lsm, struct vy_lsm *pk,
> > +			  const struct tuple *mem_stmt)

...

> 5. I propose this diff (I did not check that it works, but it looks more
> readable and compact, in my opinion):

Since you're not the only one who has complained about this piece of code,
I'll try to rewrite it, keeping your diff in mind. Thanks.

> > +		/*
> > +		 * Read iterator yields only when it reads runs.
> > +		 * Yield periodically in order not to stall the
> > +		 * tx thread in case there are a lot of tuples in
> > +		 * mems or cache.
> > +		 */
> > +		if (++loops % VY_YIELD_LOOPS == 0)
> > +			fiber_sleep(0);
> 
> 6. Why not fiber_yield() ?

fiber_yield() just yields execution to other ready fibers, but it doesn't
give the event loop a chance to poll for events, so we use fiber_sleep(0)
in order not to block the tx thread while doing a job that might take a
long time.
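
Schematically, it's the usual pattern for long-running jobs on the tx
thread (has_more_work()/do_some_work() are just placeholders for the read
iterator loop):

	/*
	 * fiber_sleep(0) reschedules the fiber through the event
	 * loop, so libev gets a chance to poll for I/O between
	 * iterations, while fiber_yield() would only switch to
	 * another ready fiber without entering the event loop.
	 */
	int loops = 0;
	while (has_more_work()) {
		do_some_work();
		if (++loops % VY_YIELD_LOOPS == 0)
			fiber_sleep(0);
	}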

> 
> > +		if (ctx.is_failed) {
> > +			diag_move(&ctx.diag, diag_get());
> > +			rc = -1;
> > +			break;
> > +		}
> > +	}
> > +	vy_read_iterator_close(&itr);
> > +	tuple_unref(key);
> > +
> > +	/*
> > +	 * Dump the new index upon build completion so that we don't
> > +	 * have to rebuild it on recovery.
> > +	 */
> > +	if (rc == 0)
> > +		rc = vy_scheduler_dump(&env->scheduler);
> > +
> > +	if (rc == 0 && ctx.is_failed) {
> 
> 7. How is it possible? If ctx.is_failed, then rc is set to -1 above.

vy_scheduler_dump() yields, which opens a time window for the on_replace
trigger to set ctx.is_failed.

> 
> > +		diag_move(&ctx.diag, diag_get());
> > +		rc = -1;
> > +	}
> > +
> > +	diag_destroy(&ctx.diag);
> > +	trigger_clear(&on_replace);
> > +	return rc;
> >   }
> >   static size_t
> > diff --git a/src/box/vy_lsm.c b/src/box/vy_lsm.c
> > index 4dfc0a25..63e18594 100644
> > --- a/src/box/vy_lsm.c
> > +++ b/src/box/vy_lsm.c
> > @@ -743,11 +751,20 @@ int
> >   vy_lsm_set(struct vy_lsm *lsm, struct vy_mem *mem,
> >   	   const struct tuple *stmt, const struct tuple **region_stmt)
> >   {
> > +	uint32_t format_id = stmt->format_id;
> > +
> >   	assert(vy_stmt_is_refable(stmt));
> >   	assert(*region_stmt == NULL || !vy_stmt_is_refable(*region_stmt));
> > -	/* Allocate region_stmt on demand. */
> > -	if (*region_stmt == NULL) {
> > +	/*
> > +	 * Allocate region_stmt on demand.
> > +	 *
> > +	 * Also, reallocate region_stmt if it uses a different tuple
> > +	 * format. This may happen during ALTER, when the LSM tree
> > +	 * that is currently being built uses the new space format
> > +	 * while other LSM trees still use the old space format.
> > +	 */
> > +	if (*region_stmt == NULL || (*region_stmt)->format_id != format_id) {
> 
> 8. Do you really need to reset region_stmt in such a case? See this case:
> vy_lsm_set(pk), vy_lsm_set(old_sk1), vy_lsm_set(new_sk), vy_lsm_set(old_sk2)
> 
> Here you create 3 region statements: one for pk and old_sk1; a new one for
> new_sk with the new format; and yet another one for old_sk2, because
> *region_stmt now has the new format, while old_sk2 still uses the old one.
> 
> You can dup stmt on the region when a format mismatch is detected, but not
> reset *region_stmt. That way only 2 region statements are created for the
> case above. Am I right?

Yes, you are. This would make the code somewhat more cumbersome, which is
why I decided to ignore it. Since you've noticed it, I'll fix it.
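
Untested sketch of what I have in mind (upsert handling omitted): on a
format mismatch dup a separate region statement for this LSM tree only,
without overwriting the cached one, so that the remaining old-format
indexes can still reuse it:

	const struct tuple *insert_stmt = *region_stmt;
	if (insert_stmt == NULL || insert_stmt->format_id != format_id) {
		const struct tuple *dup =
			vy_stmt_dup_lsregion(stmt, &mem->env->allocator,
					     mem->generation);
		if (dup == NULL)
			return -1;
		/*
		 * Cache only the first duplicate; a statement with a
		 * mismatching format is used for this LSM tree only.
		 */
		if (insert_stmt == NULL)
			*region_stmt = dup;
		insert_stmt = dup;
	}
	return vy_mem_insert(mem, insert_stmt);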

> 
> >   		*region_stmt = vy_stmt_dup_lsregion(stmt, &mem->env->allocator,
> >   						    mem->generation);
> >   		if (*region_stmt == NULL)
> > diff --git a/src/box/vy_scheduler.h b/src/box/vy_scheduler.h
> > index 4db86152..7dbd6a96 100644
> > --- a/src/box/vy_scheduler.h
> > +++ b/src/box/vy_scheduler.h
> > @@ -202,6 +202,13 @@ void
> >   vy_scheduler_trigger_dump(struct vy_scheduler *scheduler);
> >   /**
> > + * Trigger dump of all currently existing in-memory trees
> > + * and wait until it is complete. Returns 0 on success.
> > + */
> > +int
> > +vy_scheduler_dump(struct vy_scheduler *scheduler);
> 
> 9. Do you really need this at the end of sk building? Why can't you leave
> the mems as they are? They will be dumped eventually anyway.

Yes, we need it for the sake of recovery. The space may have some data
stored on disk at the time we start building a new index for it. In order
not to reread the disk and rebuild the index on recovery, we have to make
sure the new index is dumped to disk as soon as the build has completed.


